CLaSp: In-Context Layer Skip for Self-Speculative Decoding. Annual Meeting of the Association for Computational Linguistics (ACL), 2025 |
PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System. International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2025 |
Accelerating Retrieval-Augmented Generation. International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024 |
BATON: Enhancing Batch-wise Inference Efficiency for Large Language Models via Dynamic Re-batching. The Web Conference (WWW), 2024 |
POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference. International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024 |
A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models. IEEE Circuits and Systems Magazine (IEEE CSM), 2024 |
Geometric Collaborative Filtering with Convergence. International Conference on Artificial Intelligence and Statistics (AISTATS), 2024 |