OTAS: An Elastic Transformer Serving System via Token Adaptation | IEEE Conference on Computer Communications (INFOCOM), 2024
Fairness in Serving Large Language Models | USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2023
Splitwise: Efficient Generative LLM Inference Using Phase Splitting | International Symposium on Computer Architecture (ISCA), 2023
SpotServe: Serving Generative Large Language Models on Preemptible Instances | International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2023
Efficient Memory Management for Large Language Model Serving with PagedAttention | Symposium on Operating Systems Principles (SOSP), 2023
Resource Management for GPT-based Model Deployed on Clouds: Challenges, Solutions, and Future Directions | International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP), 2023