ResearchTrend.AI
Splitwise: Efficient generative LLM inference using phase splitting
30 November 2023
Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, Ricardo Bianchini

Papers citing "Splitwise: Efficient generative LLM inference using phase splitting"

10 / 110 papers shown
DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving
F. Strati, Sara Mcallister, Amar Phanishayee, Jakub Tarnawski, Ana Klimovic
04 Mar 2024
InferCept: Efficient Intercept Support for Augmented Large Language Model Inference
Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, Yiying Zhang
02 Feb 2024
T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives
Suchita Pati, Shaizeen Aga, Mahzabeen Islam, Nuwan Jayasena, Matthew D. Sinclair
30 Jan 2024
Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads
Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, …, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan
20 Jan 2024
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang
18 Jan 2024
Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, Baris Kasikci
29 Oct 2023
Chameleon: a Heterogeneous and Disaggregated Accelerator System for Retrieval-Augmented Language Models
Wenqi Jiang, Marco Zeller, R. Waleffe, Torsten Hoefler, Gustavo Alonso
15 Oct 2023
Optimizing Distributed ML Communication with Fused Computation-Collective Operations
Kishore Punniyamurthy, Khaled Hamidouche, Bradford M. Beckmann
11 May 2023
Fast Distributed Inference Serving for Large Language Models
Bingyang Wu, Yinmin Zhong, Zili Zhang, Gang Huang, Xuanzhe Liu, Xin Jin
10 May 2023
FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, …, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang
13 Mar 2023