Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2404.03353
Cited By
Towards Pareto Optimal Throughput in Small Language Model Serving
4 April 2024
Pol G. Recasens
Yue Zhu
Chen Wang
Eun Kyung Lee
Olivier Tardieu
Alaa Youssef
Jordi Torres
Josep Ll. Berral
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Towards Pareto Optimal Throughput in Small Language Model Serving"
4 / 4 papers shown
Title
From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs
Yaxiong Wu
Sheng Liang
Chen Zhang
Y. Wang
Y. Zhang
Huifeng Guo
Ruiming Tang
Y. Liu
KELM
36
0
0
22 Apr 2025
Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference
Pol G. Recasens
Ferran Agullo
Yue Zhu
Chen Wang
Eun Kyung Lee
Olivier Tardieu
Jordi Torres
Josep Ll. Berral
41
0
0
11 Mar 2025
FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
Ying Sheng
Lianmin Zheng
Binhang Yuan
Zhuohan Li
Max Ryabinin
...
Joseph E. Gonzalez
Percy Liang
Christopher Ré
Ion Stoica
Ce Zhang
144
365
0
13 Mar 2023
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
M. Shoeybi
M. Patwary
Raul Puri
P. LeGresley
Jared Casper
Bryan Catanzaro
MoE
243
1,791
0
17 Sep 2019
1