LLM Inference Serving: Survey of Recent Advances and Opportunities

17 July 2024
Baolin Li, Yankai Jiang, V. Gadepally, Devesh Tiwari
arXiv:2407.12391

Papers citing "LLM Inference Serving: Survey of Recent Advances and Opportunities"

18 papers shown

Federated Attention: A Distributed Paradigm for Collaborative LLM Inference over Edge Networks
Xiumei Deng, Zehui Xiong, Binbin Chen, Dong In Kim, Mérouane Debbah, H. Vincent Poor
04 Nov 2025

Reasoning Language Model Inference Serving Unveiled: An Empirical Study
Qi Li, Junpan Wu, Xiang Liu, Yuxin Wang, Z. Li, Zhenheng Tang, Yuhan Chen, Shaohuai Shi, Xiaowen Chu
21 Oct 2025

Make a Video Call with LLM: A Measurement Campaign over Five Mainstream Apps
Jiayang Xu, Xiangjie Huang, Zijie Li, Zili Meng
01 Oct 2025

Large Reasoning Models Learn Better Alignment from Flawed Thinking
ShengYun Peng, Eric Michael Smith, Ivan Evtimov, Song Jiang, Pin-Yu Chen, Hongyuan Zhan, Haozhu Wang, Duen Horng Chau, Mahesh Pasupuleti, Jianfeng Chi
01 Oct 2025

Prompt-Aware Scheduling for Low-Latency LLM Serving
Yiheng Tao, Yihe Zhang, M. Dearing, Xin Wang, Yuping Fan, Z. Lan
25 Sep 2025

HyperFlexis: Joint Design of Algorithms and Systems for Multi-SLO Serving and Fast Scaling
Zahra Yousefijamarani, Xinglu Wang, Qian Wang, Morgan Lindsay Heisler, Taha Shabani, ..., Xiaolong Bai, Jiannan Wang, Ying Xiong, Yong Zhang, Zhenan Fan
21 Aug 2025

Towards Efficient Multi-LLM Inference: Characterization and Analysis of LLM Routing and Hierarchical Techniques
Adarsh Prasad Behera, J. Champati, Roberto Morabito, Sasu Tarkoma, J. Gross
06 Jun 2025

Shape it Up! Restoring LLM Safety during Finetuning
ShengYun Peng, Pin-Yu Chen, Jianfeng Chi, Seongmin Lee, Duen Horng Chau
22 May 2025

Unveiling the Landscape of LLM Deployment in the Wild: An Empirical Study
Xinyi Hou, Jiahao Han, Yanjie Zhao, Haoyu Wang
05 May 2025

Intelligent Orchestration of Distributed Large Foundation Model Inference at the Edge
Fernando Koch, Aladin Djuhera, Alecio Binotto
19 Mar 2025

Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference
IEEE International Conference on Cloud Computing (CLOUD), 2025
Pol G. Recasens, Ferran Agullo, Yue Zhu, Chen Wang, Eun Kyung Lee, Olivier Tardieu, Jordi Torres, Josep Ll. Berral
11 Mar 2025

Efficient Algorithms for Verifying Kruskal Rank in Sparse Linear Regression and Related Applications
Fengqin Zhou
06 Mar 2025

From Cool Demos to Production-Ready FMware: Core Challenges and a Technology Roadmap
Gopi Krishnan Rajbahadur, G. Oliva, Dayi Lin, Ahmed E. Hassan
28 Jan 2025

Reducing the Cost of Dropout in Flash-Attention by Hiding RNG with GEMM
Haiyue Ma, Jian Liu, Ronny Krashinsky
10 Oct 2024

RouteLLM: Learning to Route LLMs with Preference Data
Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M. W. Kadous, Ion Stoica
26 Jun 2024

vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention
Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar
07 May 2024

FlexLLM: Token-Level Co-Serving of LLM Inference and Finetuning with SLO Guarantees
Xupeng Miao, Xinhao Cheng, Vineeth Kada, Mengdi Wu, ..., April Yang, Yingcheng Wang, Colin Unger, Zhihao Jia
29 Feb 2024

Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
International Conference on Learning Representations (ICLR), 2024
Keisuke Kamahori, Tian Tang, Yile Gu, Kan Zhu, Baris Kasikci
10 Feb 2024