Vortex: Hosting ML Inference and Knowledge Retrieval Services With Tight Latency and Throughput Requirements

3 November 2025
Yuting Yang, Tiancheng Yuan, Jamal Hashim, Thiago Garrett, Jeffrey Qian, Ann Zhang, Yifan Wang, Weijia Song, Ken Birman
arXiv:2511.02062
Main: 10 pages · 18 figures · Bibliography: 5 pages · Appendix: 4 pages
Abstract

There is growing interest in deploying ML inference and knowledge retrieval as services that support both interactive queries from end users and the more demanding request flows that arise when AIs are integrated into end-user applications or deployed as agents. Our central premise is that these latter cases will bring service-level objectives (SLOs) on latency. Existing ML serving platforms use batching to optimize for high throughput, which exposes them to unpredictable tail latencies. Vortex enables an SLO-first approach. For identical tasks, Vortex's pipelines achieve significantly lower and more stable latencies than TorchServe and Ray Serve over a wide range of workloads, often sustaining a given SLO target at more than twice the request rate. When RDMA is available, Vortex's advantage is even more pronounced.
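To make the batching-versus-tail-latency tradeoff concrete, the sketch below shows a deadline-aware micro-batcher in Python: it grows a batch only while an assumed per-batch cost model says the oldest waiting request can still meet its latency SLO. The request fields, the cost model, and the dispatch policy are illustrative assumptions for this page, not Vortex's actual design.

```python
# Illustrative sketch only: a deadline-aware micro-batching dispatcher.
# The Request fields, estimated_service_ms cost model, and the policy
# below are assumptions chosen to contrast an SLO-first scheduler with
# throughput-first batching; they do not describe Vortex's implementation.

import time
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    payload: object
    arrival: float      # seconds, from time.monotonic()
    slo_ms: float       # end-to-end latency objective for this request


def estimated_service_ms(batch_size: int) -> float:
    # Hypothetical cost model: fixed overhead plus per-item cost.
    return 2.0 + 0.5 * batch_size


def slo_first_dispatch(queue: deque, max_batch: int = 32):
    """Form the largest batch that still lets the oldest request meet its SLO.

    Simplification: the budget is taken from the oldest request only; a real
    scheduler would track the tightest remaining deadline in the batch.
    """
    if not queue:
        return None
    now = time.monotonic()
    oldest = queue[0]
    waited_ms = (now - oldest.arrival) * 1000.0
    budget_ms = oldest.slo_ms - waited_ms   # time left before the SLO is blown

    batch = []
    while queue and len(batch) < max_batch:
        # Grow the batch only if the larger batch still fits the budget.
        if estimated_service_ms(len(batch) + 1) > budget_ms:
            break
        batch.append(queue.popleft())

    # If even a batch of one cannot meet the SLO, send it anyway (best effort).
    if not batch and queue:
        batch.append(queue.popleft())
    return batch


if __name__ == "__main__":
    q = deque()
    now = time.monotonic()
    # Mixed workload: interactive queries (tight SLO) and agentic flows (looser SLO).
    q.extend(Request(f"req-{i}", now, slo_ms=5.0 if i < 3 else 50.0) for i in range(10))
    while q:
        batch = slo_first_dispatch(q)
        print(f"dispatching batch of {len(batch)}")
```

A throughput-first server would instead wait for a full batch or a fixed timeout regardless of deadlines, which is what exposes it to unpredictable tail latencies under bursty load.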
