Scaling Up Efficient Small Language Models Serving and Deployment for Semantic Job Search

25 October 2025
Kayhan Behdin
Qingquan Song
Sriram Vasudevan
Jian Sheng
Xiaojing Ma
Zhi-min Zhou
C. Zhu
Guoyao Li
Chanh Nguyen
Sayan Ghosh
Hejian Sang
Ata Fatahi Baarzi
Sundara Raman Ramachandran
Xiaoqing Wang
Qing Lan
V. Sodha
Qi Guo
Caleb Johnson
Zhipeng Wang
Fedor Borisyuk
arXiv:2510.22101 · abs · PDF · HTML · HuggingFace (1 upvote)
Main: 10 pages · 2 figures · 10 tables · Bibliography: 2 pages
Abstract

Large Language Models (LLMs) have demonstrated impressive quality when applied to predictive tasks such as relevance ranking and semantic search. However, deployment of such LLMs remains prohibitively expensive for industry applications with strict latency and throughput requirements. In this work, we present lessons and efficiency insights from developing a purely text-based, decoder-only Small Language Model (SLM) for a semantic search application at LinkedIn. In particular, we discuss model compression techniques such as pruning that allow us to reduce the model size by up to 40% while maintaining accuracy. Additionally, we present context compression techniques that reduce the input context length by up to 10x with minimal loss of accuracy. Finally, we present practical lessons from optimizing the serving infrastructure for deploying such a system on GPUs at scale, serving millions of requests per second. Taken together, these optimizations increase our system's throughput by 10x in a real-world deployment while meeting our quality bar.
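To make the pruning result concrete, here is a minimal, hypothetical sketch of magnitude-based structured pruning of a single transformer MLP block in PyTorch. It is not the authors' method; the layer shapes, keep ratio, and function name are illustrative assumptions only, meant to show the kind of width reduction that model compression of this sort performs.

```python
# Minimal, hypothetical sketch of magnitude-based structured pruning for one
# transformer MLP block. NOT the paper's method; shapes and keep_ratio are
# illustrative assumptions only.
import torch
import torch.nn as nn

def prune_mlp_neurons(fc_in: nn.Linear, fc_out: nn.Linear, keep_ratio: float = 0.6):
    """Keep the highest-magnitude hidden neurons of an fc_in -> fc_out pair."""
    # Score each hidden neuron by the L2 norm of its incoming weight row.
    scores = fc_in.weight.norm(dim=1)                      # shape: (hidden,)
    n_keep = max(1, int(keep_ratio * scores.numel()))
    keep = torch.topk(scores, n_keep).indices.sort().values

    # Rebuild smaller Linear layers containing only the kept neurons.
    new_in = nn.Linear(fc_in.in_features, n_keep, bias=fc_in.bias is not None)
    new_out = nn.Linear(n_keep, fc_out.out_features, bias=fc_out.bias is not None)
    with torch.no_grad():
        new_in.weight.copy_(fc_in.weight[keep])
        if fc_in.bias is not None:
            new_in.bias.copy_(fc_in.bias[keep])
        new_out.weight.copy_(fc_out.weight[:, keep])
        if fc_out.bias is not None:
            new_out.bias.copy_(fc_out.bias)
    return new_in, new_out

# Example: shrink a 4096 -> 11008 -> 4096 MLP to ~60% of its hidden width.
fc1, fc2 = nn.Linear(4096, 11008), nn.Linear(11008, 4096)
fc1_small, fc2_small = prune_mlp_neurons(fc1, fc2, keep_ratio=0.6)
```

In practice, pruned models are typically fine-tuned or distilled afterward to recover accuracy; the paper reports retaining accuracy at up to a 40% size reduction.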
