Batch data analytics is a growing application for Large Language Models (LLMs). LLMs enable users to perform a wide range of natural language tasks, such as classification, entity extraction, and translation, over large datasets. However, LLM inference is highly costly and slow: for example, an NVIDIA L4 GPU running Llama3-8B can only process 6 KB of text per second, taking about a day to handle 15 GB of data; processing a similar amount of data costs around $10K on OpenAI's GPT-4o. In this paper, we propose novel techniques that can significantly reduce the cost of LLM calls for relational data analytics workloads. Our key contribution is developing efficient algorithms for reordering the rows and the fields within each row of an input table to maximize key-value (KV) cache reuse when performing LLM serving. As such, our approach can be easily applied to existing analytics systems and serving platforms. Our evaluation shows that our solution can yield up to 3.4x improvement in job completion time on a benchmark of diverse LLM-based queries using Llama 3 models. Our solution also achieves a 32% cost savings.
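The core idea of reordering rows and fields to maximize KV-cache reuse can be illustrated with a minimal sketch. This is not the paper's algorithm, only a simple heuristic under assumed conventions: place low-cardinality fields first when serializing each row, then sort rows lexicographically so that consecutive prompts share the longest possible prefixes, which a prefix-caching LLM server can then reuse. The function names and prompt format below are hypothetical.

```python
def reorder_for_prefix_sharing(rows, fields):
    """Reorder fields and rows so consecutive serialized prompts share long prefixes.

    rows: list of dicts mapping field name -> value; fields: list of field names.
    Returns (field_order, ordered_rows). Heuristic sketch, not the paper's algorithm.
    """
    # Put fields with the fewest distinct values first, so the most repetitive
    # content appears earliest in every prompt.
    field_order = sorted(fields, key=lambda f: len({r[f] for r in rows}))
    # Sort rows so that rows with identical leading field values become adjacent,
    # letting a prefix-caching server reuse their shared KV-cache entries.
    ordered_rows = sorted(rows, key=lambda r: tuple(str(r[f]) for f in field_order))
    return field_order, ordered_rows


def serialize(row, field_order, instruction):
    # Shared task instruction first, then fields in the chosen order, so the
    # common prefix of adjacent prompts is as long as possible.
    body = "\n".join(f"{f}: {row[f]}" for f in field_order)
    return f"{instruction}\n{body}"
```

For example, if a table has a `country` column with few distinct values and a `name` column with many, this ordering groups all rows of the same country together, and every prompt for that group shares the instruction plus the `country:` line as a cached prefix.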
@article{liu2025_2403.05821,
  title={Optimizing LLM Queries in Relational Data Analytics Workloads},
  author={Shu Liu and Asim Biswal and Amog Kamsetty and Audrey Cheng and Luis Gaspar Schroeder and Liana Patel and Shiyi Cao and Xiangxi Mo and Ion Stoica and Joseph E. Gonzalez and Matei Zaharia},
  journal={arXiv preprint arXiv:2403.05821},
  year={2025}
}