Better Bill GPT: Comparing Large Language Models against Legal Invoice Reviewers

Legal invoice review is a costly, inconsistent, and time-consuming process, traditionally performed by Legal Operations professionals, lawyers, or billing specialists who scrutinise billing compliance line by line. This study presents the first empirical comparison of Large Language Models (LLMs) against three groups of human invoice reviewers (Early-Career Lawyers, Experienced Lawyers, and Legal Operations Professionals), assessing their accuracy, speed, and cost-effectiveness. Benchmarking state-of-the-art LLMs against a ground truth set by expert legal professionals, our findings show that LLMs outperform humans on every metric measured. In invoice approval decisions, LLMs achieve up to 92% accuracy, surpassing the 72% ceiling set by experienced lawyers. At the line-item level, top models reach classification F-scores of 81%, compared to just 43% for the best-performing human group. Speed comparisons are even more striking: while lawyers take 194 to 316 seconds per invoice, LLMs can complete a review in as little as 3.6 seconds. On cost, AI cuts review expenses by 99.97%, reducing invoice processing costs from an average of
@article{whitehouse2025_2504.02881,
  title={Better Bill GPT: Comparing Large Language Models against Legal Invoice Reviewers},
  author={Nick Whitehouse and Nicole Lincoln and Stephanie Yiu and Lizzie Catterson and Rivindu Perera},
  journal={arXiv preprint arXiv:2504.02881},
  year={2025}
}