Large language models (LLMs) have demonstrated remarkable capabilities in handling complex dialogue tasks without requiring use case-specific fine-tuning. However, analyzing live dialogues in real time demands low-latency processing, which makes deploying models with billions of parameters impractical. Practitioners therefore often prefer smaller models with millions of parameters, trained on high-quality, human-annotated datasets; yet curating such datasets is both time-consuming and costly. Consequently, there is a growing need to combine the scalability of LLM-generated labels with the precision of human annotations, so that fine-tuned smaller models can match the accuracy of larger models at much higher speed. In this paper, we introduce a simple yet effective framework to address this challenge. Our approach is specifically designed for per-utterance classification problems, which encompass tasks such as intent detection and dialogue state tracking. To mitigate the impact of labeling errors from LLMs -- the primary source of inaccuracies in student models -- we propose a noise-reduced preference learning loss. Experimental results demonstrate that our method significantly improves accuracy across utterance-level dialogue tasks, including sentiment detection and dialogue act classification.
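The abstract does not spell out the noise-reduced preference learning loss, but the general idea of training a student on preferences over intra-dialogue utterance pairs can be sketched as below. This is a minimal illustration assuming a Bradley-Terry style logistic loss implemented in PyTorch; the function name `pairwise_preference_loss` and its arguments are hypothetical and not taken from the paper.

```python
import torch
import torch.nn.functional as F


def pairwise_preference_loss(score_preferred, score_rejected, scale=1.0):
    """Logistic pairwise preference loss (illustrative, not the paper's exact loss).

    For each pair of utterances drawn from the same dialogue, the student is
    encouraged to score the teacher-preferred utterance higher than its
    counterpart, rather than fitting possibly noisy hard labels directly.

    score_preferred, score_rejected: 1-D tensors of per-utterance logits,
    aligned so that index i forms one intra-dialogue pair.
    """
    # -log sigmoid(s_pref - s_rej): standard Bradley-Terry style objective
    return -F.logsigmoid(scale * (score_preferred - score_rejected)).mean()


if __name__ == "__main__":
    # Toy usage: student logits for three intra-dialogue utterance pairs.
    preferred = torch.tensor([2.1, 0.3, 1.5])   # utterances the teacher favors
    rejected = torch.tensor([0.4, 0.9, -0.2])   # their intra-dialogue counterparts
    loss = pairwise_preference_loss(preferred, rejected)
    print(f"pairwise preference loss: {loss.item():.4f}")
```

Because the objective depends only on score differences within a dialogue, it is less sensitive to a teacher label being individually wrong than a per-utterance cross-entropy target would be, which is consistent with the noise-reduction motivation stated above.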
@article{liu2025_2503.05620,
  title   = {Learning LLM Preference over Intra-Dialogue Pairs: A Framework for Utterance-level Understandings},
  author  = {Xuanqing Liu and Luyang Kong and Wei Niu and Afshin Khashei and Belinda Zeng and Steve Johnson and Jon Jay and Davor Golac and Matt Pope},
  journal = {arXiv preprint arXiv:2503.05620},
  year    = {2025}
}