
DiaTool-DPO: Multi-Turn Direct Preference Optimization for Tool-Augmented Large Language Models

Sunghee Jung
Donghun Lee
Shinbok Lee
Gaeun Seo
Daniel Lee
Byeongil Ko
Junrae Cho
Kihyun Kim
Eunggyun Kim
Myeongcheol Shin
Abstract

Tool-Augmented Large Language Models (TA-LLMs) have shown promise in real-world applications, but face challenges in handling incomplete queries and out-of-scope requests. While existing approaches rely mainly on Supervised Fine-Tuning with expert trajectories, we propose DiaTool-DPO, a novel method that enhances the dialogue capabilities of TA-LLMs through Direct Preference Optimization. We model TA-LLM interactions as a Markov Decision Process with 5 distinct dialogue states and categorize user queries into 3 types based on their state transition trajectories. We automatically construct paired trajectory datasets of correct and incorrect dialogue flows and introduce a specialized objective loss for dialogue control. Our comprehensive evaluation demonstrates that DiaTool-DPO approaches GPT-4o's performance (94.8% in information gathering, 91% in tool call rejection), substantially outperforming the baseline (44% and 9.6%, respectively) while maintaining core functionality. Our approach opens new possibilities for developing TA-LLMs that can handle diverse real-world scenarios without requiring additional expert demonstrations or human labeling.
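As a rough illustration of the preference objective described above, the sketch below shows a standard trajectory-level DPO loss in PyTorch. The function name `diatool_dpo_loss`, the default `beta`, and the assumption that log-probabilities are summed over assistant turns of a dialogue are illustrative only; the paper's actual objective adds a specialized term for dialogue control that is not reproduced here.

```python
import torch.nn.functional as F

def diatool_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                     ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Pairwise preference loss over full dialogue trajectories (sketch).

    Each *_logps tensor holds the summed log-probability that the policy
    (or frozen reference model) assigns to the assistant turns of one
    trajectory: "chosen" is the correct dialogue flow (e.g. asking a
    clarifying question for an incomplete query), "rejected" is the
    incorrect flow (e.g. calling the tool with missing arguments).
    """
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    # Standard DPO objective: push the margin between the correct and the
    # incorrect trajectory through a logistic loss, scaled by beta.
    return -F.logsigmoid(beta * (chosen_rewards - rejected_rewards)).mean()
```

In a training loop, the chosen/rejected log-probabilities would come from scoring the automatically constructed paired trajectories with the policy and a frozen reference model, then averaging this loss over the batch.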

@article{jung2025_2504.02882,
  title={DiaTool-DPO: Multi-Turn Direct Preference Optimization for Tool-Augmented Large Language Models},
  author={Sunghee Jung and Donghun Lee and Shinbok Lee and Gaeun Seo and Daniel Lee and Byeongil Ko and Junrae Cho and Kihyun Kim and Eunggyun Kim and Myeongcheol Shin},
  journal={arXiv preprint arXiv:2504.02882},
  year={2025}
}