
ORPO-Distill: Mixed-Policy Preference Optimization for Cross-Architecture LLM Distillation

Main: 4 pages
1 figure
Bibliography: 1 page
2 tables
Abstract

We introduce ORPO-Distill, a general-purpose method for cross-architecture LLM distillation that formulates the problem as a preference optimization task. Unlike standard CoT distillation, the approach transfers knowledge through diverse reasoning traces. It employs an Odds-Ratio Preference Optimization objective that contrasts teacher and student traces for more effective learning, and adopts a mixed-policy strategy for utilizing student-generated outputs, outperforming both off- and on-policy alternatives. Experiments on five datasets and multiple student models show consistent improvements over conventional black-box KD baselines.
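For reference, the ORPO objective (Hong et al., 2024) combines a standard negative log-likelihood term on the preferred response with an odds-ratio preference term that pushes the model's odds of the preferred response above those of the dispreferred one. The abstract suggests ORPO-Distill draws this contrast between teacher traces (preferred) and student traces (dispreferred). The sketch below is an illustrative PyTorch implementation under that assumption; the function and argument names (orpo_distill_loss, lam, and the mask conventions) are hypothetical and not taken from the paper.

    import torch
    import torch.nn.functional as F

    def sequence_log_prob(logits, labels, mask):
        # Mean per-token log-probability of the labeled tokens
        # (mask zeros out prompt and padding positions).
        logp = torch.log_softmax(logits, dim=-1)
        token_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
        return (token_logp * mask).sum(-1) / mask.sum(-1)

    def orpo_distill_loss(chosen_logits, chosen_labels, chosen_mask,
                          rejected_logits, rejected_labels, rejected_mask,
                          lam=0.1):
        # Student log-likelihoods of the teacher-preferred and
        # student-dispreferred traces.
        logp_w = sequence_log_prob(chosen_logits, chosen_labels, chosen_mask)
        logp_l = sequence_log_prob(rejected_logits, rejected_labels, rejected_mask)

        # log-odds: log(p / (1 - p)), computed stably from log p.
        log_odds_w = logp_w - torch.log1p(-torch.exp(logp_w).clamp(max=1 - 1e-6))
        log_odds_l = logp_l - torch.log1p(-torch.exp(logp_l).clamp(max=1 - 1e-6))

        # Odds-ratio preference term: favor the preferred trace over the dispreferred one.
        loss_or = -F.logsigmoid(log_odds_w - log_odds_l).mean()

        # Standard NLL (SFT) term on the preferred trace.
        loss_sft = -logp_w.mean()

        return loss_sft + lam * loss_or

In a mixed-policy setup, the dispreferred traces would be drawn partly from the student's own generations during training rather than from a fixed offline pool, but how ORPO-Distill schedules that mixture is detailed in the paper itself, not in this sketch.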
