v1v2 (latest)

From Deferral to Learning: Online In-Context Knowledge Distillation for LLM Cascades

26 September 2025

Main:11 Pages

7 Figures

Bibliography:4 Pages

25 Tables

Appendix:17 Pages

Abstract

Standard LLM cascades improve efficiency by deferring difficult queries from weak to strong models. However, these systems are typically static: when faced with repeated or semantically similar queries, they redundantly consult the expensive model, failing to adapt during inference. To address this, we propose Inter-Cascade, an online, interactive framework that transforms the strong model from a temporary helper into a long-term teacher. In our approach, when the strong model resolves a deferred query, it generates a generalized, reusable problem-solving strategy. These strategies are stored in a dynamic repository and retrieved via similarity matching to augment the weak model's context for future queries. This enables the weak model to learn on the job without expensive parameter fine-tuning. We theoretically show that this mechanism improves the weak model's confidence calibration. Empirically, Inter-Cascade outperforms standard cascades on multiple benchmarks, improving weak model and overall system accuracy by up to 33.06 percent and 6.35 percent, while reducing strong model calls by up to 48.05 percent and saving fee by up to 49.63 percent. Inter-Cascade demonstrates effective in-context knowledge transfer between LLMs and provides a general, scalable framework applicable to both open-source and API-based LLMs.

View on arXiv

Comments on this paper