OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding

3 July 2025

Ramchalam Kinattinkara Ramakrishnan

Zhaocong Yuan

Shaojie Zhuo

Chen Feng

Yicheng Lin

Chenzheng Su

Xiaopeng Zhang


Main: 9 Pages

7 Figures

Bibliography: 3 Pages

13 Tables

Appendix: 7 Pages

Abstract

Speculative decoding generally requires a small, efficient draft model that is either pretrained or distilled offline for a particular target model series, for instance, Llama or Qwen models. However, in online deployment settings, there are two major challenges: 1) the target model may be incompatible with the draft model; 2) latency is expected to improve with usage and over time. In this work, we propose OmniDraft, a unified framework that enables a single draft model to operate with any target model and adapt dynamically to user data. We introduce an online n-gram cache with hybrid distillation fine-tuning to address the cross-vocabulary mismatch between draft and target models, and further improve decoding speed by leveraging adaptive drafting techniques. OmniDraft is particularly suitable for on-device LLM applications where model cost, efficiency and user customization are the major concerns. This further highlights the need to tackle the above challenges and motivates the "one drafter for all" paradigm. We demonstrate the OmniDraft framework by performing online learning on math reasoning, coding and text generation tasks. Notably, OmniDraft enables a single Llama-68M model to pair with various target models, including Vicuna-7B, Qwen2-7B and Llama3-8B, for speculative decoding, and additionally provides up to 1.5-2x speedup.
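To make the cross-vocabulary problem concrete, here is a toy sketch of what an n-gram cache between two tokenizers could look like. It is an illustration under stated assumptions, not the paper's implementation: the class name `NGramCache`, its methods, and the string-concatenation matching rule are all hypothetical, and tokens are modeled as plain strings rather than vocabulary ids. The idea shown is only that several consecutive draft tokens may spell out a single target token, and that such pairs can be recorded online and reused.

```python
from typing import Dict, List, Tuple


class NGramCache:
    """Toy map from draft-token n-grams to single target tokens (hypothetical)."""

    def __init__(self, max_n: int = 3) -> None:
        self.max_n = max_n
        self.table: Dict[Tuple[str, ...], str] = {}

    def observe(self, draft_tokens: List[str], target_token: str) -> None:
        """Store the n-gram only if its pieces concatenate to the target token."""
        if 1 < len(draft_tokens) <= self.max_n and "".join(draft_tokens) == target_token:
            self.table[tuple(draft_tokens)] = target_token

    def translate(self, draft_tokens: List[str]) -> List[str]:
        """Greedily replace the longest cached n-gram at each position."""
        out: List[str] = []
        i = 0
        while i < len(draft_tokens):
            longest = min(self.max_n, len(draft_tokens) - i)
            for n in range(longest, 1, -1):
                key = tuple(draft_tokens[i:i + n])
                if key in self.table:
                    out.append(self.table[key])
                    i += n
                    break
            else:
                # No cached n-gram starts here: pass the token through unchanged.
                out.append(draft_tokens[i])
                i += 1
        return out


cache = NGramCache(max_n=3)
cache.observe(["spec", "ulative"], "speculative")
merged = cache.translate(["a", "spec", "ulative", "x"])
```

In this sketch, `merged` holds the draft sequence re-expressed with the merged token, so a target-side verifier would see one proposal where the drafter emitted two. How the actual framework populates, scores, and trains against such a cache is described in the paper itself.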


@article{ramakrishnan2025_2507.02659,
  title={OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding},
  author={Ramchalam Kinattinkara Ramakrishnan and Zhaocong Yuan and Shaojie Zhuo and Chen Feng and Yicheng Lin and Chenzheng Su and Xiaopeng Zhang},
  journal={arXiv preprint arXiv:2507.02659},
  year={2025}
}
