LongMagpie: A Self-synthesis Method for Generating Large-scale Long-context Instructions

22 May 2025
Chaochen Gao
Xing Wu
Zijia Lin
Debing Zhang
Songlin Hu
Abstract

High-quality long-context instruction data is essential for aligning long-context large language models (LLMs). Despite the public release of models like Qwen and Llama, their long-context instruction data remains proprietary. Human annotation is costly and challenging, while template-based synthesis methods limit scale, diversity, and quality. We introduce LongMagpie, a self-synthesis framework that automatically generates large-scale long-context instruction data. Our key insight is that aligned long-context LLMs, when presented with a document followed by special tokens preceding a user turn, auto-regressively generate contextually relevant queries. By harvesting these document-query pairs and the model's responses, LongMagpie produces high-quality instructions without human effort. Experiments on HELMET, RULER, and LongBench v2 demonstrate that LongMagpie achieves leading performance on long-context tasks while maintaining competitive performance on short-context tasks, establishing it as a simple and effective approach for open, diverse, and scalable long-context instruction data synthesis.
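The self-synthesis loop described in the abstract is easy to see in code. Below is a minimal sketch assuming a Llama-3-style aligned chat model served through Hugging Face transformers; the model name, the header-token strings, and the sampling settings are illustrative assumptions, not the authors' exact recipe.

# Minimal sketch of a LongMagpie-style self-synthesis loop.
# Assumptions (not from the paper): the model name, Llama-3-style
# header tokens, and the sampling hyperparameters.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # assumption: any aligned long-context chat model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

# Assumption: Llama-3-style special tokens that open a user turn;
# other chat templates use different markers.
USER_TURN = "<|start_header_id|>user<|end_header_id|>\n\n"

def synthesize_pair(document: str) -> dict:
    # Step 1: present the document followed by the special tokens that
    # precede a user turn. Continuing from here, the aligned model
    # auto-regressively emits a plausible user query about the document.
    prefix = document + USER_TURN
    inputs = tok(prefix, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=1.0)
    query = tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip()

    # Step 2: answer the harvested query with the document in context,
    # this time using the ordinary chat template and generation prompt.
    chat = [{"role": "user", "content": f"{document}\n\n{query}"}]
    prompt = tok.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
    inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
    out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
    response = tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip()

    return {"document": document, "query": query, "response": response}

Running a loop like synthesize_pair over a large document corpus would yield the document-query-response triples that the paper harvests as instruction data, with no human-written prompts involved.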

@article{gao2025_2505.17134,
  title={LongMagpie: A Self-synthesis Method for Generating Large-scale Long-context Instructions},
  author={Chaochen Gao and Xing Wu and Zijia Lin and Debing Zhang and Songlin Hu},
  journal={arXiv preprint arXiv:2505.17134},
  year={2025}
}