ResearchTrend.AI
Oasis: One Image is All You Need for Multimodal Instruction Data Synthesis

11 March 2025
Letian Zhang
Quan Cui
Bingchen Zhao
Cheng Yang
    MLLM
    SyDa
Abstract

The success of multi-modal large language models (MLLMs) has been largely attributed to large-scale training data. However, the training data of many MLLMs is unavailable due to privacy concerns, and the expensive, labor-intensive process of collecting multi-modal data further exacerbates the problem. Is it possible to synthesize multi-modal training data automatically without compromising diversity and quality? In this paper, we propose a new method, Oasis, to synthesize high-quality multi-modal data from images alone. Oasis breaks with traditional methods by prompting MLLMs with only images, extending data diversity by a large margin. Our method also features a delicate quality-control mechanism that ensures data quality. We collected over 500k samples and conducted incremental experiments on LLaVA-NeXT. Extensive experiments demonstrate that our method can significantly improve the performance of MLLMs. Image-based synthesis also allows us to focus on the domain-specific abilities of MLLMs. Code and dataset are publicly available at this https URL.
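The image-only synthesis loop the abstract describes might be sketched as below. This is a minimal illustration, not the paper's actual pipeline: the `mllm` callable, the prompt wording, and the length-based quality filter are all assumptions standing in for the real model calls and the "delicate quality control" the authors describe.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class InstructionSample:
    image_id: str
    instruction: str
    response: str


def synthesize(
    image_ids: List[str],
    mllm: Callable[[str, str], str],  # (image_id, prompt) -> model output
    min_response_words: int = 8,
) -> List[InstructionSample]:
    """Image-only synthesis: for each image, have the MLLM invent an
    instruction, answer it, and keep the pair only if it passes a filter."""
    kept: List[InstructionSample] = []
    for image_id in image_ids:
        # Step 1: the model sees only the image and proposes an instruction
        # (no seed instruction is supplied, which is the source of diversity).
        instruction = mllm(
            image_id, "Propose one question a user might ask about this image."
        )
        # Step 2: the model answers its own instruction.
        response = mllm(image_id, instruction)
        # Step 3: toy quality control -- drop degenerate (too-short) responses.
        if len(response.split()) >= min_response_words:
            kept.append(InstructionSample(image_id, instruction, response))
    return kept
```

In practice `mllm` would wrap a real multi-modal model endpoint, and the filter would be replaced by the method's own quality-control stage.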

@article{zhang2025_2503.08741,
  title={Oasis: One Image is All You Need for Multimodal Instruction Data Synthesis},
  author={Letian Zhang and Quan Cui and Bingchen Zhao and Cheng Yang},
  journal={arXiv preprint arXiv:2503.08741},
  year={2025}
}