
OmniBench: Towards The Future of Universal Omni-Language Models

Yizhi Li
Ge Zhang
Yinghao Ma
Ruibin Yuan
Kang Zhu
Hangyu Guo
Yiming Liang
Jiaheng Liu
Zekun Wang
Jian Yang
Siwei Wu
Xingwei Qu
Jinjie Shi
Xinyue Zhang
Zhenzhu Yang
Xiangzhou Wang
Zhaoxiang Zhang
Zachary Liu
Emmanouil Benetos
Wenhao Huang
Chenghua Lin
Abstract

Recent advancements in multimodal large language models (MLLMs) have focused on integrating multiple modalities, yet their ability to simultaneously process and reason across different inputs remains underexplored. We introduce OmniBench, a novel benchmark designed to evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. We define language models capable of such tri-modal processing as omni-language models (OLMs). OmniBench features high-quality human annotations that require integrated understanding across all modalities. Our evaluation reveals that: i) open-source OLMs show significant limitations in instruction-following and reasoning in tri-modal contexts; and ii) most baseline models perform poorly (around 50% accuracy) even when given textual alternatives to the image/audio inputs. To address these limitations, we develop OmniInstruct, a 96K-sample instruction-tuning dataset for training OLMs. We advocate for developing more robust tri-modal integration techniques and training strategies to enhance OLM performance. Code and data can be found at our repo (this https URL).
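
To make the evaluation setting concrete, below is a minimal sketch of how a tri-modal multiple-choice benchmark of this kind can be scored. The item layout (field names such as image_path, audio_path, options, and answer_index) and the model_predict callback are illustrative placeholders, not the released OmniBench schema or API.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class TriModalItem:
    # Hypothetical sample layout: an image, an audio clip, and a
    # multiple-choice question whose answer requires both modalities.
    image_path: str
    audio_path: str
    question: str
    options: List[str]
    answer_index: int  # index of the correct option

def evaluate(model_predict: Callable[[TriModalItem], int],
             items: List[TriModalItem]) -> float:
    # model_predict is assumed to return the index of the option the
    # model selects for a given item; accuracy is the fraction correct.
    if not items:
        return 0.0
    correct = sum(1 for item in items
                  if model_predict(item) == item.answer_index)
    return correct / len(items)

Under this framing, the roughly 50% accuracy reported for baselines can be compared directly against the random-guess rate implied by the number of answer options per item.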

@article{li2025_2409.15272,
  title={OmniBench: Towards The Future of Universal Omni-Language Models},
  author={Yizhi Li and Ge Zhang and Yinghao Ma and Ruibin Yuan and Kang Zhu and Hangyu Guo and Yiming Liang and Jiaheng Liu and Zekun Wang and Jian Yang and Siwei Wu and Xingwei Qu and Jinjie Shi and Xinyue Zhang and Zhenzhu Yang and Xiangzhou Wang and Zhaoxiang Zhang and Zachary Liu and Emmanouil Benetos and Wenhao Huang and Chenghua Lin},
  journal={arXiv preprint arXiv:2409.15272},
  year={2025}
}