Nexus-O: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision

26 February 2025
Che Liu
Yingji Zhang
Dong Zhang
Weijie Zhang
Chenggong Gong
Haohan Li
Yu Lu
Shilin Zhou
Yue Lu
Ziliang Gan
Ziao Wang
Junwei Liao
Haipang Wu
Ji Liu
André Freitas
Qifan Wang
Zenglin Xu
Rongjuncheng Zhang
Yong Dai
Abstract

Human beings perceive the real world through a spectrum of sensory modalities, encompassing auditory, visual, and linguistic faculties. The journey towards achieving Artificial General Intelligence (AGI) necessitates the development of models that can emulate these multifaceted perceptual capabilities and comprehensively understand such diverse data. To this end, we introduce Nexus-O, an industry-level omni-perceptive and -interactive model capable of efficiently processing Audio, Image, Video, and Text data in any combination and outputting audio or text in an end-to-end manner. We systematically investigate Nexus-O by addressing three key research questions: First, how can models be efficiently designed and trained to achieve tri-modal alignment, understanding, and reasoning capabilities across multiple modalities? Second, what approaches can be implemented to evaluate tri-modal model robustness, ensuring reliable performance and applicability in real-world scenarios? Third, what strategies can be employed to curate and obtain high-quality, real-life scenario speech datasets? For the first question, we design and pre-train Nexus-O on top of a vision-language model rather than a language model. By pre-training on high-quality synthetic audio data, the model acquires tri-modal perception and interaction capabilities. For the second question, we introduce a new audio testbed, Nexus-O-audio, comprising diverse Automatic Speech Recognition (ASR) samples spanning various real-world scenarios, such as corporate meetings and live streams. For the third question, we design a speech data synthesis pipeline to obtain high-quality speech training datasets covering various real-world scenarios. Comprehensive experimentation and an in-depth analysis of tri-modal alignment in the latent space demonstrate the advantages of our model on downstream tasks.
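The abstract describes evaluating tri-modal robustness on the Nexus-O-audio ASR testbed. As a rough illustration only, the sketch below shows how such an evaluation is typically scored with word error rate (WER) via the standard jiwer package; the model interface (model.transcribe) and the sample iterator are hypothetical placeholders, not the authors' actual API.

import jiwer  # standard WER implementation for ASR evaluation

def evaluate_asr(model, samples):
    """Score a speech model on an ASR testbed.

    samples: iterable of (audio_waveform, reference_transcript) pairs,
    e.g. drawn from a Nexus-O-audio-style benchmark (placeholder loader).
    """
    references, hypotheses = [], []
    for audio, reference in samples:
        # Prompt the model with audio only and ask for a transcription.
        hypothesis = model.transcribe(audio)  # placeholder method name
        references.append(reference)
        hypotheses.append(hypothesis)
    # Aggregate word error rate over the whole testbed (lower is better).
    return jiwer.wer(references, hypotheses)

Reporting a single aggregate WER per scenario (e.g. corporate meetings vs. live streams) is one common way to expose robustness gaps across real-world conditions.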

@article{liu2025_2503.01879,
  title={Nexus-O: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision},
  author={Che Liu and Yingji Zhang and Dong Zhang and Weijie Zhang and Chenggong Gong and Haohan Li and Yu Lu and Shilin Zhou and Yue Lu and Ziliang Gan and Ziao Wang and Junwei Liao and Haipang Wu and Ji Liu and André Freitas and Qifan Wang and Zenglin Xu and Rongjuncheng Zhang and Yong Dai},
  journal={arXiv preprint arXiv:2503.01879},
  year={2025}
}