NEXUS-O: An Omni-Perceptive and -Interactive Model for Language, Audio, and Vision

  • Che Liu
  • Yingji Zhang
  • Dong Zhang
  • Weijie Zhang
  • Chenggong Gong
  • Yu Lu
  • Shilin Zhou
  • Ziliang Gan
  • Ziao Wang
  • Haipang Wu
  • Ji Liu
  • Andre Freitas
  • Qifan Wang
  • Zenglin Xu
  • Rongjunchen Zhang
  • Yong Dai*
  • *Corresponding author for this work

Research output: Chapter in book/report/conference proceeding › Conference proceeding › peer-review

Abstract

Human beings perceive the real world through a spectrum of sensory modalities, encompassing auditory, visual, and linguistic faculties. This work proposes an industry-level omni-modal large language model (LLM) pipeline that integrates auditory, visual, and linguistic modalities to overcome challenges such as limited tri-modal datasets, high computational costs, and complex feature alignment. Our pipeline consists of three main components: first, a modular, end-to-end framework enabling flexible configuration of various encoder-LLM-decoder architectures; second, a lightweight training strategy that pre-trains audio-language alignment on the state-of-the-art vision-language model Qwen2.5-VL, thus avoiding the costly pre-training of vision-specific modalities; third, an audio synthesis pipeline that generates high-quality audio-text data from diverse real-world scenarios, supporting applications such as Automatic Speech Recognition and Speech-to-Speech chat. To this end, we introduce an industry-level omni-modal LLM, NEXUS-O. Extensive experiments validate the efficacy of our pipeline and yield the following key findings: (1) On the visual understanding task, NEXUS-O outperforms its backbone model, Qwen2.5-VL-7B, validating the efficiency of our training strategy. (2) On the English Spoken Question-Answering task, the model achieves higher accuracy than its same-period competitor (i.e., MiniCPM-o2.6-7B) on the LLaMA Q. benchmark. (3) On our real-world ASR test set, NEXUS-O achieves outstanding performance, indicating its robustness in real scenarios. (4) On the Speech-to-Text Translation task, our model outperforms Qwen2-Audio-Instruct-7B. (5) On the Text-to-Speech task, built on a pre-trained vocoder (e.g., Fishspeech1.4 or CosyVoice2.0), NEXUS-O is comparable to its backbone vocoder on the Seed-TTS benchmark. (6) An in-depth analysis of tri-modal alignment reveals that incorporating the audio modality enhances representational alignment between vision and language.
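
The abstract's first component describes a modular, end-to-end encoder-LLM-decoder framework. As a purely illustrative sketch (not taken from the paper or its code), the snippet below shows one way such a modular assembly could be wired in PyTorch, with lightweight projectors mapping audio and vision features into a shared backbone's embedding space; all class names, dimensions, and parameters here are hypothetical assumptions, and the backbone is a tiny placeholder rather than Qwen2.5-VL.

    # Purely illustrative sketch (hypothetical, not the authors' code):
    # interchangeable modality projectors around a shared backbone, in the
    # spirit of the modular encoder-LLM-decoder framework described above.
    from dataclasses import dataclass

    import torch
    import torch.nn as nn


    @dataclass
    class OmniConfig:
        """Hypothetical configuration for one encoder-LLM-decoder assembly."""
        hidden_dim: int = 1024      # shared backbone embedding width
        audio_feat_dim: int = 128   # e.g., mel-spectrogram feature size
        vision_feat_dim: int = 768  # e.g., ViT patch-embedding size


    class ModalityProjector(nn.Module):
        """Maps a modality-specific feature stream into the backbone's embedding space."""
        def __init__(self, in_dim: int, out_dim: int):
            super().__init__()
            self.proj = nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(),
                                      nn.Linear(out_dim, out_dim))

        def forward(self, feats: torch.Tensor) -> torch.Tensor:
            return self.proj(feats)


    class OmniModel(nn.Module):
        """Toy stand-in: swap projectors per modality, keep the backbone fixed
        (here a tiny Transformer encoder as a placeholder for the LLM)."""
        def __init__(self, cfg: OmniConfig):
            super().__init__()
            self.projectors = nn.ModuleDict({
                "audio": ModalityProjector(cfg.audio_feat_dim, cfg.hidden_dim),
                "vision": ModalityProjector(cfg.vision_feat_dim, cfg.hidden_dim),
            })
            layer = nn.TransformerEncoderLayer(d_model=cfg.hidden_dim, nhead=8,
                                               batch_first=True)
            self.backbone = nn.TransformerEncoder(layer, num_layers=2)

        def forward(self, inputs: dict) -> torch.Tensor:
            # Project each modality into the shared space, then concatenate
            # along the sequence axis before feeding the backbone.
            tokens = [self.projectors[name](x) for name, x in inputs.items()]
            return self.backbone(torch.cat(tokens, dim=1))


    if __name__ == "__main__":
        cfg = OmniConfig()
        model = OmniModel(cfg)
        batch = {"audio": torch.randn(2, 50, cfg.audio_feat_dim),
                 "vision": torch.randn(2, 196, cfg.vision_feat_dim)}
        print(model(batch).shape)  # expected: torch.Size([2, 246, 1024])
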
Original language: English
Title of host publication: Proceedings of the 33rd ACM International Conference on Multimedia
Place of Publication: New York
Publisher: Association for Computing Machinery (ACM)
Pages: 10787–10796
Number of pages: 10
ISBN (Electronic): 9798400720352
DOIs
Publication status: Published - 27 Oct 2025
Event: 33rd ACM International Conference on Multimedia, ACMMM25 - Dublin Royal Convention Centre, Dublin, Ireland
Duration: 27 Oct 2025 – 31 Oct 2025
https://whova.com/embedded/event/sa54pNCpHUFy1OTIEiEzceQu5kPuSm3dYlEnqAJdV4o%3D/?utc_source=ems (Conference program)
https://acmmm2025.org/ (Conference website)
https://dl.acm.org/doi/proceedings/10.1145/3746027 (Conference proceedings)

Publication series

Name: Proceedings of the ACM International Conference on Multimedia
Publisher: Association for Computing Machinery

Conference

Conference: 33rd ACM International Conference on Multimedia, ACMMM25
Country/Territory: Ireland
City: Dublin
Period: 27/10/25 – 31/10/25

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 9 - Industry, Innovation, and Infrastructure

User-Defined Keywords

  • Multimodal Large Language Models
  • Omni-modalities
