Nexus: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision

Che Liu, Yingji Zhang, Dong Zhang, Weijie Zhang, Chenggong Gong, Yu Lu, Shilin Zhou, Ziliang Gan, Ziao Wang, Haipang Wu, Ji Liu, André Freitas, Qifan Wang, Zenglin Xu, Rongjuncheng Zhang, Yong Dai

Source: arXiv. Project: https://github.com/HiThink-Research/NEXUS-O

This work proposes an industry-level omni-modal large language model (LLM) pipeline that integrates auditory, visual, and linguistic modalities to overcome challenges such as limited tri-modal datasets, high computational costs, and complex feature alignment. Our pipeline consists of three main components. First, a modular framework enables flexible configuration of various encoder-LLM-decoder architectures. Second, a lightweight training strategy pre-trains audio-language alignment on the state-of-the-art vision-language model Qwen2.5-VL, thus avoiding the costly pre-training of vision-specific modalities. Third, an audio synthesis pipeline generates high-quality audio-text data from diverse real-world scenarios, supporting applications such as Automatic Speech Recognition (ASR) and Speech-to-Speech chat. Building on this pipeline, we introduce an industry-level omni-modal LLM, Nexus. Extensive experiments validate the efficacy of our pipeline, yielding the following key findings: (1) In the visual understanding task, Nexus outperforms its backbone model, Qwen2.5-VL-7B, validating the efficiency of our training strategy. (2) In the English Spoken Question-Answering task, the model achieves better accuracy than a competitor from the same period (i.e., MiniCPM-o2.6-7B) on the LLaMA Q. benchmark. (3) On our real-world ASR test set, Nexus achieves outstanding performance, indicating its robustness in real scenarios. (4) In the Speech-to-Text Translation task, our model outperforms Qwen2-Audio-Instruct-7B. (5) In the Text-to-Speech task, based on a pretrained vocoder (e.g., Fishspeech1.4 or CosyVoice2.0), Nexus is comparable to its backbone vocoder on the Seed-TTS benchmark. (6) An in-depth analysis of tri-modal alignment reveals that incorporating the audio modality enhances representational alignment between vision and language.
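The first component, a modular framework in which encoders, an LLM, and a decoder can be swapped through configuration, might be sketched roughly as follows. This is only an illustration under stated assumptions: the registry, `PipelineConfig`, `build_pipeline`, and all `toy_*` component names are hypothetical and are not taken from the Nexus codebase, and the components are string-returning stand-ins rather than real models.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical registry: component kind -> name -> factory.
# None of these names come from the Nexus project.
REGISTRY: Dict[str, Dict[str, Callable[[], Callable[[str], str]]]] = {
    "encoder": {},
    "llm": {},
    "decoder": {},
}


def register(kind: str, name: str):
    """Decorator that records a component factory under (kind, name)."""
    def wrap(factory):
        REGISTRY[kind][name] = factory
        return factory
    return wrap


# Toy stand-ins: each "model" just wraps its input in a label.
@register("encoder", "toy_audio")
def make_audio_encoder():
    return lambda x: f"audio_feats({x})"


@register("encoder", "toy_vision")
def make_vision_encoder():
    return lambda x: f"vision_feats({x})"


@register("llm", "toy_llm")
def make_llm():
    return lambda x: f"llm({x})"


@register("decoder", "toy_speech")
def make_speech_decoder():
    return lambda x: f"speech({x})"


@dataclass
class PipelineConfig:
    encoders: Dict[str, str]  # modality -> registered encoder name
    llm: str                  # registered LLM name
    decoder: str              # registered decoder name


def build_pipeline(cfg: PipelineConfig):
    """Look up each configured component and chain encoder -> LLM -> decoder."""
    encoders = {m: REGISTRY["encoder"][n]() for m, n in cfg.encoders.items()}
    llm = REGISTRY["llm"][cfg.llm]()
    decoder = REGISTRY["decoder"][cfg.decoder]()

    def run(inputs: Dict[str, str]) -> str:
        # Encode each modality (in sorted modality order), join the features,
        # pass them through the LLM, then through the decoder.
        feats = [encoders[m](inputs[m]) for m in sorted(inputs)]
        return decoder(llm(" + ".join(feats)))

    return run


cfg = PipelineConfig(
    encoders={"audio": "toy_audio", "vision": "toy_vision"},
    llm="toy_llm",
    decoder="toy_speech",
)
run = build_pipeline(cfg)
print(run({"audio": "a.wav", "vision": "img.png"}))
```

In this sketch, changing the architecture means editing only the `PipelineConfig` fields; how the actual framework wires real encoders, Qwen2.5-VL, and a vocoder together is not described in the abstract.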

