Time awareness is a fundamental capability of omni large language models, especially for understanding long videos and answering complex questions. Previous approaches mainly target vision-language scenarios and focus on explicit temporal grounding questions, such as identifying when a visual event occurs or determining what event happens at a specific time. However, they often make insufficient use of the audio modality and overlook implicit temporal grounding across modalities--for example, identifying what is visually present when a character speaks, or determining what is said when a visual event occurs--despite such cross-modal temporal relations being prevalent in real-world scenarios. In this paper, we propose ChronusOmni, an omni large language model designed to enhance temporal awareness for both explicit and implicit audiovisual temporal grounding. First, we interleave text-based timestamp tokens with visual and audio representations at each time unit, enabling unified temporal modeling across modalities. Second, to enforce correct temporal ordering and strengthen fine-grained temporal reasoning, we incorporate reinforcement learning with specially designed reward functions. Moreover, we construct ChronusAV, a temporally accurate, modality-complete, and cross-modally aligned dataset supporting training and evaluation on the audiovisual temporal grounding task. Experimental results demonstrate that ChronusOmni achieves state-of-the-art performance on ChronusAV, with more than a 30% improvement, and top results on most metrics across other temporal grounding benchmarks. These results highlight the strong temporal awareness of our model across modalities, while preserving its general video and audio understanding capabilities.
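The two training ideas named above (interleaving timestamp tokens with per-unit features, and rewarding correct temporal ordering) can be illustrated with a minimal sketch. All names below (`interleave_timestamps`, `temporal_order_reward`, the `<k.0s>` token format) are illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch: interleave text timestamp tokens with per-time-unit
# visual and audio representations, yielding a unified multimodal sequence
# of the form <0.0s> V_0 A_0 <1.0s> V_1 A_1 ...
def interleave_timestamps(visual_feats, audio_feats, seconds_per_unit=1.0):
    """Build one token sequence with a timestamp token before each unit."""
    assert len(visual_feats) == len(audio_feats)
    sequence = []
    for k, (v, a) in enumerate(zip(visual_feats, audio_feats)):
        sequence.append(f"<{k * seconds_per_unit:.1f}s>")  # text timestamp token
        sequence.append(v)  # visual representation for this time unit
        sequence.append(a)  # audio representation for this time unit
    return sequence

# Toy stand-in for a temporal-ordering reward: the fraction of predicted
# (start, end) spans whose start does not exceed its end. A real reward
# would also score overlap with ground-truth spans.
def temporal_order_reward(pred_spans):
    well_ordered = sum(1 for start, end in pred_spans if start <= end)
    return well_ordered / max(len(pred_spans), 1)
```

For example, `interleave_timestamps(["V0", "V1"], ["A0", "A1"])` yields `["<0.0s>", "V0", "A0", "<1.0s>", "V1", "A1"]`, and a model predicting one inverted span out of two would receive a reward of 0.5.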