As deep learning models evolve, new applications and challenges are rapidly emerging. Tasks that once relied on a single modality, such as text, images, or audio, are now enriched by seamless interactions across modalities. These connections bridge information gaps: an image can visually ground a text, while audio can add context to an image. Researchers have developed numerous multimodal models, but most rely on resource-intensive training across multiple modalities. Similarly, extending these models to new languages often follows the same resource-heavy training strategy. In this work, we propose a multimodal and multilingual architecture, CACARA, trained through emergent alignment learning, which enables the seamless integration of new modalities into an existing bimodal or multimodal model without requiring full retraining. This work breaks new ground by demonstrating that this emergent alignment paradigm can unlock multilingual capabilities from monolingual training. By fine-tuning the newly incorporated modality only on data aligned with English, our model develops support for over 100 languages without explicit multilingual pretraining or tuning of the text encoder. These emergent multimodal and multilingual properties are gained efficiently, preserving previously learned knowledge at a training cost comparable to that of a monolingual model. Our strategy achieves an improvement of up to 14.24 percentage points in R@1 audio-to-text retrieval, outperforming state-of-the-art multimodal models, all without the heavy computational cost of retraining across every modality and language.
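To make the idea concrete, the following is a minimal sketch, not CACARA's actual implementation, of how a new modality can be aligned to an existing model without full retraining: a small audio encoder is fine-tuned with a symmetric contrastive loss against a frozen, pretrained text encoder using only English-captioned audio. All module names, dimensions, and the toy data are illustrative assumptions; the real system's encoders and training recipe differ.

```python
# Sketch of one-sided "emergent alignment": only the new audio encoder is trained,
# while the existing text encoder stays frozen. Assumes PyTorch; all shapes/names
# are hypothetical placeholders, not the paper's architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenTextEncoder(nn.Module):
    """Stand-in for a pretrained text encoder; its weights are never updated."""
    def __init__(self, vocab_size=30000, embed_dim=512):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim)  # mean-pools tokens
        for p in self.parameters():
            p.requires_grad = False  # alignment is one-sided: no gradient here

    def forward(self, token_ids):
        return F.normalize(self.embedding(token_ids), dim=-1)

class AudioEncoder(nn.Module):
    """New-modality encoder: the only component that receives gradients."""
    def __init__(self, n_mels=64, embed_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels, 1024), nn.GELU(), nn.Linear(1024, embed_dim)
        )

    def forward(self, mel_features):
        # mel_features: (batch, time, n_mels) -> project, then mean-pool over time
        return F.normalize(self.net(mel_features).mean(dim=1), dim=-1)

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over in-batch audio/text pairs."""
    logits = audio_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    text_enc, audio_enc = FrozenTextEncoder(), AudioEncoder()
    opt = torch.optim.AdamW(audio_enc.parameters(), lr=1e-4)  # only the new encoder trains

    # Toy batch: 8 audio clips paired with English captions (random stand-in data).
    mel = torch.randn(8, 100, 64)
    tokens = torch.randint(0, 30000, (8, 16))

    loss = contrastive_loss(audio_enc(mel), text_enc(tokens))
    loss.backward()
    opt.step()
    print(f"alignment loss: {loss.item():.4f}")
```

Because the text encoder is frozen, any cross-lingual structure it already captures is inherited by the audio embeddings for free, which is the intuition behind the multilingual capability emerging from English-only fine-tuning.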