We propose a multimodal singing language classification model that uses both audio content and textual metadata. The proposed model, LRID-Net, takes an audio signal and a language probability vector estimated from the metadata, and outputs the probabilities of the target languages. Optionally, LRID-Net is equipped with modality dropout to handle missing modalities. In our experiments, we trained several LRID-Nets with varying modality dropout configurations and tested them with various combinations of input modalities. The results demonstrate that using multimodal input improves performance. They also suggest that adopting modality dropout does not degrade the model's performance when all modality inputs are present, while enabling the model to handle missing-modality cases to some extent.
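To illustrate the modality-dropout idea described above, the following is a minimal PyTorch sketch of a two-branch fusion classifier that randomly zeroes out one modality's embedding during training. All module names, dimensions, and the specific dropout scheme are illustrative assumptions, not the actual LRID-Net implementation.

```python
# Minimal sketch of modality dropout in a two-branch fusion model.
# Assumptions (not from the paper): embedding sizes, branch structure,
# and dropping at most one modality per forward pass.
import torch
import torch.nn as nn


class ModalityDropoutFusion(nn.Module):
    def __init__(self, audio_dim=128, meta_dim=10, n_languages=10, p_drop=0.3):
        super().__init__()
        self.audio_branch = nn.Sequential(nn.Linear(audio_dim, 64), nn.ReLU())
        self.meta_branch = nn.Sequential(nn.Linear(meta_dim, 64), nn.ReLU())
        self.classifier = nn.Linear(64 + 64, n_languages)
        self.p_drop = p_drop  # probability of dropping a modality during training

    def forward(self, audio_feat, meta_probs):
        h_audio = self.audio_branch(audio_feat)
        h_meta = self.meta_branch(meta_probs)
        if self.training:
            # Zero out at most one modality's embedding per forward pass,
            # so the model learns to cope with a missing modality at test time.
            if torch.rand(1).item() < self.p_drop:
                h_audio = torch.zeros_like(h_audio)
            elif torch.rand(1).item() < self.p_drop:
                h_meta = torch.zeros_like(h_meta)
        return self.classifier(torch.cat([h_audio, h_meta], dim=-1))


# Usage: a batch of pooled audio embeddings and metadata-derived
# language probability vectors, both hypothetical shapes.
model = ModalityDropoutFusion()
audio = torch.randn(8, 128)
meta = torch.softmax(torch.randn(8, 10), dim=-1)
logits = model(audio, meta)  # shape (8, n_languages)
```

At inference time, a missing modality can be represented by the same zero vector used during training, which is what lets a dropout-trained model degrade gracefully rather than fail outright.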