For a multilingual podcast streaming service, it is critical to deliver relevant content to all users regardless of language. Podcast content relevance is conventionally determined from various metadata sources. However, as the quality of speech recognition improves in many languages, it becomes possible to use automatic transcriptions to provide better content recommendations. In this work, we explore the robustness of a Latent Dirichlet Allocation topic model when applied to transcripts created by an automatic speech recognition engine. Specifically, we explore how increasing transcription noise influences the topics obtained from transcriptions in Danish, a low-resource language. First, we establish a baseline of cosine similarity scores between topic embeddings from the automatic transcriptions and topic embeddings from the podcast descriptions written by the podcast creators. We then observe how the cosine similarities decrease as transcription noise increases, and conclude that even when automatic speech recognition transcripts contain errors, it is still possible to obtain high-quality topic embeddings from them.
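The evaluation loop described above (corrupt the transcript, embed it, compare it with the description by cosine similarity) can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: noise is simulated as random word substitution, a plain word-count vector stands in for the LDA topic embedding, and the names `add_substitution_noise`, `bow_embed`, and `similarity_under_noise` are hypothetical. In a real setup, `bow_embed` would be replaced by inference with a trained topic model.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (e.g. topic distributions)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def add_substitution_noise(tokens, error_rate, vocabulary, rng):
    """Assumed noise model: each token is replaced by a random vocabulary
    word with probability error_rate. Token count is preserved."""
    return [
        rng.choice(vocabulary) if rng.random() < error_rate else tok
        for tok in tokens
    ]


def bow_embed(tokens, vocabulary):
    """Toy stand-in for LDA inference: a word-count vector over the vocabulary."""
    return [tokens.count(word) for word in vocabulary]


def similarity_under_noise(transcript, description, vocabulary, error_rates, seed=0):
    """Return (error_rate, cosine similarity) pairs between the noisy
    transcript embedding and the description embedding."""
    rng = random.Random(seed)
    reference = bow_embed(description, vocabulary)
    results = []
    for rate in error_rates:
        noisy = add_substitution_noise(transcript, rate, vocabulary, rng)
        score = cosine_similarity(bow_embed(noisy, vocabulary), reference)
        results.append((rate, score))
    return results


if __name__ == "__main__":
    # Illustrative Danish tokens; not data from the paper.
    vocabulary = ["fodbold", "kamp", "mål", "politik", "valg", "musik"]
    transcript = ["fodbold", "kamp", "mål", "kamp", "fodbold", "mål"]
    description = ["fodbold", "kamp", "mål"]
    for rate, score in similarity_under_noise(
        transcript, description, vocabulary, [0.0, 0.2, 0.5, 0.8]
    ):
        print(f"error rate {rate:.1f}: cosine similarity {score:.3f}")
```

Passing a fixed `seed` keeps the corruption reproducible, so scores at different error rates can be compared across runs.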