Many data sets (e.g., reviews, forums, news) exist in parallel in multiple languages. They cover the same content, but the linguistic differences make it impossible to use traditional, bag-of-words-based topic models. Models have to be either single-language or suffer from a huge, yet extremely sparse, vocabulary. Both issues can be addressed by transfer learning. In this paper, we introduce a zero-shot cross-lingual topic model. Our model learns topics in one language (here, English) and predicts them for unseen documents in different languages (here, Italian, French, German, and Portuguese). We evaluate the quality of the topic predictions for the same document in different languages. Our results show that the transferred topics are coherent and stable across languages, suggesting exciting future research directions.
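To make the zero-shot transfer idea concrete, the following is a minimal sketch, not the paper's actual model: it assumes a multilingual sentence encoder (the `paraphrase-multilingual-MiniLM-L12-v2` model from sentence-transformers is an arbitrary choice) and uses KMeans as a stand-in for the topic model, so that topic assignments fitted on English embeddings can be predicted for unseen Italian documents that share the same embedding space.

```python
# Hedged sketch of zero-shot cross-lingual topic assignment.
# Assumptions: a multilingual sentence encoder provides a shared embedding
# space; KMeans is only a toy substitute for the actual topic model.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Training time: only English documents are observed.
english_docs = [
    "The match ended with a late goal in extra time.",
    "The central bank raised interest rates again.",
]
topic_model = KMeans(n_clusters=2, n_init=10, random_state=0)
topic_model.fit(encoder.encode(english_docs))

# Test time (zero-shot): the same content in an unseen language (Italian)
# is embedded into the shared space and mapped to the topics learned on English.
italian_docs = [
    "La partita si è conclusa con un gol nei tempi supplementari.",
    "La banca centrale ha alzato di nuovo i tassi di interesse.",
]
print(topic_model.predict(encoder.encode(italian_docs)))
```

Because both languages are encoded into the same space, parallel documents should receive the same topic assignment, which is the stability property the paper evaluates across languages.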