We present Wikipedia-based Polyglot Dirichlet Allocation (WikiPDA), a crosslingual topic model that learns to represent Wikipedia articles written in any language as distributions over a common set of language-independent topics. It leverages the fact that Wikipedia articles link to each other and are mapped to concepts in the Wikidata knowledge base, such that, when represented as bags of links, articles are inherently language-independent. WikiPDA works in two steps, by first densifying bags of links using matrix completion and then training a standard monolingual topic model. A human evaluation shows that WikiPDA produces more coherent topics than monolingual text-based LDA, thus offering crosslinguality at no cost. We demonstrate WikiPDA's utility in two applications: a study of topical biases in 28 Wikipedia editions, and crosslingual supervised classification. Finally, we highlight WikiPDA's capacity for zero-shot language transfer, where a model is reused for new languages without any fine-tuning. Researchers can benefit from WikiPDA as a practical tool for studying Wikipedia's content across its 299 language editions in interpretable ways, via an easy-to-use library publicly available at https://github.com/epfl-dlab/WikiPDA.
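To make the two-step pipeline concrete, here is a minimal sketch, not the authors' implementation (see the linked repository for that). It uses scikit-learn's TruncatedSVD as a simple low-rank proxy for the paper's matrix-completion step and LatentDirichletAllocation as the standard topic model; the articles, Wikidata concept IDs, and dimensions are toy placeholders.

```python
# Illustrative sketch of a WikiPDA-style pipeline (NOT the official library).
# Assumptions: scikit-learn stand-ins for matrix completion and the topic
# model; a toy corpus of bags of links over Wikidata concept IDs.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD

# Step 0: represent each article as a bag of links, with each link resolved
# to its language-independent Wikidata concept ID (toy example).
articles = [
    ["Q42", "Q5", "Q36180"],
    ["Q5", "Q36180", "Q36180"],
    ["Q42", "Q145"],
]
concepts = sorted({c for bag in articles for c in bag})
idx = {c: i for i, c in enumerate(concepts)}

# Build the article-by-concept count matrix (sparse in practice).
X = np.zeros((len(articles), len(concepts)))
for row, bag in enumerate(articles):
    for c in bag:
        X[row, idx[c]] += 1

# Step 1: densify the bag-of-links matrix. The paper uses matrix completion;
# a low-rank SVD reconstruction serves as a simple proxy here.
svd = TruncatedSVD(n_components=2, random_state=0)
X_dense = svd.inverse_transform(svd.fit_transform(X))
X_dense = np.clip(X_dense, 0, None)  # LDA requires nonnegative inputs

# Step 2: train a standard topic model on the densified,
# language-independent representations.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_dist = lda.fit_transform(X_dense)

# Each article, regardless of its language, now lives in one shared topic space.
print(topic_dist.round(2))
```

Because the representations are language-independent, the same fitted model can score bag-of-links vectors from a language edition never seen during training, which is what enables the zero-shot language transfer highlighted in the abstract.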