We present a multilingual bag-of-entities model that effectively boosts the performance of zero-shot cross-lingual text classification by extending a multilingual pre-trained language model (e.g., M-BERT). It leverages the multilingual nature of Wikidata: entities that represent the same concept across languages share a single unique identifier. This enables entities described in multiple languages to be represented using shared embeddings. A model trained on entity features in a resource-rich language can thus be directly applied to other languages. Our experimental results on cross-lingual topic classification (using the MLDoc and TED-CLDC datasets) and entity typing (using the SHINRA2020-ML dataset) show that the proposed model consistently outperforms state-of-the-art models.
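To make the shared-embedding idea concrete, the following is a minimal sketch (not the authors' implementation) of how a bag-of-entities feature keyed by language-independent Wikidata QIDs could be combined with an M-BERT text representation. The class name, embedding dimension, use of mean pooling, and concatenation with the [CLS] vector are illustrative assumptions; the abstract only specifies that entity embeddings are shared across languages via Wikidata identifiers.

```python
# Sketch of a bag-of-entities classifier on top of M-BERT.
# Entity mentions in any language are resolved to Wikidata QIDs,
# so their embeddings are shared across languages and a classifier
# trained on entity features in one language transfers zero-shot.
import torch
import torch.nn as nn
from transformers import AutoModel


class BagOfEntitiesClassifier(nn.Module):
    """Hypothetical combination of M-BERT features with a bag-of-entities
    vector indexed by language-independent Wikidata QIDs."""

    def __init__(self, num_entities: int, num_labels: int,
                 entity_dim: int = 256,
                 encoder_name: str = "bert-base-multilingual-cased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        # One shared embedding per Wikidata QID, regardless of the
        # language the entity was mentioned in.
        self.entity_emb = nn.EmbeddingBag(num_entities, entity_dim, mode="mean")
        hidden = self.encoder.config.hidden_size
        self.classifier = nn.Linear(hidden + entity_dim, num_labels)

    def forward(self, input_ids, attention_mask, entity_ids, entity_offsets):
        # [CLS] representation from the multilingual encoder.
        cls = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state[:, 0]
        # Mean-pool the shared embeddings of the entities detected in
        # each document (entity_offsets marks document boundaries).
        ents = self.entity_emb(entity_ids, entity_offsets)
        return self.classifier(torch.cat([cls, ents], dim=-1))
```

In this sketch, `entity_ids` would hold indices assigned to QIDs detected in each document (e.g., a vocabulary such as {"Q312": 0, "Q95": 1, ...}); because the same QID is used for an entity's mentions in English, German, or Japanese text, a classifier trained only on English documents can consume the same entity features at test time in other languages.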