The growth of cross-lingual pre-trained models has enabled NLP tools to rapidly generalize to new languages. While these models have been applied to tasks involving entities, their ability to explicitly predict typological features of these entities across languages has not been established. In this paper, we present a unified cross-lingual fine-grained entity typing model capable of handling over 100 languages and analyze this model's ability to generalize to languages and entities unseen during training. We train this model on cross-lingual training data collected from Wikipedia hyperlinks in multiple languages (training languages). During inference, our model takes an entity mention and context in a particular language (test language, possibly not among the training languages) and predicts fine-grained types for that entity. Generalizing to new languages and to unseen entities is the fundamental challenge of this entity typing setup, so we focus our evaluation on these settings and compare against simple yet powerful string match baselines. Experimental results show that our approach outperforms the baselines on unseen languages such as Japanese, Tamil, Arabic, Serbian, and Persian. In addition, our approach substantially improves performance on unseen entities (even in unseen languages) over the baselines, and human evaluation shows a strong ability to predict relevant types in these settings.
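To make the inference setup concrete, below is a minimal sketch (not the paper's exact architecture) of cross-lingual fine-grained entity typing: a multilingual encoder (XLM-R is an assumed choice here) encodes the context, the mention span is pooled into a single vector, and a multi-label head scores a fixed inventory of fine-grained types. The class name, the toy type inventory size, and the Japanese usage example are all hypothetical.

```python
# Minimal sketch of mention-and-context fine-grained typing with a
# multilingual encoder; an illustrative assumption, not the paper's model.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class CrossLingualEntityTyper(nn.Module):
    def __init__(self, num_types: int, encoder_name: str = "xlm-roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # Multi-label classifier: one independent logit per fine-grained type.
        self.type_scorer = nn.Linear(hidden, num_types)

    def forward(self, input_ids, attention_mask, mention_mask):
        # mention_mask: 1 for subword positions inside the entity mention, else 0.
        hidden_states = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # Mean-pool the mention span into a single mention representation.
        mask = mention_mask.unsqueeze(-1).float()
        mention_repr = (hidden_states * mask).sum(1) / mask.sum(1).clamp(min=1e-6)
        return self.type_scorer(mention_repr)  # raw logits, one per type

# Hypothetical usage on a test language unseen during training (Japanese),
# with the mention 東京 (Tokyo) in context.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = CrossLingualEntityTyper(num_types=5)  # toy type inventory
enc = tokenizer("東京 は 日本 の 首都 です", return_tensors="pt")
mention_mask = torch.zeros_like(enc["input_ids"])
mention_mask[0, 1] = 1  # assume the mention occupies the first content subword
logits = model(enc["input_ids"], enc["attention_mask"], mention_mask)
predicted_types = torch.sigmoid(logits) > 0.5  # independent per-type decisions
```

Because the encoder is shared across all languages, the same forward pass serves any test language, including ones absent from the training languages.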