Relation classification is one of the key tasks in information extraction and can be used to construct knowledge bases or to provide useful information for question answering. Current approaches to relation classification mainly focus on English and require large amounts of human-annotated training data. Creating and annotating such large training sets for low-resource languages is expensive and often impractical. To overcome this issue, we propose two cross-lingual relation classification models: a baseline model based on Multilingual BERT and a new multilingual pretraining setup that uses distant supervision and significantly improves over the baseline. For evaluation, we introduce RELX, a new public benchmark dataset for cross-lingual relation classification in English, French, German, Spanish, and Turkish. We also provide the RELX-Distant dataset, which includes hundreds of thousands of relation-bearing sentences for these languages, collected from Wikipedia and Wikidata by distant supervision. Our code and data are available at: https://github.com/boun-tabi/RELX
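To make the baseline setup concrete, the sketch below shows one common way to fine-tune Multilingual BERT for sentence-level relation classification with entity markers. This is only an illustrative sketch, not the authors' implementation (see the linked repository for the actual code); the marker tokens, label count, and example sentence are assumptions.

```python
# Illustrative sketch: Multilingual BERT as a relation classifier.
# Marker tokens, NUM_RELATIONS, and the example input are assumptions,
# not taken from the RELX codebase.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

MODEL_NAME = "bert-base-multilingual-cased"
ENTITY_MARKERS = ["<e1>", "</e1>", "<e2>", "</e2>"]  # assumed marker scheme
NUM_RELATIONS = 18  # placeholder; set to the dataset's relation count

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.add_special_tokens({"additional_special_tokens": ENTITY_MARKERS})

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_RELATIONS
)
model.resize_token_embeddings(len(tokenizer))  # account for the new marker tokens

# Example input: a sentence with both entity mentions marked.
sentence = "<e1>Marie Curie</e1> was born in <e2>Warsaw</e2>."
inputs = tokenizer(sentence, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits
predicted_relation = logits.argmax(dim=-1).item()
print(predicted_relation)
```

In a cross-lingual setting such as the one described above, a classifier like this would typically be fine-tuned on English annotations and then applied directly to the other languages, which is what motivates the additional distantly supervised multilingual pretraining.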