Relation classification (sometimes called 'extraction') requires trustworthy datasets for fine-tuning large language models, as well as for evaluation. Data collection is challenging for Indian languages, which are syntactically and morphologically diverse, as well as different from resource-rich languages like English. Despite recent interest in deep generative models for Indian languages, relation classification is still not well served by public datasets. In response, we present IndoRE, a dataset with 21K entity- and relation-tagged gold sentences in three Indian languages, plus English. We start with a multilingual BERT (mBERT) based system that captures entity span positions and type information and provides competitive monolingual relation classification. Using this system, we explore and compare transfer mechanisms between languages. In particular, we study the accuracy-efficiency tradeoff between expensive gold instances and translated and aligned 'silver' instances. We release the dataset for future research.
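To make the described setup concrete, the sketch below shows one common way to feed entity span positions and type information to mBERT for relation classification: typed marker tokens around the two argument spans, with a linear head over the [CLS] representation. This is a minimal illustration, not the authors' exact architecture; the marker token names, the example sentence, and the placeholder relation count are assumptions.

```python
# Minimal sketch (assumed, not the paper's exact model): typed entity markers
# inject span position and entity type into the input; a linear head over the
# [CLS] vector predicts the relation label.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

MODEL = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# Hypothetical typed markers for a PER-LOC sentence; a real system would add
# markers for every entity type in the dataset.
markers = ["[E1:PER]", "[/E1]", "[E2:LOC]", "[/E2]"]
tokenizer.add_special_tokens({"additional_special_tokens": markers})


class MarkedRelationClassifier(nn.Module):
    def __init__(self, num_relations: int):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(MODEL)
        # Grow the embedding matrix to cover the newly added marker tokens.
        self.encoder.resize_token_embeddings(len(tokenizer))
        self.head = nn.Linear(self.encoder.config.hidden_size, num_relations)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]   # [CLS] summary of the marked sentence
        return self.head(cls)               # relation logits


# Example sentence with typed markers around the two argument spans.
sentence = "[E1:PER] Rabindranath Tagore [/E1] was born in [E2:LOC] Kolkata [/E2] ."
batch = tokenizer(sentence, return_tensors="pt")
model = MarkedRelationClassifier(num_relations=50)  # placeholder label count
logits = model(batch["input_ids"], batch["attention_mask"])
predicted_relation = logits.argmax(dim=-1)
```

The same encoder can be fine-tuned on gold sentences in one language and evaluated on another, or on translated-and-aligned 'silver' instances, which is the transfer comparison the abstract refers to.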