Cross-lingual text classification requires task-specific training data in high-resource source languages, where the task is identical to that of a low-resource target language. However, collecting such training data can be infeasible because of labeling costs, task characteristics, and privacy concerns. This paper proposes an alternative solution that uses only task-independent word embeddings of high-resource languages and bilingual dictionaries. First, we construct a dictionary-based heterogeneous graph (DHG) from bilingual dictionaries, which opens the possibility of using graph neural networks for cross-lingual transfer. The remaining challenge is the heterogeneity of the DHG, since multiple languages are involved. To address this challenge, we propose the dictionary-based heterogeneous graph neural network (DHGNet), which effectively handles the heterogeneity of the DHG through two-step aggregation: word-level and language-level aggregation. Experimental results demonstrate that our method outperforms pretrained models even though it does not have access to large corpora. Furthermore, it performs well even when dictionaries contain many incorrect translations. This robustness allows the use of a wider range of dictionaries, such as automatically constructed and crowdsourced dictionaries, which are convenient for real-world applications.
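The graph construction and two-step aggregation described above can be illustrated with a minimal sketch. The toy dictionaries, random embeddings, and mean-pooling aggregators below are hypothetical stand-ins chosen for illustration, not the actual DHGNet implementation, which uses learned graph neural network layers:

```python
import numpy as np

# Toy bilingual dictionaries (hypothetical entries): each maps a word in a
# source language to its translation in a target language.
dictionaries = {
    ("en", "de"): [("dog", "Hund"), ("cat", "Katze")],
    ("en", "fr"): [("dog", "chien"), ("cat", "chat")],
}

# Build the DHG as adjacency lists. Neighbors are grouped by language so that
# the language-level step can aggregate each language's summary separately.
neighbors = {}  # (lang, word) -> {neighbor_lang: [(neighbor_lang, word), ...]}
for (src, tgt), pairs in dictionaries.items():
    for w_src, w_tgt in pairs:
        u, v = (src, w_src), (tgt, w_tgt)
        neighbors.setdefault(u, {}).setdefault(tgt, []).append(v)
        neighbors.setdefault(v, {}).setdefault(src, []).append(u)

# Random vectors stand in for task-independent pretrained word embeddings.
rng = np.random.default_rng(0)
dim = 4
emb = {node: rng.normal(size=dim) for node in neighbors}

def two_step_aggregate(node):
    """Word-level aggregation within each language, then language-level
    aggregation across the per-language summaries (mean pooling here)."""
    per_language = [
        np.mean([emb[n] for n in nbrs], axis=0)  # word-level step
        for nbrs in neighbors[node].values()
    ]
    # language-level step, also folding in the node's own embedding
    return np.mean(per_language + [emb[node]], axis=0)

h = two_step_aggregate(("en", "dog"))
print(h.shape)  # (4,)
```

Grouping by language before pooling is what keeps a language with many dictionary entries from dominating the aggregated representation, which is the point of making the aggregation two-step rather than flat.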