Recent years have seen breakthroughs in neural language models that capture nuances of language, culture, and knowledge. Neural networks are capable of translating between languages -- in some cases even between two languages where there is little or no access to parallel translations, in what is known as Unsupervised Machine Translation (UMT). Given this progress, it is intriguing to ask whether machine learning tools can ultimately enable understanding animal communication, particularly that of highly intelligent animals. Our work is motivated by an ambitious interdisciplinary initiative, Project CETI, which is collecting a large corpus of sperm whale communications for machine analysis. We propose a theoretical framework for analyzing UMT when no parallel data are available and when it cannot be assumed that the source and target corpora address related subject domains or possess similar linguistic structure. The framework requires access to a prior probability distribution that assigns non-zero probability to possible translations. We instantiate our framework with two models of language. Our analysis suggests that the accuracy of translation depends on the complexity of the source language and the amount of ``common ground'' between the source language and the target prior. We also prove upper bounds on the amount of data required from the source language in the unsupervised setting as a function of the amount of data required in a hypothetical supervised setting. Surprisingly, our bounds suggest that the amount of source data required for unsupervised translation is comparable to that required in the supervised setting. For one of the language models we analyze, we also prove a nearly matching lower bound. Our analysis is purely information-theoretic and as such can inform how much source data needs to be collected, but does not yield a computationally efficient procedure.