We explore meta-learning for language-independent unsupervised tokenization of English, Russian, and Chinese. We implement a meta-learning approach that automatically determines the hyper-parameters of the unsupervised tokenization model proposed in earlier works. The approach relies on human-independent fitness functions: normalised anti-entropy, compression factor, and cross-split F1 score, as well as additive and multiplicative combinations of the three. We test these fitness functions against the conventional F1 tokenization score. For English and Russian, we find a fairly good correlation between the conventional F1 score and the additive combination of the three metrics. In the case of Chinese, we find a significant correlation between the F1 score and the compression factor. Our results suggest that robust unsupervised tokenization of low-resource and dead languages is possible. They also allow us to think about human languages as efficient symbolic communication codes whose structural optimisation schemes have evolved differently in different human cultures.
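The evaluation described above can be illustrated with a minimal sketch. It makes several assumptions that the abstract does not confirm: the conventional F1 score is computed over character spans of tokens, the additive combination is an unweighted sum, the multiplicative combination is a plain product, and the correlation is Pearson's. All function names are hypothetical. The three human-independent metrics are taken as precomputed numbers, since their definitions are not given here.

```python
import math


def token_spans(tokens):
    """Map a token sequence to the set of (start, end) character spans."""
    spans, pos = set(), 0
    for tok in tokens:
        spans.add((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def tokenization_f1(predicted, reference):
    """Span-based F1 of a predicted tokenization against a reference one
    (assumed definition, not necessarily the one used in the paper)."""
    pred, ref = token_spans(predicted), token_spans(reference)
    true_pos = len(pred & ref)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(ref)
    return 2 * precision * recall / (precision + recall)


def additive(anti_entropy, compression, cross_split_f1):
    """Assumed additive composite: unweighted sum of the three metrics."""
    return anti_entropy + compression + cross_split_f1


def multiplicative(anti_entropy, compression, cross_split_f1):
    """Assumed multiplicative composite: product of the three metrics."""
    return anti_entropy * compression * cross_split_f1


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Under these assumptions, each hyper-parameter setting would yield one composite fitness value and one conventional F1 value, and `pearson` applied to the two resulting lists would indicate how well the human-independent fitness tracks the reference-based score.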