The success of deep learning in natural language processing raises intriguing questions about the nature of linguistic meaning and the ways in which it can be processed by natural and artificial systems. One such question concerns the subword segmentation algorithms widely employed in language modeling, machine translation, and other tasks since 2016. These algorithms often cut words into semantically opaque pieces, such as 'period', 'on', 't', and 'ist' in 'period|on|t|ist'. The system then represents the resulting segments in a dense vector space, which is expected to model grammatical relations among them. This representation may in turn be used to map 'period|on|t|ist' (English) to 'par|od|ont|iste' (French). Thus, instead of being modeled at the lexical level, translation is reformulated more generally as the task of learning the best bilingual mapping between the sequences of subword segments of two languages, and sometimes even between pure character sequences: 'p|e|r|i|o|d|o|n|t|i|s|t' $\rightarrow$ 'p|a|r|o|d|o|n|t|i|s|t|e'. Such subword segmentations and alignments are at work in highly efficient end-to-end machine translation systems, despite their allegedly opaque nature. The computational value of such processes is unquestionable. But do they have any linguistic or philosophical plausibility? I attempt to cast light on this question by reviewing the relevant details of the subword segmentation algorithms and by relating them to important philosophical and linguistic debates, in the spirit of making artificial intelligence more transparent and explainable.
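To make the segmentation step concrete, the sketch below shows a greedy longest-match subword segmenter, a deliberately simplified stand-in for the BPE/WordPiece-style algorithms discussed here; the two subword vocabularies are hypothetical, chosen only to reproduce the 'period|on|t|ist' and 'par|od|ont|iste' segmentations used as examples above.

```python
def segment(word, vocab):
    """Greedily split `word` into the longest subword pieces found in `vocab`.

    A simplified illustration of subword segmentation; real tokenizers
    (BPE, WordPiece, SentencePiece) learn their vocabularies from data.
    """
    pieces, i = [], 0
    while i < len(word):
        # Try the longest candidate substring first, shrinking toward i.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Fall back to a single character if nothing in vocab matches.
            pieces.append(word[i])
            i += 1
    return pieces

# Hypothetical per-language vocabularies, chosen for illustration only.
en_vocab = {"period", "on", "t", "ist"}
fr_vocab = {"par", "od", "ont", "iste"}

print(segment("periodontist", en_vocab))  # ['period', 'on', 't', 'ist']
print(segment("parodontiste", fr_vocab))  # ['par', 'od', 'ont', 'iste']
```

A neural translation system would then map the vector representations of one sequence of pieces to the other, rather than operating on whole lexical items.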