Segmentation remains an important preprocessing step, both in languages where "words" or other important syntactic/semantic units (such as morphemes) are not clearly delineated by white space, and when dealing with continuous speech data, where there is often no meaningful pause between words. Near-perfect supervised methods have been developed for resource-rich languages such as Chinese, but many of the world's languages are both morphologically complex and lack large datasets of "gold" segmentations into meaningful units. To solve this problem, we propose a new type of Segmental Language Model (Sun and Deng, 2018; Kawakami et al., 2019; Wang et al., 2021) for use in both unsupervised and lightly supervised segmentation tasks. We introduce a Masked Segmental Language Model (MSLM) built on a span-masking transformer architecture, harnessing the power of a bidirectional masked modeling context and attention. In a series of experiments, our model consistently outperforms Recurrent SLMs on Chinese (PKU Corpus) in segmentation quality, and performs similarly to the Recurrent model on English (PTB). We conclude by discussing the different challenges posed by segmenting phonemic-type writing systems.
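To make the span-masking objective concrete, the following is a minimal, hypothetical PyTorch sketch of the general idea: a contiguous span of tokens is replaced with a mask symbol, and a bidirectional transformer encoder must reconstruct it from the surrounding context. This is not the paper's MSLM implementation; the class name, hyperparameters, and training interface are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpanMaskingEncoder(nn.Module):
    """Toy bidirectional encoder trained to reconstruct a masked span
    from its two-sided context (illustrative only, not the paper's model)."""

    def __init__(self, vocab_size, d_model=64, nhead=4, num_layers=2, max_len=128):
        super().__init__()
        self.mask_id = vocab_size  # reserve one extra id for the [MASK] symbol
        self.embed = nn.Embedding(vocab_size + 1, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, ids, span_start, span_len):
        # Replace a contiguous span with [MASK]; the encoder sees both the
        # left and right context when predicting the hidden tokens.
        masked = ids.clone()
        masked[:, span_start:span_start + span_len] = self.mask_id
        positions = torch.arange(ids.size(1), device=ids.device)
        h = self.encoder(self.embed(masked) + self.pos(positions))
        # Score only the masked positions against the original tokens.
        logits = self.out(h[:, span_start:span_start + span_len])
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            ids[:, span_start:span_start + span_len].reshape(-1))

# Usage with random toy data:
model = SpanMaskingEncoder(vocab_size=100)
ids = torch.randint(0, 100, (2, 32))          # batch of 2 sequences, length 32
loss = model(ids, span_start=10, span_len=4)  # mask and reconstruct positions 10-13
loss.backward()
```

In a segmental LM, the probability of a candidate segment is derived from predictions like these over variable-length spans; a unidirectional recurrent SLM, by contrast, conditions only on the left context.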