Word- or word-fragment-based Language Models (LMs) are typically preferred over character-based ones in many downstream applications. This is perhaps unsurprising, since words appear to be more linguistically relevant units than characters. Words provide at least two kinds of relevant information: boundary information and meaningful units. However, word boundary information may be absent or unreliable in the case of speech input (word boundaries are not marked explicitly in the speech stream). Here, we systematically compare LSTMs as a function of the input unit (character, phoneme, word, word part), with or without gold boundary information. We probe linguistic knowledge in the networks at the lexical, syntactic and semantic levels using three speech-adapted, psycholinguistically-inspired black-box NLP benchmarks (pWUGGY, pBLIMP, pSIMI). We find that the absence of boundaries costs between 2\% and 28\% in relative performance depending on the task. We show that gold boundaries can be replaced by boundaries found automatically with an unsupervised segmentation algorithm, and that even modest segmentation quality yields a gain on two of the three tasks compared to baseline character/phone-based models without boundary information.
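To make the experimental manipulation concrete, below is a minimal sketch (not the paper's actual code) of how a phone-level corpus might be fed to an LSTM LM with or without gold word boundaries: with boundaries, a special boundary token is inserted between words; without, the phones are simply concatenated. All names here (BOUNDARY, to_units, UnitLSTM, layer sizes) are illustrative assumptions, and the toy corpus stands in for real phonemized speech transcripts.

\begin{verbatim}
# Sketch only: contrasting phone-level LM input with vs. without
# gold word boundaries. Names and sizes are assumptions, not the
# paper's implementation.
import torch
import torch.nn as nn

BOUNDARY = "<w>"  # hypothetical token marking gold word boundaries

def to_units(words, with_boundaries):
    """Flatten a list of phonemic words into one unit sequence."""
    units = []
    for w in words:
        units.extend(w)
        if with_boundaries:
            units.append(BOUNDARY)
    return units

class UnitLSTM(nn.Module):
    """Standard next-unit-prediction LSTM over characters/phones."""
    def __init__(self, vocab_size, embed=64, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, embed)
        self.lstm = nn.LSTM(embed, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, x):
        h, _ = self.lstm(self.emb(x))
        return self.out(h)

# Toy phonemized corpus: each "word" is a list of phones.
corpus = [["DH", "AH"], ["K", "AE", "T"], ["S", "AE", "T"]]
for flag in (False, True):
    seq = to_units(corpus, with_boundaries=flag)
    vocab = {u: i for i, u in enumerate(sorted(set(seq)))}
    ids = torch.tensor([[vocab[u] for u in seq]])
    model = UnitLSTM(len(vocab))
    logits = model(ids[:, :-1])  # predict the next unit at each step
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, len(vocab)), ids[:, 1:].reshape(-1))
    print(f"boundaries={flag}: seq_len={ids.size(1)}, "
          f"loss={loss.item():.3f}")
\end{verbatim}

In this setup, the boundaries found by an unsupervised segmentation algorithm would simply replace the gold positions at which BOUNDARY is inserted, leaving the model unchanged.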