Statistical language models conventionally learn representations from the contextual distribution of words or other formal units, whereas information related to the logographic features of written text is often ignored, on the assumption that it can be recovered from co-occurrence statistics. However, as language models become larger and require more data to learn reliable representations, this assumption may begin to break down, especially under conditions of data sparsity. Many languages, including Chinese and Vietnamese, use logographic writing systems in which surface forms are composed as a visual organization of smaller graphemic units that often carry semantic cues. In this paper, we present a novel study exploring the benefits of providing language models with logographic information for learning better semantic representations. We test our hypothesis on the natural language inference (NLI) task by evaluating the benefit of computing multi-modal representations that combine contextual information with glyph information. Our evaluation results in six languages with different typologies and writing systems suggest significant benefits of using multi-modal embeddings in languages with logographic systems, especially for words with sparse occurrence statistics.
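The abstract does not specify the fusion mechanism, but the core idea of a multi-modal embedding that combines a contextual vector with a glyph-derived vector can be sketched minimally. In this illustration, `glyph_embedding` uses a random projection as a stand-in for a real glyph encoder (e.g. a CNN over character images), and concatenation is just one plausible fusion strategy; all names and dimensions here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def glyph_embedding(glyph_bitmap: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Encode a flattened glyph bitmap into a dense vector.

    A random projection followed by tanh stands in for a learned
    visual encoder (e.g. a CNN over rendered character images).
    """
    return np.tanh(glyph_bitmap.flatten() @ proj)

def multimodal_embedding(contextual: np.ndarray, glyph: np.ndarray) -> np.ndarray:
    """Fuse contextual and glyph vectors by concatenation."""
    return np.concatenate([contextual, glyph])

# Toy example: an 8-dim contextual vector and a 12x12 binary glyph image.
contextual = rng.standard_normal(8)
bitmap = (rng.random((12, 12)) > 0.5).astype(float)
proj = rng.standard_normal((144, 4))  # placeholder for CNN encoder weights

fused = multimodal_embedding(contextual, glyph_embedding(bitmap, proj))
print(fused.shape)  # concatenation of 8 contextual + 4 glyph dimensions
```

For rare words, the glyph component supplies a signal that is available even when co-occurrence statistics are too sparse to learn a reliable contextual vector, which is consistent with the abstract's finding that the gains concentrate on low-frequency words.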