Natural language processing (NLP) techniques have become mainstream over the past decade. Most of these advances, however, concern the processing of a single language. More recently, with the extensive growth of social media platforms, focus has shifted to code-mixed text. Code-mixed text comprises text written in more than one language: people naturally tend to combine their local language with global languages such as English. Current NLP techniques are not sufficient to process such texts. As a first step, the text is processed to identify the language of each word. In this work, we focus on language identification in code-mixed sentences for Hindi-English mixed text. The task of language identification is formulated as a token classification task: in the supervised setting, each word in the sentence has an associated language label. We evaluate different combinations of deep learning models and input representations for this task. Specifically, character, sub-word, and word embeddings are considered in combination with CNN- and LSTM-based models. We show that sub-word representations together with an LSTM model give the best results; in general, sub-word representations perform significantly better than the other input representations. We report a best accuracy of 94.52% using a single-layer LSTM model on the standard SAIL ICON 2017 test set.
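The token-classification formulation described above can be sketched with a minimal PyTorch model: each sub-word or word index is embedded, passed through a single-layer LSTM, and mapped to a per-token language label. The class name, label set (hi/en/other), and all dimensions below are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TokenLangID(nn.Module):
    """Hypothetical sketch of a single-layer LSTM tagger that emits
    one language label per token (e.g. hi / en / other)."""

    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, num_labels=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=1,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) of sub-word/word indices
        out, _ = self.lstm(self.embed(token_ids))
        # one logit vector per token -> (batch, seq_len, num_labels)
        return self.classifier(out)

# Example: a batch of 2 sentences, 5 tokens each
model = TokenLangID(vocab_size=1000)
logits = model(torch.randint(1, 1000, (2, 5)))
print(tuple(logits.shape))  # (2, 5, 3): a label distribution per token
```

Training would then minimize a per-token cross-entropy loss against the gold language labels, exactly as in any sequence-tagging setup.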