Language identification of social media text has been an interesting problem of study in recent years. Social media messages from non-English-speaking regions are predominantly code-mixed. Pre-trained contextual embeddings have shown state-of-the-art results on a range of downstream tasks. Recently, models such as BERT have shown that pre-training on large amounts of unlabeled data makes the resulting language models even more beneficial for learning common language representations. This paper presents extensive experiments that exploit transfer learning and fine-tuning of BERT models to identify language on Twitter. The work utilizes a collection of Hindi-English-Urdu code-mixed text for language model pre-training and a Hindi-English code-mixed corpus for subsequent word-level language classification. The results show that representations pre-trained on code-mixed data produce better results than their monolingual counterparts.
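To make the word-level classification setup concrete, the following is a minimal sketch, not the authors' exact pipeline, of fine-tuning a BERT encoder for per-word language identification with the Hugging Face transformers API. The checkpoint name (`bert-base-multilingual-cased`), the tag set, and the example tweet are illustrative assumptions; the paper instead pre-trains its own model on Hindi-English-Urdu code-mixed text.

```python
# Sketch: fine-tune a BERT token-classification head for word-level language ID.
# Checkpoint, labels, and data below are placeholders, not the paper's resources.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["hi", "en", "other"]                      # hypothetical tag set
label2id = {l: i for i, l in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(labels)
)

# One code-mixed tweet, pre-split into words, with per-word language tags.
words = ["yaar", "this", "movie", "was", "bahut", "achhi"]
word_tags = ["hi", "en", "en", "en", "hi", "hi"]

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")

# BERT's subword tokenizer may split a word into several pieces; propagate the
# word-level tag to every piece and mask special tokens with -100.
aligned = [
    -100 if wid is None else label2id[word_tags[wid]]
    for wid in enc.word_ids(batch_index=0)
]
labels_tensor = torch.tensor([aligned])

# A single fine-tuning step; in practice this loops over the labeled corpus.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
loss = model(**enc, labels=labels_tensor).loss
loss.backward()
optimizer.step()
```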