Code-switching occurs when more than one language is mixed within a sentence or a conversation. This phenomenon is especially prominent on social media platforms, and its adoption is increasing over time. Code-mixed NLP has therefore been studied extensively in the literature. As pre-trained transformer-based architectures gain popularity, we observe that real code-mixed data are scarce for pre-training large language models. We present L3Cube-HingCorpus, the first large-scale real Hindi-English code-mixed dataset in Roman script. It consists of 52.93M sentences and 1.04B tokens, scraped from Twitter. We further present HingBERT, HingMBERT, HingRoBERTa, and HingGPT. The BERT models have been pre-trained on the code-mixed HingCorpus with the masked language modeling objective. We show the effectiveness of these BERT models on downstream tasks such as code-mixed sentiment analysis, POS tagging, NER, and LID from the GLUECoS benchmark. HingGPT is a GPT2-based generative transformer model capable of generating full tweets. We also release the L3Cube-HingLID Corpus, the largest code-mixed Hindi-English language identification (LID) dataset, and HingBERT-LID, a production-quality LID model, to facilitate the capture of more code-mixed data using the process outlined in this work. The dataset and models are available at https://github.com/l3cube-pune/code-mixed-nlp .
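A minimal sketch of querying such a pre-trained code-mixed BERT model for masked-token prediction, assuming the model is published on the Hugging Face Hub under the l3cube-pune namespace (the exact model ID below is an assumption; consult the linked GitHub repository for the released identifiers):

```python
# Sketch: masked language modeling with a Hindi-English code-mixed BERT.
# The model ID "l3cube-pune/hing-bert" is assumed; see the repo for actuals.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="l3cube-pune/hing-bert")

# A Romanized Hindi-English code-mixed sentence with one masked token.
for pred in fill_mask("yeh movie bahut [MASK] hai"):
    print(pred["token_str"], round(pred["score"], 3))
```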