Social media plays a significant role in cross-cultural communication. Much of this communication is code-mixed and multilingual, which poses a significant challenge for Natural Language Processing (NLP) tools on tasks such as language identification, topic modeling, and named-entity recognition. To address this, we introduce MMT, a large-scale multilingual and multi-topic dataset collected from Twitter (1.7 million tweets), encompassing 13 coarse-grained and 63 fine-grained topics in the Indian context. We further annotate a subset of 5,346 tweets from the MMT dataset with language labels covering various Indian languages and their code-mixed counterparts. We also demonstrate that existing tools fail to capture the linguistic diversity in MMT on two downstream tasks: topic modeling and language identification. To facilitate future research, we will make the anonymized and annotated dataset publicly available.