DravidianCodeMix: Sentiment Analysis and Offensive Language Identification Dataset for Dravidian Languages in Code-Mixed Text

This paper describes the development of a multilingual, manually annotated dataset for three under-resourced Dravidian languages, built from social media comments. The dataset was annotated for sentiment analysis and offensive language identification and covers more than 60,000 YouTube comments in total. It consists of around 44,000 comments in Tamil-English, around 7,000 comments in Kannada-English, and around 20,000 comments in Malayalam-English. The data was manually annotated by volunteer annotators and shows high inter-annotator agreement as measured by Krippendorff's alpha. The dataset contains all types of code-mixing phenomena, since it comprises user-generated content from a multilingual country. We also present baseline experiments that establish benchmarks on the dataset using machine learning methods. The dataset is available on GitHub (https://github.com/bharathichezhiyan/DravidianCodeMix-Dataset) and Zenodo (https://zenodo.org/record/4750858#.YJtw0SYo_0M).

