LAHM: Large Annotated Dataset for Multi-Domain and Multilingual Hate Speech Identification


April 3, 2023


Ankit Yadav, Shubham Chandel, Sushant Chatufale, Anil Bandhakavi

Current research on hate speech analysis is typically oriented towards monolingual settings and single classification tasks. In this paper, we present a new multilingual hate speech analysis dataset covering six languages (English, Hindi, Arabic, French, German and Spanish) and five hate speech domains: Abuse, Racism, Sexism, Religious Hate and Extremism. To the best of our knowledge, this paper is the first to address the problem of identifying various types of hate speech across these five broad domains in these six languages. In this work, we describe how we created the dataset, how we created high-level and low-level annotations for the different domains, and how we use it to test current state-of-the-art multilingual and multitask learning approaches. We evaluate our dataset in various monolingual, cross-lingual and machine translation classification settings and compare it against open-source English datasets that we aggregated and merged for this task. We then discuss how this approach can be used to create large-scale hate speech datasets and how our annotations can be leveraged to improve hate speech detection and classification in general.
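To make the evaluation settings in the abstract concrete, here is a minimal Python sketch of how a two-level annotation record and the monolingual and cross-lingual splits might be organized. Everything in it is an assumption for illustration: the abstract does not define the annotation schema, so the reading of "high level" as a binary hate/non-hate flag and "low level" as the domain label, the `Example` class, the language codes, the 80/20 split ratio and both split helpers are hypothetical and are not taken from the paper or its release.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Languages and domains as listed in the abstract; the short codes are our own.
LANGUAGES = ("en", "hi", "ar", "fr", "de", "es")
DOMAINS = ("abuse", "racism", "sexism", "religious_hate", "extremism")


@dataclass(frozen=True)
class Example:
    """One annotated text (hypothetical schema, not the dataset's real format)."""
    text: str
    lang: str               # one of LANGUAGES
    is_hate: bool           # assumed high-level annotation
    domain: Optional[str]   # assumed low-level annotation; None if not hate

    def __post_init__(self) -> None:
        if self.lang not in LANGUAGES:
            raise ValueError(f"unknown language: {self.lang}")
        if self.is_hate and self.domain not in DOMAINS:
            raise ValueError(f"unknown domain: {self.domain}")
        if not self.is_hate and self.domain is not None:
            raise ValueError("non-hate examples must not carry a domain")


def monolingual_split(
    data: List[Example], lang: str, train_ratio: float = 0.8
) -> Tuple[List[Example], List[Example]]:
    """Train and test on the same language (split ratio is an assumption)."""
    subset = [ex for ex in data if ex.lang == lang]
    cut = int(train_ratio * len(subset))
    return subset[:cut], subset[cut:]


def cross_lingual_split(
    data: List[Example], source_lang: str, target_lang: str
) -> Tuple[List[Example], List[Example]]:
    """Train on one language and test on another."""
    train = [ex for ex in data if ex.lang == source_lang]
    test = [ex for ex in data if ex.lang == target_lang]
    return train, test


# Toy data with placeholder text only.
toy = [
    Example("t1", "en", True, "racism"),
    Example("t2", "en", False, None),
    Example("t3", "en", True, "sexism"),
    Example("t4", "en", False, None),
    Example("t5", "en", True, "abuse"),
    Example("t6", "hi", True, "extremism"),
    Example("t7", "hi", False, None),
]

mono_train, mono_test = monolingual_split(toy, "en")
cross_train, cross_test = cross_lingual_split(toy, "en", "hi")
```

A machine translation setting could be layered on top of the cross-lingual split by translating the test texts into the training language before classification; that step is left out because it depends on an external translation system the abstract does not name.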

