Creating Custom Event Data Without Dictionaries: A Bag-of-Tricks


April 3, 2023


Andrew Halterman, Philip A. Schrodt, Andreas Beger, Benjamin E. Bagozzi, Grace I. Scarborough

Event data, or structured records of "who did what to whom" that are automatically extracted from text, is an important source of data for scholars of international politics. The high cost of developing new event datasets, especially with automated systems that rely on hand-built dictionaries, means that most researchers draw on large, pre-existing datasets such as ICEWS rather than developing tailor-made event datasets optimized for their specific research question. This paper describes a "bag of tricks" for efficient, custom event data production, drawing on recent advances in natural language processing (NLP) that allow researchers to rapidly produce customized event datasets. The paper introduces techniques for training an event category classifier with active learning; identifying actors and the recipients of actions in text using large language models, standard machine learning classifiers, and pretrained "question-answering" models from NLP; and resolving mentions of actors to their Wikipedia articles to categorize them. We describe how these techniques produced the new POLECAT global event dataset, which is intended to replace ICEWS, along with examples of how scholars can quickly produce smaller, custom event datasets. We publish example code and models to implement our new techniques.
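To make the active-learning step in the abstract concrete, here is a minimal sketch of one common query strategy, entropy-based uncertainty sampling. The abstract does not say which strategy the authors use, so this is an illustration under assumptions rather than their method: the function names, document IDs, probabilities, and event categories below are all hypothetical.

```python
import math
from typing import Dict, List


def entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_for_annotation(
    predictions: Dict[str, List[float]], batch_size: int
) -> List[str]:
    """Return the IDs of the `batch_size` documents the classifier is least sure about."""
    ranked = sorted(
        predictions,
        key=lambda doc_id: entropy(predictions[doc_id]),
        reverse=True,  # highest entropy (most uncertain) first
    )
    return ranked[:batch_size]


# Hypothetical classifier outputs over three made-up event categories
# (for example PROTEST, ASSAULT, NO_EVENT) for four unlabeled news sentences.
predictions = {
    "doc_a": [0.98, 0.01, 0.01],
    "doc_b": [0.34, 0.33, 0.33],
    "doc_c": [0.70, 0.20, 0.10],
    "doc_d": [0.50, 0.49, 0.01],
}

# These are the documents an annotator would be asked to label next.
to_label = select_for_annotation(predictions, batch_size=2)
print(to_label)
```

In a full loop, the newly labeled documents would be added to the training set, the classifier retrained, and the selection repeated, so that annotation effort goes to the examples the model finds hardest.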

