Deep learning-based text classification models need abundant labeled data to achieve competitive performance. Unfortunately, annotating a large corpus is time-consuming and laborious. To tackle this, many studies use data augmentation to expand the corpus. However, data augmentation may produce noisy augmented samples, and to date no work has explored sample selection for augmented samples in the natural language processing field. In this paper, we propose a novel self-training selection framework with two selectors that pick high-quality samples from data augmentation. Specifically, we first use an entropy-based strategy and the model's predictions to select augmented samples. Since some high-quality samples may be wrongly filtered out in this step, we propose to recall them from two perspectives: word overlap and semantic similarity. Experimental results demonstrate the effectiveness and simplicity of our framework.
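To make the two-stage selection concrete, here is a minimal Python sketch. It assumes a classifier exposing class probabilities and a sentence encoder, both passed in as callables; the function and parameter names (`select_augmented`, `predict_proba`, `encode`, the thresholds) are hypothetical illustrations, not the paper's actual interface or reported settings.

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def jaccard_overlap(a, b):
    """Word-overlap score between two whitespace-tokenized sentences."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    num = sum(x * y for x, y in zip(u, v))
    den = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return num / den if den else 0.0

def select_augmented(triples, predict_proba, encode,
                     entropy_threshold=0.5,
                     overlap_threshold=0.6,
                     similarity_threshold=0.8):
    """Two-stage selection over (original, augmented, label) triples.

    Stage 1 keeps augmented samples whose predicted distribution is
    low-entropy AND whose predicted label matches the original label.
    Stage 2 recalls filtered samples that stay close to their source
    sentence in word overlap or embedding similarity.
    Thresholds are illustrative assumptions, not values from the paper.
    """
    selected, filtered = [], []
    for orig, aug, label in triples:
        probs = predict_proba(aug)
        pred = max(range(len(probs)), key=probs.__getitem__)
        if entropy(probs) <= entropy_threshold and pred == label:
            selected.append((aug, label))
        else:
            filtered.append((orig, aug, label))
    # Recall stage: rescue samples that remain close to their source.
    for orig, aug, label in filtered:
        if (jaccard_overlap(orig, aug) >= overlap_threshold
                or cosine(encode(orig), encode(aug)) >= similarity_threshold):
            selected.append((aug, label))
    return selected
```

In practice, `predict_proba` would wrap the task classifier trained on the clean labeled data, and `encode` would be a pretrained sentence encoder; keeping the recall stage separate means conservative entropy filtering can be used without permanently discarding near-duplicates of trusted source sentences.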