While conventional wisdom suggests that more aggressively filtering data from low-quality sources like Common Crawl monotonically improves the quality of training data, we find that aggressive filtering can in fact decrease model quality on a wide array of downstream tasks for a GPT-like language model. We speculate that this is because optimizing sufficiently strongly for a proxy metric harms performance on the true objective, suggesting a need for more robust filtering objectives when attempting to filter more aggressively. We hope this work motivates more detailed future analysis of how dataset filtering design choices affect downstream model performance.
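To make the notion of "filtering aggressiveness" concrete, the sketch below shows one common form of classifier-based quality filtering, in the spirit of the Pareto-threshold rule described for GPT-3's Common Crawl filtering: each document receives a proxy quality score, and a single shape parameter controls how strongly the proxy is optimized. This is a minimal illustration, not the implementation studied here; `quality_score`, `alpha`, and `pareto_filter` are hypothetical names, and the proxy classifier itself is assumed to exist.

```python
import numpy as np
from typing import Callable, Iterable, List

def pareto_filter(
    docs: Iterable[str],
    quality_score: Callable[[str], float],  # hypothetical proxy metric in [0, 1]
    alpha: float,                           # larger alpha -> more aggressive filtering
    seed: int = 0,
) -> List[str]:
    """Keep a document iff a Pareto(alpha) draw exceeds 1 minus its proxy score.

    Small alpha keeps nearly everything; large alpha concentrates the kept
    set on documents the proxy classifier prefers -- the regime in which
    over-optimizing the proxy can degrade downstream task performance.
    """
    rng = np.random.default_rng(seed)
    kept = []
    for doc in docs:
        # Stochastic soft threshold: low-scoring documents are occasionally
        # kept, high-scoring ones are kept with high probability.
        if rng.pareto(alpha) > 1.0 - quality_score(doc):
            kept.append(doc)
    return kept
```

Under this kind of rule, sweeping `alpha` upward traces out increasingly aggressive filtering of the same corpus, which is the axis along which the non-monotonic downstream behavior described above can be observed.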