文本数据质量过滤经验探索 (An Empirical Exploration in Quality Filtering of Text Data) - 专知论文

会员服务 ·

0

Performer · MoDELS · 语言模型化 · 模型性能 · 稳健性 ·

2021 年 9 月 2 日

An Empirical Exploration in Quality Filtering of Text Data

翻译：文本数据质量过滤经验探索

While conventional wisdom suggests that more aggressively filtering data from low-quality sources like Common Crawl always monotonically improves the quality of training data, we find that aggressive filtering can in fact lead to a decrease in model quality on a wide array of downstream tasks for a GPT-like language model. We speculate that this is because optimizing sufficiently strongly for a proxy metric harms performance on the true objective, suggesting a need for more robust filtering objectives when attempting to filter more aggressively. We hope this work leads to detailed analysis of the effects of dataset filtering design choices on downstream model performance in future work.

翻译：传统智慧认为,更积极地从低质量来源(如共同的Crawl)过滤数据,总是单调地提高培训数据的质量,但我们发现,积极性的过滤实际上会导致类似GPT语言模式的一系列广泛的下游任务模型质量下降。我们推测,这是因为,这样做是因为对代用指标的优化足够强烈,会损害真实目标的绩效,表明在试图更积极过滤时需要更强有力的过滤目标。我们希望,这项工作将导致详细分析数据集过滤设计选择对下游模式未来工作业绩的影响。

0

相关内容

Performer

最新《文本简化》综述论文，26页pdf，A Survey on Text Simplification

最新《文本简化》综述论文，26页pdf，A Survey on Text Simplification

专知会员服务

15+阅读 · 2020年8月26日

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

专知会员服务

96+阅读 · 2020年3月12日

【理解计算机视觉损失函数】《Understanding Loss Functions in Computer Vision!》by Sowmya Yellapragad

【理解计算机视觉损失函数】《Understanding Loss Functions in Computer Vision!》by Sowmya Yellapragad

专知会员服务

44+阅读 · 2020年3月4日

【EMNLP 2019 最佳论文】信息瓶颈专门化单词嵌入（用于解析）（Specializing Word Embeddings（for Parsing）by Information Bottleneck）

【EMNLP 2019 最佳论文】信息瓶颈专门化单词嵌入（用于解析）（Specializing Word Embeddings（for Parsing）by Information Bottleneck）

专知会员服务

24+阅读 · 2019年11月20日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

专知会员服务

160+阅读 · 2019年10月12日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

[综述]深度学习下的场景文本检测与识别

[综述]深度学习下的场景文本检测与识别

专知会员服务

78+阅读 · 2019年10月10日

【CMU卡内基梅隆大学】深度学习在计算机视觉的应用：方法，解释，因果与公平性

【CMU卡内基梅隆大学】深度学习在计算机视觉的应用：方法，解释，因果与公平性

专知会员服务

83+阅读 · 2019年10月9日

计算机视觉最佳实践、代码示例和相关文档

计算机视觉最佳实践、代码示例和相关文档

专知会员服务

20+阅读 · 2019年10月9日

LibRec 精选：AutoML for Contextual Bandits

LibRec 精选：AutoML for Contextual Bandits

LibRec智能推荐

7+阅读 · 2019年9月19日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

逆强化学习-学习人先验的动机

逆强化学习-学习人先验的动机

CreateAMind

16+阅读 · 2019年1月18日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

无监督元学习表示学习

无监督元学习表示学习

CreateAMind

27+阅读 · 2019年1月4日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

Hierarchical Disentangled Representations

Hierarchical Disentangled Representations

CreateAMind

4+阅读 · 2018年4月15日

条件GAN重大改进！cGANs with Projection Discriminator

条件GAN重大改进！cGANs with Projection Discriminator

CreateAMind

8+阅读 · 2018年2月7日

Capsule Networks解析

Capsule Networks解析

机器学习研究会

11+阅读 · 2017年11月12日

Which Model To Trust: Assessing the Influence of Models on the Performance of Reinforcement Learning Algorithms for Continuous Control Tasks

Which Model To Trust: Assessing the Influence of Models on the Performance of Reinforcement Learning Algorithms for Continuous Control Tasks

Arxiv

0+阅读 · 2021年10月25日

Text Guide: Improving the quality of long text classification by a text selection method based on feature importance

Arxiv

0+阅读 · 2021年10月25日

Can Pretext-Based Self-Supervised Learning Be Boosted by Downstream Data? A Theoretical Analysis

Arxiv

0+阅读 · 2021年10月25日

Problems with information theoretic approaches to causal learning

Arxiv

0+阅读 · 2021年10月24日

Evaluation of Neural Architectures Trained with Square Loss vs Cross-Entropy in Classification Tasks

Arxiv

0+阅读 · 2021年10月23日

Better than Average: Paired Evaluation of NLP Systems

Arxiv

0+阅读 · 2021年10月20日

Theoretical Analysis of Self-Training with Deep Networks on Unlabeled Data

Arxiv

9+阅读 · 2021年2月8日

Meta-Learning to Cluster

Meta-Learning to Cluster

Arxiv

17+阅读 · 2019年10月30日

Language Modelling Makes Sense: Propagating Representations through WordNet for Full-Coverage Word Sense Disambiguation

Arxiv

3+阅读 · 2019年6月24日

Deep Semantic Hashing with Generative Adversarial Networks

Arxiv

5+阅读 · 2018年4月23日

VIP会员

文章信息

相关主题

语言模型化

相关VIP内容

最新《文本简化》综述论文，26页pdf，A Survey on Text Simplification

最新《文本简化》综述论文，26页pdf，A Survey on Text Simplification

专知会员服务

15+阅读 · 2020年8月26日

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

专知会员服务

96+阅读 · 2020年3月12日

【理解计算机视觉损失函数】《Understanding Loss Functions in Computer Vision!》by Sowmya Yellapragad

【理解计算机视觉损失函数】《Understanding Loss Functions in Computer Vision!》by Sowmya Yellapragad

专知会员服务

44+阅读 · 2020年3月4日

【EMNLP 2019 最佳论文】信息瓶颈专门化单词嵌入（用于解析）（Specializing Word Embeddings（for Parsing）by Information Bottleneck）

【EMNLP 2019 最佳论文】信息瓶颈专门化单词嵌入（用于解析）（Specializing Word Embeddings（for Parsing）by Information Bottleneck）

专知会员服务

24+阅读 · 2019年11月20日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

专知会员服务

160+阅读 · 2019年10月12日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

[综述]深度学习下的场景文本检测与识别

[综述]深度学习下的场景文本检测与识别

专知会员服务

78+阅读 · 2019年10月10日

【CMU卡内基梅隆大学】深度学习在计算机视觉的应用：方法，解释，因果与公平性

【CMU卡内基梅隆大学】深度学习在计算机视觉的应用：方法，解释，因果与公平性

专知会员服务

83+阅读 · 2019年10月9日

计算机视觉最佳实践、代码示例和相关文档

计算机视觉最佳实践、代码示例和相关文档

专知会员服务

20+阅读 · 2019年10月9日

热门VIP内容

开通专知VIP会员享更多权益服务

无人机蜂群——作为执行非常规战争的创新工具 | 2025最新文献

【NeurIPS 2025】以语言为中心的全模态表征学习的可扩展性研究

接纳无人机多样性：西方军事在无人机战争中适应的五个挑战 | 28页报告

高北地区的军用无人机创新：挪威、瑞典和芬兰的比较分析 | 2025最新64页

相关资讯

LibRec 精选：AutoML for Contextual Bandits

LibRec 精选：AutoML for Contextual Bandits

LibRec智能推荐

7+阅读 · 2019年9月19日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

逆强化学习-学习人先验的动机

逆强化学习-学习人先验的动机

CreateAMind

16+阅读 · 2019年1月18日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

无监督元学习表示学习

无监督元学习表示学习

CreateAMind

27+阅读 · 2019年1月4日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

Hierarchical Disentangled Representations

Hierarchical Disentangled Representations

CreateAMind

4+阅读 · 2018年4月15日

条件GAN重大改进！cGANs with Projection Discriminator

条件GAN重大改进！cGANs with Projection Discriminator

CreateAMind

8+阅读 · 2018年2月7日

Capsule Networks解析

Capsule Networks解析

机器学习研究会

11+阅读 · 2017年11月12日

相关论文

Which Model To Trust: Assessing the Influence of Models on the Performance of Reinforcement Learning Algorithms for Continuous Control Tasks

Which Model To Trust: Assessing the Influence of Models on the Performance of Reinforcement Learning Algorithms for Continuous Control Tasks

Arxiv

0+阅读 · 2021年10月25日

Text Guide: Improving the quality of long text classification by a text selection method based on feature importance

Arxiv

0+阅读 · 2021年10月25日

Can Pretext-Based Self-Supervised Learning Be Boosted by Downstream Data? A Theoretical Analysis

Arxiv

0+阅读 · 2021年10月25日

Problems with information theoretic approaches to causal learning

Arxiv

0+阅读 · 2021年10月24日

Evaluation of Neural Architectures Trained with Square Loss vs Cross-Entropy in Classification Tasks

Arxiv

0+阅读 · 2021年10月23日

Better than Average: Paired Evaluation of NLP Systems

Arxiv

0+阅读 · 2021年10月20日

Theoretical Analysis of Self-Training with Deep Networks on Unlabeled Data

Arxiv

9+阅读 · 2021年2月8日

Meta-Learning to Cluster

Meta-Learning to Cluster

Arxiv

17+阅读 · 2019年10月30日

Language Modelling Makes Sense: Propagating Representations through WordNet for Full-Coverage Word Sense Disambiguation

Arxiv

3+阅读 · 2019年6月24日

Deep Semantic Hashing with Generative Adversarial Networks

Arxiv

5+阅读 · 2018年4月23日

微信扫码咨询专知VIP会员