Selecting a suitable training dataset is crucial for both general-domain (e.g., GPT-3) and domain-specific (e.g., Codex) language models (LMs). We formalize this data selection problem as selecting a subset of a large raw unlabeled dataset to match a desired target distribution, given some unlabeled target samples. Due to the large scale and dimensionality of the raw text data, existing methods use simple heuristics to select data that are similar to a high-quality reference corpus (e.g., Wikipedia), or leverage experts to manually curate data. Instead, we extend the classic importance resampling approach used in low dimensions to LM data selection. Crucially, we work in a reduced feature space to make importance weight estimation tractable over the space of text. To determine an appropriate feature space, we first show that KL reduction, a data metric measuring the proximity between the selected data and the target in a feature space, correlates strongly with average accuracy on 8 downstream tasks (r = 0.89) when computed with simple n-gram features. From this observation, we present Data Selection with Importance Resampling (DSIR), an efficient and scalable algorithm that estimates importance weights in a reduced feature space (e.g., n-gram features in our instantiation) and selects data via importance resampling according to these weights. When training general-domain models (target: Wikipedia + books), DSIR improves over random selection and heuristic filtering baselines by 2--2.5% on the GLUE benchmark. When performing continued pretraining towards a specific domain, DSIR performs comparably to expert-curated data across 8 target distributions.
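To make the described procedure concrete, below is a minimal sketch of DSIR-style selection under stated assumptions: hashed unigram/bigram counts as the reduced feature space, bag-of-n-grams estimates of the target and raw feature distributions, and the Gumbel top-k trick for resampling without replacement. All names (`NUM_BUCKETS`, `featurize`, `dsir_select`) are illustrative placeholders, not the authors' reference implementation.

```python
import zlib
import numpy as np

NUM_BUCKETS = 10_000  # number of hash buckets for n-gram features (assumed)

def featurize(text: str) -> np.ndarray:
    """Hash unigrams and bigrams of whitespace-tokenized text into buckets."""
    counts = np.zeros(NUM_BUCKETS)
    tokens = text.lower().split()
    ngrams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    for g in ngrams:
        counts[zlib.crc32(g.encode()) % NUM_BUCKETS] += 1
    return counts

def fit_ngram_model(texts) -> np.ndarray:
    """Estimate a smoothed categorical distribution over hash buckets."""
    counts = np.ones(NUM_BUCKETS)  # add-one smoothing
    for t in texts:
        counts += featurize(t)
    return counts / counts.sum()

def dsir_select(raw_texts, target_texts, k: int, seed: int = 0):
    """Select k raw examples whose n-gram features match the target."""
    p_target = fit_ngram_model(target_texts)  # estimated target distribution
    p_raw = fit_ngram_model(raw_texts)        # estimated raw distribution
    log_ratio = np.log(p_target) - np.log(p_raw)
    # Log importance weight of each raw example under the bag-of-n-grams model.
    log_w = np.array([featurize(t) @ log_ratio for t in raw_texts])
    # Gumbel top-k: sample k indices without replacement, with probability
    # proportional to exp(log_w), by perturbing log-weights with Gumbel noise.
    rng = np.random.default_rng(seed)
    gumbel = rng.gumbel(size=len(raw_texts))
    chosen = np.argsort(-(log_w + gumbel))[:k]
    return [raw_texts[i] for i in chosen]
```

Resampling (rather than simply taking the top-k weighted examples) keeps the selected set distributed like the target instead of concentrating on a few extreme examples; the Gumbel perturbation is one standard way to implement that sampling without replacement in a single pass.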