Optimizing Data Usage via Differentiable Rewards

To acquire a new skill, humans learn better and faster if a tutor, based on their current knowledge level, tells them how much attention to pay to particular content or practice problems. Similarly, a machine learning model could potentially be trained better with a scorer that "adapts" to its current learning state and estimates the importance of each training data instance. Training such an adaptive scorer efficiently is a challenging problem: to precisely quantify the effect of a data instance at a given point during training, it is typically necessary to first complete the entire training process. To efficiently optimize data usage, we propose a reinforcement learning approach called Differentiable Data Selection (DDS). In DDS, we formulate a scorer network as a learnable function of the training data, which can be efficiently updated along with the main model being trained. Specifically, DDS updates the scorer with an intuitive reward signal: it should up-weight training data whose gradient is similar to the gradient of a dev set on which we would ultimately like to perform well. Without significant computing overhead, DDS delivers strong and consistent improvements over several strong baselines on two very different tasks: machine translation and image classification.
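The reward signal described above can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a linear model with squared-error loss, uses one logit per training example in place of a scorer network, measures gradient similarity by cosine similarity, and applies the exact expected-reward gradient of a softmax distribution rather than a sampled estimate. All function names (`dds_step`, `grad_squared_error`, and so on) and the data are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Cosine similarity; defined as 0.0 when either vector is zero.
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_squared_error(w, x, y):
    # Gradient of 0.5 * (w.x - y)^2 with respect to w.
    err = dot(w, x) - y
    return [err * xi for xi in x]

def mean_grad(w, data):
    grads = [grad_squared_error(w, x, y) for x, y in data]
    return [sum(g[i] for g in grads) / len(grads) for i in range(len(w))]

def dds_step(w, scorer_logits, train, dev, lr_model=0.1, lr_scorer=1.0):
    """One joint update of the model weights and the (tabular) scorer."""
    probs = softmax(scorer_logits)
    train_grads = [grad_squared_error(w, x, y) for x, y in train]
    dev_grad = mean_grad(w, dev)

    # Assumed reward: similarity between each training gradient and the dev gradient.
    rewards = [cosine(g, dev_grad) for g in train_grads]

    # Scorer update: gradient of J = sum_i p_i * r_i w.r.t. logit j is p_j * (r_j - J).
    expected = sum(p * r for p, r in zip(probs, rewards))
    new_logits = [z + lr_scorer * p * (r - expected)
                  for z, p, r in zip(scorer_logits, probs, rewards)]

    # Model update: gradient step on the scorer-weighted training loss.
    new_w = [wi - lr_model * sum(p * g[i] for p, g in zip(probs, train_grads))
             for i, wi in enumerate(w)]
    return new_w, new_logits, rewards

# Hypothetical data: the third training example has a label that contradicts the dev set.
train = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 0.0), -2.0)]
dev = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0)]

w, logits = [0.0, 0.0], [0.0, 0.0, 0.0]
w, logits, rewards = dds_step(w, logits, train, dev)
```

After a single step, the contradictory third example receives a negative reward and its logit drops below the others, so it contributes less to later model updates. A practical scorer would be a network over example features, and the way the reward is computed across successive model states may differ from this simplification.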
