Identifying the influence of training data for data cleansing can improve the accuracy of deep learning. SGD-influence, an approach based on stochastic gradient descent (SGD), was proposed to calculate such influence scores, but its computational cost is high: the model parameters must be temporarily stored during the training phase so that the influence scores can be calculated in the inference phase. Building on that method, we propose a method that reduces the size of the cache files used to store the parameters during training for the influence-score calculation. We adopt only the final parameters of the last epoch for the influence-function calculation. In our classification experiments on the MNIST dataset, the cache size with our approach is 1.236 MB, whereas the previous method requires a cache of 1.932 GB for the last epoch; the cache size is thus reduced to 1/1,563. We also observed the same accuracy improvement as the previous method when performing data cleansing by removing negatively influential data identified with our approach. Moreover, our simple and general method for calculating influence scores is available without programming in our AutoML tool, Neural Network Console. The source code is also available.
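To make the final-parameters idea concrete, below is a minimal sketch of a first-order influence approximation evaluated only at the final (last-epoch) parameters. This is an illustrative assumption, not the paper's exact SGD-influence computation: it scores each training example by the dot product between its loss gradient and the average validation-loss gradient, both taken at the final parameters. The function name influence_scores and the loader arguments are hypothetical.

    import torch
    from torch.utils.data import DataLoader

    def influence_scores(model, loss_fn, train_set, val_loader, device="cpu"):
        # Hypothetical first-order sketch, not the paper's algorithm:
        # score_i = grad L_val(theta_final) . grad L(z_i, theta_final),
        # using only the final parameters instead of per-step checkpoints.
        model.to(device).eval()
        params = [p for p in model.parameters() if p.requires_grad]

        # Average validation-loss gradient at the final parameters.
        val_grad = [torch.zeros_like(p) for p in params]
        n_val = 0
        for x, y in val_loader:
            x, y = x.to(device), y.to(device)
            grads = torch.autograd.grad(loss_fn(model(x), y), params)
            for vg, g in zip(val_grad, grads):
                vg.add_(g, alpha=x.size(0))
            n_val += x.size(0)
        val_grad = [vg / n_val for vg in val_grad]

        # Dot each training example's gradient with the validation gradient.
        scores = []
        for x, y in DataLoader(train_set, batch_size=1):
            x, y = x.to(device), y.to(device)
            grads = torch.autograd.grad(loss_fn(model(x), y), params)
            scores.append(sum((vg * g).sum()
                              for vg, g in zip(val_grad, grads)).item())
        return scores  # negative scores suggest harmful examples

Under the usual sign convention for this dot-product approximation, training examples whose gradients point against the validation gradient (negative scores) are candidates for removal in a data-cleansing pass, after which the model is retrained.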