Data Models for Dataset Drift Controls in Machine Learning With Images

Luis Oala, Marco Aversa, Gabriel Nobis, Kurt Willis, Yoan Neuenschwander, Michèle Buck, Christian Matek, Jerome Extermann, Enrico Pomarico, Wojciech Samek, Roderick Murray-Smith, Christoph Clausen, Bruno Sanguinetti

From arXiv. LO and MA contributed equally.

Camera images are ubiquitous in machine learning research. They also play a central role in the delivery of important services spanning medicine and environmental surveying. However, the application of machine learning models in these domains has been limited because of robustness concerns. A primary failure mode is a drop in performance caused by differences between the training and deployment data. While there are methods to prospectively validate the robustness of machine learning models to such dataset drifts, existing approaches do not account for explicit models of the primary object of interest: the data. This makes it difficult to create physically faithful drift test cases or to provide specifications of data models that should be avoided when deploying a machine learning model. In this study, we demonstrate how these shortcomings can be overcome by pairing machine learning robustness validation with physical optics. We examine the role raw sensor data and differentiable data models can play in controlling performance risks related to image dataset drift. The findings are distilled into three applications. First, drift synthesis enables the controlled generation of physically faithful drift test cases. The experiments presented here show that the average decrease in model performance is four to ten times less severe than under post-hoc augmentation testing. Second, the gradient connection between task and data models allows for drift forensics, which can be used to specify performance-sensitive data models that should be avoided during deployment of a machine learning model. Third, drift adjustment opens up the possibility of adjusting the processing in the face of drift. This can speed up and stabilize classifier training, with a margin of up to 20% in validation accuracy. A guide to access the open code and datasets is available at https://github.com/aiaudit-org/raw2logit.
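The ideas of drift synthesis and drift forensics can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy data model (black-level subtraction, gain, gamma), the toy task loss, and all function names are hypothetical and are not taken from the raw2logit code. The sensitivity is estimated with finite differences rather than automatic differentiation, which stands in for the gradient connection between task and data models described in the abstract.

```python
def process(raw, black_level=0.0, gain=1.0, gamma=2.2):
    """Toy data model: raw sensor values -> intensities in [0, 1].

    Hypothetical stand-in for a camera pipeline, not the raw2logit API.
    """
    out = []
    for v in raw:
        x = max(v - black_level, 0.0) * gain  # black-level subtraction, then gain
        x = min(x, 1.0)                       # clip to the valid range
        out.append(x ** (1.0 / gamma))        # gamma correction
    return out


def task_loss(image, target_mean=0.5):
    """Toy task model: squared error between mean brightness and a target."""
    mean = sum(image) / len(image)
    return (mean - target_mean) ** 2


def synthesize_drift(raw, gammas):
    """Drift synthesis: same raw data, different data-model parameters."""
    return {g: process(raw, gamma=g) for g in gammas}


def loss_sensitivity(raw, gamma, eps=1e-4):
    """Drift forensics: central finite-difference estimate of d(loss)/d(gamma)."""
    hi = task_loss(process(raw, gamma=gamma + eps))
    lo = task_loss(process(raw, gamma=gamma - eps))
    return (hi - lo) / (2 * eps)


# A tiny "raw image" as a flat list of sensor readings (illustrative values).
raw = [0.05, 0.10, 0.20, 0.40, 0.80]
variants = synthesize_drift(raw, gammas=[1.0, 2.2, 3.0])
sensitivities = {g: loss_sensitivity(raw, g) for g in variants}
```

Because every variant is produced from the same raw readings, the differences between them come only from the data-model parameters. The sign and magnitude of each sensitivity indicate which parameter settings the toy task is most affected by.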

