Camera images are ubiquitous in machine learning research. They also play a central role in the delivery of important services spanning medicine and environmental surveying. However, the application of machine learning models in these domains has been limited because of robustness concerns. A primary failure mode is a drop in performance caused by differences between the training and deployment data. While there are methods to prospectively validate the robustness of machine learning models to such dataset drifts, existing approaches do not account for explicit models of the primary object of interest: the data. This limits our ability to study and understand the relationship between data generation and downstream machine learning model performance in a physically accurate manner. In this study, we demonstrate how to overcome this limitation by pairing traditional machine learning with physical optics to obtain explicit and differentiable data models. We demonstrate how such data models can be constructed for image data and used to control downstream machine learning model performance related to dataset drift. The findings are distilled into three applications. First, drift synthesis enables the controlled generation of physically faithful drift test cases to power model selection and targeted generalization. Second, the gradient connection between the machine learning task model and the data model allows advanced, precise tolerancing of task model sensitivity to changes in the data generation. These drift forensics can be used to precisely specify the acceptable data environments in which a task model may be run. Third, drift optimization opens up the possibility of creating drifts that help the task model learn better and faster, effectively optimizing the data generating process itself. A guide to access the open code and datasets is available at https://github.com/aiaudit-org/raw2logit.