Collecting operationally realistic data to inform machine learning models can be costly. Before collecting new data, it is helpful to understand where a model is deficient. For example, object detectors trained on images of rare objects may perform poorly under conditions that are sparsely represented in the training data. We offer a way of informing subsequent data acquisition to maximize model performance by leveraging the toolkit of computer experiments together with metadata describing the circumstances under which the training data were collected (e.g., season, time of day, location). We do this by evaluating the learner as the training data are varied according to their metadata. A Gaussian process (GP) surrogate fit to that response surface can then inform new data acquisitions. This meta-learning approach improves learner performance compared with acquiring data whose metadata are selected at random, which we illustrate both on classic learning examples and on a motivating application involving the collection of aerial images in search of airplanes.
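The loop described above (vary the training data by metadata, record learner performance, fit a GP surrogate, pick the next acquisition) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the single metadata variable `frac_daytime`, the synthetic `evaluate_learner` stand-in, the fixed kernel hyperparameters, and the lower-confidence-bound selection rule are all hypothetical choices made for the example.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=0.2, variance=0.05):
    """Squared-exponential kernel between two 1-D input arrays."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = rbf_kernel(x_train, x_test)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    prior_var = np.diag(rbf_kernel(x_test, x_test))
    var = np.clip(prior_var - np.sum(v ** 2, axis=0), 0.0, None)
    return mean, np.sqrt(var)

def evaluate_learner(frac_daytime, rng):
    """Hypothetical stand-in for: subsample the training set so that a
    fraction `frac_daytime` of it carries the 'daytime' metadata tag,
    retrain the learner, and return its held-out error. Here it is a
    synthetic noisy quadratic so the sketch runs on its own."""
    return (frac_daytime - 0.6) ** 2 + 0.1 + rng.normal(scale=0.005)

rng = np.random.default_rng(0)

# 1. Design: metadata mixes at which the learner is trained and evaluated.
x_design = np.linspace(0.0, 1.0, 8)
y_design = np.array([evaluate_learner(x, rng) for x in x_design])

# 2. Surrogate: GP fit to the (metadata mix, error) response surface.
#    The responses are centered because the GP has a zero prior mean.
y_center = y_design.mean()
x_grid = np.linspace(0.0, 1.0, 101)
mean, sd = gp_posterior(x_design, y_design - y_center, x_grid)
mean = mean + y_center

# 3. Acquisition (assumed rule): the mix with the lowest optimistic error.
lcb = mean - 2.0 * sd
x_next = x_grid[np.argmin(lcb)]
print(f"suggested daytime fraction for the next acquisition: {x_next:.2f}")
```

In practice `evaluate_learner` would be replaced by actual training and validation runs, the metadata space would typically have several dimensions (season, time of day, location), and the kernel hyperparameters would be estimated from the data rather than fixed.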