The proliferation of automated data collection schemes and advances in sensor technology are increasing the amount of data that can be monitored in real time. However, because annotation is costly and quality inspections are time-consuming, data is often available only in unlabeled form. This is fostering the use of active learning for the development of soft sensors and predictive models. In production, instead of performing random inspections to obtain product information, labels are collected by evaluating the information content of the unlabeled data. Several query strategy frameworks for regression have been proposed in the literature, but most of the attention has been devoted to the static pool-based scenario. In this work, we propose a new strategy for the stream-based scenario, where instances are offered to the learner sequentially and the learner must decide immediately whether to perform the quality check to obtain the label or to discard the instance. The approach is inspired by optimal experimental design theory, and the iterative aspect of the decision-making process is handled by setting a threshold on the informativeness of the unlabeled data points. The proposed approach is evaluated using numerical simulations and the Tennessee Eastman Process simulator. The results confirm that selecting the examples suggested by the proposed algorithm allows for a faster reduction in the prediction error.
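To make the stream-based decision rule concrete, the following is a minimal sketch and not the algorithm proposed in this work. It assumes a linear regression model, and it assumes the informativeness of an instance x is measured by the quantity x^T (X^T X)^{-1} x, which is proportional to the prediction variance and is a common choice in optimal experimental design. The function name `stream_active_learning`, the `oracle` callback standing in for the quality check, the fixed `threshold`, and the small `ridge` term are all hypothetical choices made for illustration.

```python
import numpy as np


def stream_active_learning(stream, oracle, X_init, y_init, threshold, ridge=1e-6):
    """Query a label only when an instance's informativeness exceeds `threshold`.

    Assumptions (not taken from the source): linear model, informativeness
    measured as x^T (X^T X + ridge*I)^{-1} x, fixed threshold.
    """
    X = np.asarray(X_init, dtype=float)
    y = np.asarray(y_init, dtype=float)
    n_features = X.shape[1]

    # Inverse of the (slightly regularized) information matrix of the labeled set.
    A_inv = np.linalg.inv(X.T @ X + ridge * np.eye(n_features))
    n_queried = 0

    for x in stream:
        x = np.asarray(x, dtype=float)

        # Informativeness score: proportional to the prediction variance at x.
        score = float(x @ A_inv @ x)

        if score > threshold:
            # Perform the (costly) quality check to obtain the label.
            label = oracle(x)
            X = np.vstack([X, x])
            y = np.append(y, label)

            # Sherman-Morrison rank-one update of the inverse information matrix.
            Ax = A_inv @ x
            A_inv = A_inv - np.outer(Ax, Ax) / (1.0 + score)
            n_queried += 1
        # Otherwise the instance is discarded and never revisited.

    # Least-squares (ridge-stabilized) coefficients from the labeled set.
    beta = A_inv @ X.T @ y
    return beta, n_queried
```

In this sketch a higher `threshold` means fewer quality checks, and each accepted instance shrinks the score of later instances that lie in similar directions, so queries tend to become rarer as the stream progresses.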