Auto-encoder based Model for High-dimensional Imbalanced Industrial Data

With the proliferation of IoT devices, distributed control systems are now capturing and processing data from more sensors at higher frequencies than ever before. Because of their volume and novelty, these new data cannot be consumed effectively without the help of data-driven techniques. Deep learning is emerging as a promising technique for analyzing these data, particularly in soft sensor modeling. Its strong capability to represent complex data and its architectural flexibility make it a topic of active applied research in industrial settings. However, successful applications of deep learning in soft sensing are still not widely integrated into factory control systems, because most soft sensing research does not have access to large-scale industrial data, which are varied, noisy, and incomplete. The results published in most research papers are therefore not easily reproduced when applied to the variety of data found in industrial settings. Here we provide manufacturing data sets that are much larger and more complex than publicly available soft sensor data. Moreover, the data sets come from Seagate factories in active service with only the necessary anonymization, so they reflect the complex and noisy nature of real-world data. We introduce a variance-weighted multi-headed auto-encoder classification model that is well suited to high-dimensional and highly imbalanced data. In addition to using weighting or sampling methods to handle the highly imbalanced data, the model simultaneously predicts multiple outputs by exploiting output-supervised representation learning and multi-task weighting.
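To make the model description above more concrete, the following is a minimal sketch of a multi-headed auto-encoder classifier with a combined loss. The abstract does not give the exact architecture or weighting scheme, so everything here is an assumption for illustration: the single-layer encoder and decoder, the function names (`init_params`, `forward`, `weighted_bce`, `total_loss`), the positive-class weight used for imbalance, and in particular the use of one learned log-variance per task (homoscedastic uncertainty weighting) as a stand-in for "variance weighted" multi-task weighting. It computes a forward pass and a loss on synthetic data only; no training loop is included.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(n_features, n_latent, n_tasks):
    """Random weights for a shared encoder, a decoder and one logistic head per task."""
    scale = 0.1
    return {
        "W_enc": rng.normal(0.0, scale, (n_features, n_latent)),
        "W_dec": rng.normal(0.0, scale, (n_latent, n_features)),
        "W_heads": rng.normal(0.0, scale, (n_tasks, n_latent)),
        "b_heads": np.zeros(n_tasks),
        # Assumption: one learnable log-variance per task for multi-task weighting.
        "log_var": np.zeros(n_tasks),
    }

def forward(params, x):
    """Return the reconstruction and the per-task positive-class probabilities."""
    z = np.tanh(x @ params["W_enc"])                      # shared latent code
    x_hat = z @ params["W_dec"]                           # reconstruction head
    logits = z @ params["W_heads"].T + params["b_heads"]  # one logit per task
    probs = 1.0 / (1.0 + np.exp(-logits))
    return x_hat, probs

def weighted_bce(probs, y, pos_weight):
    """Binary cross-entropy per task, with the rare positive class up-weighted."""
    eps = 1e-7
    p = np.clip(probs, eps, 1.0 - eps)
    per_sample = -(pos_weight * y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return per_sample.mean(axis=0)                        # shape: (n_tasks,)

def total_loss(params, x, y, pos_weight, recon_weight=1.0):
    """Reconstruction error plus variance-weighted sum of the task losses."""
    x_hat, probs = forward(params, x)
    recon = np.mean((x - x_hat) ** 2)
    task_losses = weighted_bce(probs, y, pos_weight)
    precision = np.exp(-params["log_var"])
    # Each task loss is scaled by its precision; log_var acts as a regulariser
    # that stops the model from inflating every variance to ignore the tasks.
    multi_task = np.sum(precision * task_losses + params["log_var"])
    return recon_weight * recon + multi_task

# Synthetic stand-in data: 50 sensor features, 3 binary outputs, about 5% positives.
x = rng.normal(size=(256, 50))
y = (rng.random((256, 3)) < 0.05).astype(float)
pos_rate = y.mean(axis=0)
pos_weight = (1.0 - pos_rate) / np.maximum(pos_rate, 1e-7)

params = init_params(n_features=50, n_latent=8, n_tasks=3)
loss = total_loss(params, x, y, pos_weight)
```

In this sketch the class imbalance is handled inside each head through `pos_weight`, while the balance between heads is handled by `log_var`. Resampling the minority class, which the abstract also mentions, would be an alternative to the per-class weight.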
