On Misbehaviour and Fault Tolerance in Machine Learning Systems

Machine learning (ML) provides us with numerous opportunities, allowing ML systems to adapt to new situations and contexts. At the same time, this adaptability raises uncertainties concerning the run-time product quality or dependability of these systems, such as their reliability and security. Systems can be tested and monitored, but this does not provide protection against faults and failures in the adapted ML systems themselves. We studied software designs that aim at introducing fault tolerance in ML systems so that possible problems in the ML components of the systems can be avoided. The research was conducted as a case study, and its data was collected through five semi-structured interviews with experienced software architects. We present a conceptualisation of the misbehaviour of ML systems, the perceived role of fault tolerance, and the designs used. Common patterns for incorporating ML components into designs in a fault-tolerant fashion have started to emerge. ML models are, for example, guarded by monitoring the inputs and their distribution, and by enforcing business rules on acceptable outputs. Multiple, specialised ML models are used to adapt to the variations and changes in the surrounding world, and simpler fallback techniques like default outputs are put in place to keep systems up and running in the face of problems. However, the general role of these patterns is not widely acknowledged. This is mainly due to the relative immaturity of using ML as part of a complete software system: the field still lacks established frameworks and practices beyond training to implement, operate, and maintain the software that utilises ML. ML software engineering needs further analysis and development on all fronts.
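The guarding patterns named in the abstract (checking inputs, enforcing business rules on outputs, and falling back to a default output) can be illustrated with a minimal sketch. This is not a design taken from the study: the class name `GuardedModel`, its fields, and all thresholds are hypothetical, and a simple per-feature range check stands in for real monitoring of the input distribution.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class GuardedModel:
    """Hypothetical wrapper: input guard, output business rule, default output."""

    model: Callable[[Sequence[float]], float]
    input_low: float       # assumed lower bound of the expected input range
    input_high: float      # assumed upper bound of the expected input range
    output_low: float      # business rule: smallest acceptable output
    output_high: float     # business rule: largest acceptable output
    default_output: float  # returned whenever a guard trips or the model fails

    def predict(self, features: Sequence[float]) -> Tuple[float, str]:
        # Input guard: a crude stand-in for monitoring the input distribution.
        if not all(self.input_low <= x <= self.input_high for x in features):
            return self.default_output, "input_out_of_range"
        # Keep the system running even if the ML component raises.
        try:
            y = self.model(features)
        except Exception:
            return self.default_output, "model_error"
        # Output guard: enforce the business rule on acceptable outputs.
        if not (self.output_low <= y <= self.output_high):
            return self.default_output, "output_rule_violated"
        return y, "ok"


def broken_model(features: Sequence[float]) -> float:
    """Simulates a failing ML component."""
    raise RuntimeError("model unavailable")


# Toy "model" that sums its features; the limits are arbitrary examples.
guarded = GuardedModel(model=sum, input_low=0.0, input_high=10.0,
                       output_low=0.0, output_high=15.0, default_output=1.0)
failing = GuardedModel(model=broken_model, input_low=0.0, input_high=10.0,
                       output_low=0.0, output_high=15.0, default_output=1.0)
```

Returning a status string next to the value lets the surrounding system log or count how often each guard trips, which is one possible way to notice that a model no longer fits its environment.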
