Deep learning (DL) plays an increasingly important role in daily life owing to its competitive performance across many industrial application domains. As the core of DL-enabled systems, deep neural networks (DNNs) automatically learn from carefully collected and organized training data in order to predict the labels of unseen data. Just as traditional software systems must be comprehensively tested, DNNs must be carefully evaluated to ensure that the quality of the trained model meets requirements. In practice, the de facto industry standard for assessing the quality of a DNN is to check its performance (accuracy) on a collected set of labeled test data. However, preparing such labeled data is often difficult, partly because data labeling is labor-intensive, especially given the massive volume of new unlabeled data arriving every day. Recent studies show that test selection for DNNs is a promising direction for tackling this issue: a minimal representative subset of the data is selected and labeled, and the model is assessed on that subset. However, this approach still requires human labeling effort and cannot be fully automated. In this paper, we propose a novel technique, named Aries, that estimates the performance of DNNs on new unlabeled data using only the information obtained from the original test data. The key insight behind our technique is that a model should have similar prediction accuracy on data that lie at similar distances from the decision boundary. We performed a large-scale evaluation of our technique on 13 types of data transformation methods. The results demonstrate its usefulness: the accuracy estimated by Aries is only 0.03% to 2.60% (0.61% on average) off the true accuracy. In addition, Aries outperforms the state-of-the-art selection-labeling-based methods in most (96 out of 128) cases.
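The key insight can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how distance to the decision boundary is measured, so the sketch assumes the softmax margin (top-1 minus top-2 probability) as a stand-in proxy, and the names `margin`, `bucket_of`, and `estimate_accuracy` are hypothetical. Labeled test samples are grouped into buckets by this proxy, the accuracy of each bucket is recorded, and each unlabeled sample is then credited with the accuracy of the bucket it falls into.

```python
from collections import defaultdict

def margin(probs):
    """Top-1 minus top-2 probability.

    ASSUMPTION: used here as a proxy for distance to the decision
    boundary; the abstract does not specify the actual measure.
    """
    top = sorted(probs, reverse=True)
    return top[0] - top[1]

def bucket_of(score, n_buckets):
    # Map a score in [0, 1] to a bucket index in 0..n_buckets-1.
    return min(int(score * n_buckets), n_buckets - 1)

def estimate_accuracy(test_probs, test_labels, new_probs, n_buckets=10):
    """Estimate accuracy on unlabeled data from labeled test data only."""
    hits = defaultdict(int)    # correct predictions per bucket
    totals = defaultdict(int)  # labeled test samples per bucket
    for probs, label in zip(test_probs, test_labels):
        b = bucket_of(margin(probs), n_buckets)
        totals[b] += 1
        hits[b] += int(probs.index(max(probs)) == label)

    # Fallback for buckets that contain no labeled test samples.
    overall = sum(hits.values()) / sum(totals.values())

    # Each unlabeled sample contributes the accuracy of its bucket.
    estimate = 0.0
    for probs in new_probs:
        b = bucket_of(margin(probs), n_buckets)
        estimate += hits[b] / totals[b] if totals[b] else overall
    return estimate / len(new_probs)
```

In this sketch the number of buckets trades resolution against the number of labeled samples available per bucket; the fallback to overall test accuracy for empty buckets is likewise an illustrative choice rather than something stated in the abstract.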