Deep learning (DL) plays an increasingly important role in our daily life due to its competitive performance in industrial application domains. As the core of DL-enabled systems, deep neural networks (DNNs) need to be carefully evaluated to ensure that the produced models meet the expected requirements. In practice, the \emph{de facto standard} for assessing the quality of DNNs in industry is to check their performance (accuracy) on a collected set of labeled test data. However, preparing such labeled data is often difficult, partly because of the huge labeling effort: data labeling is labor-intensive, especially given the massive amount of new unlabeled data arriving every day. Recent studies show that test selection for DNNs is a promising direction that tackles this issue by selecting a minimal set of representative data to label and using these data to assess the model. However, such approaches still require human effort and cannot be fully automated. In this paper, we propose a novel technique, named \textit{Aries}, that can estimate the performance of DNNs on new unlabeled data using only the information obtained from the original test data. The key insight behind our technique is that the model should have similar prediction accuracy on data that have similar distances to the decision boundary. We performed a large-scale evaluation of our technique on two famous datasets, CIFAR-10 and Tiny-ImageNet, four widely studied DNN models including ResNet101 and DenseNet121, and 13 types of data transformation methods. The results show that the accuracy estimated by \textit{Aries} is only 0.03\% -- 2.60\% off the true accuracy. Moreover, \textit{Aries} also outperforms state-of-the-art labeling-free methods in 50 out of 52 cases and selection-labeling-based methods in 96 out of 128 cases.
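The key insight above can be sketched in code. The following is an illustrative simplification, not the actual Aries algorithm: it uses the gap between the top-2 softmax probabilities as an assumed proxy for distance to the decision boundary, buckets the labeled test set by that proxy, and estimates accuracy on unlabeled data by weighting each bucket's test-set accuracy by the fraction of new samples falling into it. All function names, the bucketing scheme, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def margin(probs):
    # Proxy for distance to the decision boundary: the gap between the
    # top-2 class probabilities (a larger gap means the sample sits
    # farther from the boundary).
    top2 = np.sort(probs, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def fit_bucket_accuracy(probs, labels, edges):
    """Per-bucket accuracy of the model on the labeled test set,
    where buckets partition the margin range."""
    m = margin(probs)
    correct = probs.argmax(axis=1) == labels
    acc = np.full(len(edges) - 1, np.nan)
    for i in range(len(edges) - 1):
        mask = (m >= edges[i]) & (m < edges[i + 1])
        if mask.any():
            acc[i] = correct[mask].mean()
    return acc

def estimate_accuracy(probs_new, edges, bucket_acc):
    """Estimate accuracy on unlabeled data: weight each bucket's
    test-set accuracy by the share of new samples falling into it."""
    counts, _ = np.histogram(margin(probs_new), bins=edges)
    valid = ~np.isnan(bucket_acc) & (counts > 0)  # drop unseen buckets
    return float((bucket_acc[valid] * counts[valid]).sum() / counts[valid].sum())

# Toy demo with a synthetic 10-class "model" whose labels agree with its
# prediction about 70% of the time.
rng = np.random.default_rng(0)
probs_test = softmax(rng.normal(size=(2000, 10)))
labels = np.where(rng.random(2000) < 0.7,
                  probs_test.argmax(axis=1),
                  rng.integers(0, 10, size=2000))
edges = np.linspace(0.0, 1.0, 6)
bucket_acc = fit_bucket_accuracy(probs_test, labels, edges)

probs_new = softmax(rng.normal(size=(2000, 10)))  # "unlabeled" data
est = estimate_accuracy(probs_new, edges, bucket_acc)
```

In this toy setup the new data follows the same distribution as the test set, so the weighted estimate lands near the test-set accuracy; the interesting case, which Aries targets, is when the new data is distribution-shifted and the margin histogram changes.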