Deep learning (DL) has become an integral part of solutions to various important problems, which is why ensuring the quality of DL systems is essential. One of the challenges in achieving reliable and robust DL software is ensuring that algorithm implementations are numerically stable. DL algorithms require a large number and a wide variety of numerical computations. A naive implementation of a numerical computation can introduce errors that lead to incorrect or inaccurate learning and results. A numerical algorithm or a mathematical formula can have several implementations that are mathematically equivalent but have different numerical stability properties. Designing numerically stable algorithm implementations is challenging because it requires interdisciplinary knowledge of software engineering, DL, and numerical analysis. In this paper, we study two mature DL libraries, PyTorch and TensorFlow, with the goal of identifying unstable numerical methods and their solutions. Specifically, we investigate which DL algorithms are numerically unstable and conduct an in-depth analysis of the root causes, manifestations, and patches of numerical instabilities. Based on these findings, we launch {\it DeepStability}, the first database of numerical stability issues and solutions in DL. Our findings and {\it DeepStability} provide future references for developers and tool builders to prevent, detect, localize, and fix numerically unstable algorithm implementations. To demonstrate this, we used {\it DeepStability} to locate numerical stability issues in TensorFlow and submitted a fix, which has been accepted and merged.
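To make concrete the claim that mathematically equivalent implementations can differ in numerical stability, the following Python sketch compares two ways of computing log-sum-exp, an operation that commonly appears in softmax and cross-entropy computations. This example is an illustration chosen here and is not taken from the study or from {\it DeepStability}; the function names `logsumexp_naive` and `logsumexp_stable` are hypothetical, and the code assumes a non-empty list of finite floats.

```python
import math


def logsumexp_naive(xs):
    # Direct translation of log(sum_i exp(x_i)).
    # math.exp raises OverflowError once x is large enough (e.g. x = 1000.0).
    return math.log(sum(math.exp(x) for x in xs))


def logsumexp_stable(xs):
    # Mathematically equivalent form: m + log(sum_i exp(x_i - m)), m = max(xs).
    # Every exponent is <= 0, so exp never overflows, and the largest term is
    # exactly exp(0) = 1, so the sum is never zero.
    # Assumption: xs is non-empty and contains only finite values.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


if __name__ == "__main__":
    small = [1.0, 2.0, 3.0]
    large = [1000.0, 1000.0]

    # On moderate inputs the two forms agree up to rounding.
    print(logsumexp_naive(small), logsumexp_stable(small))

    # On large inputs only the shifted form produces a result.
    print(logsumexp_stable(large))
    try:
        logsumexp_naive(large)
    except OverflowError as err:
        print("naive form failed:", err)
```

On moderate inputs both functions return the same value up to rounding, while on large inputs the naive form fails and the shifted form still returns a finite result. The shift by the maximum is one well-known stabilization; it is shown only as an instance of the general phenomenon the abstract describes.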