Recent deep learning (DL) applications are mostly built on top of DL libraries, so the quality assurance of these libraries is critical to the dependable deployment of DL applications. Several techniques have therefore been proposed to test DL libraries by generating DL models as test inputs and feeding them to the libraries for inference, in order to exercise the library modules involved in a DL model's execution. However, the test effectiveness of these techniques is constrained by the diversity of the generated DL models. Our investigation finds that these techniques cover at most 11.7% of layer pairs (i.e., call sequences between two layer APIs) and 55.8% of layer parameters (e.g., "padding" in Conv2D). As a result, many bugs that arise from specific layer pairs and parameters are missed by existing techniques. To address these limitations, we propose MEMO, which efficiently generates diverse DL models by exploring layer types, layer pairs, and layer parameters. MEMO (1) designs an initial model reduction technique to boost test efficiency without compromising model diversity, and (2) designs a set of mutation operators for a customized Markov Chain Monte Carlo (MCMC) algorithm to explore new layer types, layer pairs, and layer parameters. We evaluate MEMO on seven popular DL libraries: four for model execution (TensorFlow, PyTorch, MXNet, and ONNX) and three for model conversion (Keras-MXNet, TF2ONNX, and ONNX2PyTorch). The evaluation results show that MEMO outperforms recent work by covering 10.3% more layer pairs, 15.3% more layer parameters, and 2.3% more library branches. Moreover, MEMO detects 29 new bugs in the latest versions of these DL libraries; 17 of them have been confirmed by DL library developers, and 5 of the confirmed bugs have been fixed.
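To make the layer pair metric concrete, the following is a minimal sketch of how layer pair coverage could be computed for a set of generated models. It is an illustration only and not MEMO's implementation: the four-entry `LAYER_TYPES` list, the edge-list model representation, and the helper names `layer_pairs` and `pair_coverage` are all assumptions introduced here, and the sketch treats every ordered pair of layer types as part of the coverage universe, which a real tool may restrict to pairs that are actually compatible.

```python
from itertools import product

# Hypothetical universe of layer types; a real library exposes many more.
LAYER_TYPES = ["Conv2D", "ReLU", "MaxPooling2D", "Dense"]


def layer_pairs(edges):
    """Return the set of (producer type, consumer type) pairs in one model."""
    return {(src, dst) for src, dst in edges}


def pair_coverage(models, layer_types=LAYER_TYPES):
    """Fraction of all ordered layer-type pairs exercised by the models.

    Assumption: every ordered pair of layer types counts toward the universe.
    """
    universe = set(product(layer_types, repeat=2))
    covered = set()
    for edges in models:
        # Ignore pairs that involve layer types outside the universe.
        covered |= layer_pairs(edges) & universe
    return len(covered) / len(universe)


# Two toy models, each given as a list of (producer, consumer) layer-type edges.
model_a = [("Conv2D", "ReLU"), ("ReLU", "MaxPooling2D"), ("MaxPooling2D", "Dense")]
model_b = [("Conv2D", "ReLU"), ("ReLU", "Dense")]

coverage = pair_coverage([model_a, model_b])
print(f"layer pair coverage: {coverage:.2%}")
```

In this toy setting the two models share one pair, so together they exercise 4 of the 16 possible ordered pairs. Generating models that introduce previously unseen pairs is what raises this number.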