Deep Learning (DL) compilers are widely adopted to optimize advanced DL models for efficient deployment on diverse hardware. Their quality has a profound effect on the quality of the compiled DL models. A recent bug study shows that the optimization of high-level intermediate representation (IR) is the most error-prone compilation stage: bugs in this stage account for 44.92% of all collected bugs. However, existing testing techniques do not consider features related to high-level optimization (e.g., high-level IR), and are therefore weak at exposing bugs in this stage. To bridge this gap, we propose HirGen, an automated testing technique that aims to effectively expose coding mistakes in the optimization of high-level IR. The design of HirGen includes 1) three coverage criteria to generate diverse and valid computational graphs; 2) full use of the language features of high-level IR to generate diverse IRs; 3) three test oracles inspired by both differential testing and metamorphic testing. HirGen has successfully detected 21 bugs in TVM, of which 17 have been confirmed and 12 fixed. Further, we construct four baselines using state-of-the-art DL compiler fuzzers that can cover the high-level optimization stage. Our experimental results show that HirGen can detect 10 crashes and inconsistencies that the baselines cannot detect within 48 hours. We further validate the usefulness of our proposed coverage criteria and test oracles in our evaluation.
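To make the two oracle families named in the abstract concrete, the following is a minimal sketch on a toy expression IR. It is not HirGen's implementation and does not use TVM's API: the tuple-based IR, `evaluate`, `optimize`, and both oracle functions are hypothetical names introduced only for illustration. The differential oracle compares results with and without an optimization pass; the metamorphic oracle applies a semantics-preserving mutation (adding zero) and checks that the optimized results still agree.

```python
import math

# Toy "high-level IR" (an assumption for illustration): an expression is
# a number, a variable name, or a tuple (op, lhs, rhs) with op in {"add", "mul"}.
def evaluate(expr, env):
    """Reference interpreter: evaluates an expression without optimizing it."""
    if isinstance(expr, (int, float)):
        return float(expr)
    if isinstance(expr, str):
        return float(env[expr])
    op, lhs, rhs = expr
    a, b = evaluate(lhs, env), evaluate(rhs, env)
    return a + b if op == "add" else a * b

def optimize(expr):
    """Toy optimization pass: constant folding plus `x * 1` and `x + 0` removal."""
    if not isinstance(expr, tuple):
        return expr
    op, lhs, rhs = expr
    lhs, rhs = optimize(lhs), optimize(rhs)
    if isinstance(lhs, (int, float)) and isinstance(rhs, (int, float)):
        return evaluate((op, lhs, rhs), {})  # fold two constants
    if op == "mul" and rhs == 1:
        return lhs
    if op == "add" and rhs == 0:
        return lhs
    return (op, lhs, rhs)

def differential_oracle(expr, env, opt_pass=optimize):
    """Differential testing: optimized and unoptimized results must agree."""
    return math.isclose(evaluate(expr, env), evaluate(opt_pass(expr), env))

def metamorphic_oracle(expr, env, opt_pass=optimize):
    """Metamorphic testing: `expr + 0` must compile to an equivalent program."""
    mutated = ("add", expr, 0)  # semantics-preserving mutation
    return math.isclose(evaluate(opt_pass(expr), env),
                        evaluate(opt_pass(mutated), env))

def buggy_pass(expr):
    """Deliberately wrong pass (adds 1 to the result) to show a detected bug."""
    return ("add", optimize(expr), 1)

if __name__ == "__main__":
    program = ("add", ("mul", "x", 1), ("mul", 2, 3))
    inputs = {"x": 4.0}
    print(differential_oracle(program, inputs))              # correct pass
    print(metamorphic_oracle(program, inputs))               # correct pass
    print(differential_oracle(program, inputs, buggy_pass))  # buggy pass
```

In this sketch the differential oracle flags `buggy_pass`, while the metamorphic oracle alone would not, because the bug shifts both the original and the mutated program by the same amount. That is one reason a fuzzer may combine several oracles rather than rely on a single one.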