An Experimental Analysis of Graph-Distance Algorithms for Comparing API Usages

From arXiv. Accepted paper at the 21st IEEE International Working Conference on Source Code Analysis and Manipulation (SCAM), Replication and Negative Results (RENE) track.

Modern software development relies heavily on the reuse of functionality through Application Programming Interfaces (APIs). However, client developers can have trouble identifying the correct usage of a given API, causing misuses that lead to software crashes or usability bugs. Therefore, researchers have aimed at identifying API misuses automatically by comparing client code usages to correct API usages. Some techniques rely on API-specific graph-based data structures to improve the abstract representation of API usages. Such techniques need to compare graphs, for instance, by computing distance metrics based on the minimal graph edit distance or the largest common subgraphs, both of which are known to be NP-hard to compute. Fortunately, there exist many abstractions for simplifying graph distance computation. However, their applicability for comparing graph representations of API usages has not been analyzed. In this paper, we provide a comparison of different distance algorithms for API-usage graphs regarding correctness and runtime. In particular, correctness relates to the algorithms' ability to identify similar correct API usages, but also to discriminate between similar correct and false usages as well as non-similar usages. For this purpose, we systematically identified a set of eight graph-based distance algorithms and applied them to two datasets of real-world API usages and misuses. Interestingly, our results suggest that existing distance algorithms are not reliable for comparing API usage graphs. To improve on this situation, we identified and discuss the algorithms' issues, based on which we formulate hypotheses to initiate research on overcoming them.
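To make the notion of a graph edit distance between API usages concrete, the following is a minimal sketch, not one of the eight algorithms evaluated in the paper. It assumes a simplified representation (node ids mapped to API call labels, plus directed edges for call order), unit costs for every edit operation, and a hypothetical `Iterator` usage as the example. The search is exhaustive, which is only feasible for a handful of nodes and illustrates why the exact computation does not scale.

```python
from itertools import permutations


def graph_edit_distance(nodes1, edges1, nodes2, edges2):
    """Exact unit-cost edit distance between two small directed, node-labeled graphs.

    nodes*: dict mapping node id -> label; edges*: set of (source id, target id).
    Every node/edge insertion, deletion, and label substitution costs 1 (an assumption).
    """
    ids1 = list(nodes1)
    # Pad the candidate targets with None so any node of graph 1 can be deleted.
    targets = list(nodes2) + [None] * len(ids1)
    best = float("inf")
    # Enumerate every injective partial mapping from graph 1 nodes to graph 2 nodes.
    for choice in set(permutations(targets, len(ids1))):
        mapping = dict(zip(ids1, choice))
        cost = 0
        for u, v in mapping.items():
            if v is None or nodes1[u] != nodes2[v]:
                cost += 1  # node deletion or label substitution
        matched = sum(1 for v in choice if v is not None)
        cost += len(nodes2) - matched  # nodes of graph 2 that must be inserted
        # An edge is preserved if both endpoints are mapped onto an edge of graph 2.
        preserved = sum(
            1 for (a, b) in edges1
            if mapping[a] is not None and (mapping[a], mapping[b]) in edges2
        )
        cost += (len(edges1) - preserved) + (len(edges2) - preserved)
        best = min(best, cost)
    return best


# Hypothetical usage graphs: a correct Iterator usage and a misuse without the guard.
correct_nodes = {0: "Iterator.hasNext", 1: "Iterator.next"}
correct_edges = {(0, 1)}
misuse_nodes = {0: "Iterator.next"}
misuse_edges = set()

same = graph_edit_distance(correct_nodes, correct_edges, correct_nodes, correct_edges)
diff = graph_edit_distance(correct_nodes, correct_edges, misuse_nodes, misuse_edges)
print(same, diff)
```

Under these assumptions, the correct usage is at distance 0 from itself and at distance 2 from the misuse (one deleted node and one deleted edge). The number of candidate mappings grows factorially with the graph size, which is the motivation for the cheaper abstractions the abstract mentions.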
