With the rapid growth of software, the use of third-party libraries (TPLs) has become increasingly popular. Widespread library reuse gives software engineers a wide range of methods to facilitate and speed up program development. Unfortunately, it also poses great challenges, because managing a large volume of libraries becomes much more difficult. Many studies have proposed techniques to detect and understand the TPLs in software. However, most existing approaches rely on syntactic features, which are not robust when these features are changed or deliberately hidden by adversarial parties. Moreover, these approaches typically model each imported library as a whole and therefore cannot be applied to scenarios where the host software uses only part of the library code. To detect both fully and partially imported TPLs at the semantic level, we propose ModX, a framework that leverages novel program modularization techniques to decompose a program into fine-grained, functionality-based modules. By extracting both syntactic and semantic features, it measures the distance between modules to detect reuse of similar library modules in the program. Experimental results show that ModX outperforms other modularization tools by identifying more coherent program modules, with 353% higher module quality scores, and outperforms other TPL detection tools with, on average, 17% better precision and 8% better recall.
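The idea of matching modules by feature distance can be illustrated with a minimal sketch. This is not ModX's implementation: the feature vectors, the cosine distance, the threshold value, and all names (`match_library_modules`, `reuse_ratio`, the module labels) are assumptions chosen only for illustration. Each module is assumed to be already summarized as a numeric feature vector, and a library counts as partially imported when only some of its modules find a close match in the host program.

```python
from math import sqrt


def cosine_distance(a, b):
    """Cosine distance between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def match_library_modules(program_modules, library_modules, threshold=0.1):
    """Map each library module to its closest program module.

    Only pairs whose distance is at most `threshold` (a hypothetical
    cut-off) are reported as matches.
    """
    matches = {}
    for lib_name, lib_vec in library_modules.items():
        best_name, best_dist = None, threshold
        for prog_name, prog_vec in program_modules.items():
            dist = cosine_distance(lib_vec, prog_vec)
            if dist <= best_dist:
                best_name, best_dist = prog_name, dist
        if best_name is not None:
            matches[lib_name] = best_name
    return matches


def reuse_ratio(matches, library_modules):
    """Fraction of the library's modules found in the program."""
    if not library_modules:
        return 0.0
    return len(matches) / len(library_modules)


# Hypothetical feature vectors; real features would come from binary analysis.
program_modules = {
    "p0": [3.0, 1.0, 0.0, 2.0],
    "p1": [0.0, 4.0, 1.0, 0.0],
}
library_modules = {
    "inflate": [3.0, 1.0, 0.0, 2.0],
    "deflate": [0.0, 0.0, 5.0, 1.0],
    "crc": [1.0, 0.0, 0.0, 9.0],
}

matches = match_library_modules(program_modules, library_modules)
ratio = reuse_ratio(matches, library_modules)
print(matches, ratio)
```

In this toy input only one of the three library modules has a close counterpart in the program, which is the situation a whole-library model would miss and a module-level comparison can still report.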