Binary similarity analysis is critical to many code-reuse-related problems, and the "1-to-1" mechanism is widely applied: one function in a binary file is matched against one function in a source file or another binary file. However, we find that function mapping is in fact a more complex "1-to-n" or even "n-to-n" problem because of function inlining. In this paper, we investigate the effect of function inlining on binary similarity analysis. We first construct four inlining-oriented datasets for four similarity analysis tasks: code search, OSS reuse detection, vulnerability detection, and patch presence test. We then study the extent of function inlining, the performance of existing works under function inlining, and the effectiveness of existing inlining-simulation strategies. Results show that the proportion of function inlining can reach nearly 70%, while most existing works neglect it and use the "1-to-1" mechanism. The resulting mismatches cause a 30% loss in performance during code search and a 40% loss during vulnerability detection. Moreover, two existing inlining-simulation strategies can recover only 60% of the inlined functions. We also find that inlining is usually cumulative as the optimization level increases. We suggest conditional inlining and incremental inlining as a basis for designing low-cost, high-coverage inlining-simulation strategies.
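To make the "1-to-n" mapping concrete, the following minimal sketch computes, for each binary function, the set of source functions whose bodies it contains once inlining is taken into account. It is an illustration only and not the method used in this paper: the function names (`parse`, `helper`, `checksum`), the input format (a list of caller/callee pairs that the compiler inlined), and the helper `binary_to_source_map` are all hypothetical.

```python
from collections import defaultdict


def binary_to_source_map(inlined_calls):
    """Map each caller to every source function whose body ends up inside it.

    inlined_calls: iterable of (caller, callee) pairs, meaning the compiler
    inlined `callee` into `caller`. This input format is an assumption made
    for illustration.
    """
    callees = defaultdict(set)
    for caller, callee in inlined_calls:
        callees[caller].add(callee)

    def expand(func, seen):
        # Follow inlining transitively: if B is inlined into A and C into B,
        # the binary function for A also contains the body of C.
        for callee in callees.get(func, ()):
            if callee not in seen:
                seen.add(callee)
                expand(callee, seen)
        return seen

    # Each entry starts with the function itself, then adds inlined callees.
    return {func: sorted(expand(func, {func})) for func in list(callees)}


# Hypothetical example: helper is inlined into parse, checksum into helper.
mapping = binary_to_source_map([("parse", "helper"), ("helper", "checksum")])
for binary_func, source_funcs in mapping.items():
    print(binary_func, "->", source_funcs)
```

Under these assumptions, the binary function `parse` corresponds to three source functions rather than one, which is the kind of mismatch a strict "1-to-1" matcher would miss.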