The automation of an increasingly large number of software engineering tasks is becoming possible thanks to Machine Learning (ML). One foundational building block in the application of ML to software artifacts is the representation of these artifacts (e.g., source code or executable code) in a form that is suitable for learning. Many studies have leveraged representation learning, delegating to ML itself the job of automatically devising suitable representations. Yet, in the context of Android problems, existing models are either limited to the coarse-grained whole-app level (e.g., apk2vec) or built for one specific downstream task (e.g., smali2vec). Our work is part of a new line of research that investigates effective, task-agnostic, and fine-grained universal representations of bytecode to mitigate both of these limitations. Such representations aim to capture information relevant to various low-level downstream tasks (e.g., at the class level). We are inspired by the field of Natural Language Processing, where the problem of universal representation was addressed by building Universal Language Models, such as BERT, whose goal is to capture abstract semantic information about sentences in a way that is reusable across a variety of tasks. We propose DexBERT, a BERT-like Language Model dedicated to representing chunks of DEX bytecode, the main binary format used in Android applications. We empirically assess whether DexBERT is able to model the DEX language and evaluate its suitability in two distinct class-level software engineering tasks: Malicious Code Localization and Defect Prediction. We also experiment with strategies to handle apps of vastly different sizes, and we demonstrate one example of using our technique to investigate what information is relevant to a given task.