In recent years, the Portable Document Format, commonly known as PDF, has become a widely adopted standard for document exchange and dissemination, owing to characteristics such as its flexibility and its portability across platforms. The widespread use of PDF has instilled a false impression of inherent safety among ordinary users. However, these same characteristics have motivated attackers to exploit various types of vulnerabilities and to bypass security safeguards, making the PDF format one of the most effective vectors for malicious code. Detecting malicious PDF files efficiently is therefore crucial for information security. Several analysis techniques, whether static or dynamic, have been proposed in the literature to extract the main features that allow malicious files to be discriminated from benign ones. Since classical analysis techniques may be limited in the case of zero-day attacks, machine-learning-based techniques have recently emerged as an automatic PDF-malware detection method that is able to generalize from a set of training samples. These techniques themselves face the challenge of evasion attacks, in which a malicious PDF is transformed to look benign. In this work, we give an overview of the PDF-malware detection problem and a perspective on the new challenges and emerging solutions.