Organizations publish and share more and more electronic documents like PDF files. Unfortunately, most organizations are unaware that these documents can expose sensitive information such as authors' names and details of the information system and architecture. All this information can easily be exploited by attackers to footprint and later attack an organization. In this paper, we analyze hidden data found in the PDF files published by an organization. We gathered a corpus of 39664 PDF files published by 75 security agencies from 47 countries. We have been able to measure the quality and quantity of information exposed in these PDF files. This information can be effectively used to find weak links in an organization: employees who are running outdated software. We have also measured the adoption of PDF file sanitization by security agencies. We identified only 7 security agencies that sanitize a few of their PDF files before publishing. Unfortunately, we were still able to find sensitive information within 65% of these sanitized PDF files. Some agencies are using weak sanitization techniques: proper sanitization requires removing all hidden sensitive information from the file, not just the data at the surface. Security agencies need to change their sanitization methods.