Story Beyond the Eye: Glyph Positions Break PDF Text Redaction

In the past, redaction involved the use of black or white markers or paper cut-outs to obscure content on physical paper. Today many redactions take place on digital PDF documents, and redaction is often performed by software tools. Typical redaction tools remove text from PDF documents and draw a black or white rectangle in its place, mimicking a physical redaction. This practice is thought to be secure when the redacted text is removed and cannot be "copy-pasted" from the PDF document. We find this common conception is false: existing PDF redactions can be broken by precise measurements of non-redacted character positioning information. We develop a deredaction tool for automatically finding and breaking these vulnerable redactions. We report on 11 different redaction tools, finding that the majority do not remove redaction-breaking information, including some Adobe Acrobat workflows. We empirically measure the information leaks, finding that some redactions leak upwards of 15 bits of information, creating a 32,768-fold reduction in the space of potential redacted texts. We demonstrate a lower bound on the impact of these leaks via a 22,120 document study, including 18,975 Office of the Inspector General (OIG) investigation reports, where we find 769 vulnerable named-entity redactions. We find that leaked information reduces the contents of 164 of these redacted names to fewer than 494 possibilities from a 7 million name dictionary. We show the impact of these findings by breaking redactions from the Epstein/Maxwell case, the Manafort case, and a released Snowden document. Moreover, we develop an efficient algorithm for locating copy-pastable redactions and find over 100,000 poorly redacted words in US court documents. Current PDF text redaction methods are insufficient for named entity protection.
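To make the leak concrete, the following is a minimal sketch of the general idea, not the authors' deredaction tool. It assumes the width of the redacted gap can be recovered from the positions of the surrounding non-redacted glyphs, and it filters a candidate dictionary down to the entries whose rendered width matches that gap. The advance-width table, the function names, and the tolerance are all hypothetical values chosen for illustration; a real attack would read the widths from the font embedded in the PDF.

```python
import math

# Hypothetical advance widths in 1/1000 em units.
# Illustrative values only, not taken from any real font.
ADVANCE = {
    "i": 278, "j": 278, "l": 278, "t": 333, "r": 389, "s": 444,
    "a": 500, "e": 500, "y": 500, "d": 556, "h": 556, "k": 556,
    "n": 556, "o": 556, "w": 778, "m": 833,
}
DEFAULT_ADVANCE = 500  # assumed fallback for characters not in the table


def text_width(text, font_size=12.0):
    """Rendered width of `text` in points under the assumed width table."""
    units = sum(ADVANCE.get(ch.lower(), DEFAULT_ADVANCE) for ch in text)
    return units * font_size / 1000.0


def candidates_matching(gap_width, dictionary, tolerance=0.01):
    """Keep only the dictionary entries whose width fits the redacted gap."""
    return [w for w in dictionary if abs(text_width(w) - gap_width) <= tolerance]


def leaked_bits(candidates_before, candidates_after):
    """Information leaked, in bits, by shrinking the candidate set."""
    return math.log2(candidates_before / candidates_after)


names = ["smith", "jones", "lee", "kim", "miller"]
gap = text_width("smith")  # stand-in for a gap measured from glyph positions
matches = candidates_matching(gap, names)
```

In this toy dictionary only one name fits the gap. The same arithmetic explains the figures in the abstract: `leaked_bits(32768, 1)` gives 15 bits, and shrinking a 7 million name dictionary to 494 candidates corresponds to roughly 13.8 bits.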

