GovScape: A Public Multimodal Search System for 70 Million Pages of Government PDFs

Kyle Deeds, Ying-Hsiang Huang, Claire Gong, Shreya Shaji, Alison Yan, Leslie Harka, Samuel J Klein, Shannon Zejiang Shen, Mark Phillips, Trevor Owens, Benjamin Charles Germain Lee

From arXiv; 10 pages, 5 figures, 2 tables

Efforts over the past three decades have produced web archives containing billions of webpage snapshots and petabytes of data. The End of Term Web Archive alone contains, among other file types, millions of PDFs produced by the federal government. While preservation with web archives has been successful, significant challenges for access and discoverability remain. For example, current affordances for browsing the End of Term PDFs are limited to downloading and browsing individual PDFs, as well as performing basic keyword search across them. In this paper, we introduce GovScape, a public search system that supports multimodal searches across 10,015,993 federal government PDFs from the 2020 End of Term crawl (70,958,487 PDF pages in total). To our knowledge, this covers all renderable PDFs in the 2020 crawl that are 50 pages or under. GovScape supports four primary forms of search over these 10 million PDFs: in addition to providing (1) filter conditions over metadata facets, including domain and crawl date, and (2) exact text search against the PDF text, we provide (3) semantic text search and (4) visual search against individual PDF pages, enabling users to formulate queries such as "redacted documents" or "pie charts." We detail the constituent components of GovScape, including the search affordances, embedding pipeline, system architecture, and open source codebase. Significantly, the total estimated compute cost for GovScape's pre-processing pipeline for 10 million PDFs was approximately $1,500, equivalent to roughly 47,000 PDF pages per dollar spent on compute, demonstrating the potential for immediate scalability. Accordingly, we outline steps that we have already begun pursuing toward multimodal search at the 100+ million PDF scale. GovScape can be found at https://www.govscape.net.
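The cost figure in the abstract can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted above; the function projected_cost_usd is a hypothetical helper, and its linear extrapolation (constant cost per page, constant average pages per PDF) is an assumption made here for illustration, not a projection reported by the authors.

```python
# Figures quoted in the abstract.
TOTAL_PDFS = 10_015_993
TOTAL_PAGES = 70_958_487
COMPUTE_COST_USD = 1_500  # approximate estimate, per the abstract

# Pages processed per dollar of compute (the abstract rounds this to 47,000).
pages_per_dollar = TOTAL_PAGES / COMPUTE_COST_USD

# Average document length in the corpus (PDFs are capped at 50 pages).
pages_per_pdf = TOTAL_PAGES / TOTAL_PDFS


def projected_cost_usd(num_pdfs: int) -> float:
    """Hypothetical helper: extrapolate the pre-processing cost to num_pdfs.

    Assumption (not from the paper): cost scales linearly with page count
    and the average pages per PDF stays the same as in the 2020 crawl.
    """
    return num_pdfs * pages_per_pdf / pages_per_dollar


if __name__ == "__main__":
    print(f"{pages_per_dollar:,.0f} pages per dollar")
    print(f"{pages_per_pdf:.2f} pages per PDF on average")
    print(f"${projected_cost_usd(100_000_000):,.0f} projected for 100M PDFs")
```

Under these assumptions, the 100 million PDF scale mentioned in the abstract would cost on the order of $15,000 in compute. Real costs could differ if larger documents, different hardware pricing, or additional pipeline stages are involved.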
