Grappling with the Scale of Born-Digital Government Publications: Toward Pipelines for Processing and Searching Millions of PDFs - 专知论文

December 5, 2021

Benjamin Charles Germain Lee,Trevor Owens

From arXiv, 22 pages, 4 figures

Official government publications are key sources for understanding the history of societies. Web publishing has fundamentally changed the scale and processes by which governments produce and disseminate information. Significantly, a range of web archiving programs have captured massive troves of government publications. For example, hundreds of millions of unique U.S. Government documents posted to the web in PDF form have been archived by libraries to date. Yet, these PDFs remain largely unutilized and understudied in part due to the challenges surrounding the development of scalable pipelines for searching and analyzing them. This paper utilizes a Library of Congress dataset of 1,000 government PDFs in order to offer initial approaches for searching and analyzing these PDFs at scale. In addition to demonstrating the utility of PDF metadata, this paper offers computationally-efficient machine learning approaches to search and discovery that utilize the PDFs' textual and visual features as well. We conclude by detailing how these methods can be operationalized at scale in order to support systems for navigating millions of PDFs.
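To make the idea of text-based search over a PDF collection concrete, the following is a minimal sketch of a TF-IDF baseline with cosine-similarity ranking. It is not the authors' method: the abstract does not specify their models, and this baseline is swapped in purely for illustration. It assumes the text of each PDF has already been extracted into plain strings, and every name here (`tokenize`, `build_index`, `search`, the sample file names) is hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build TF-IDF vectors for a dict of {doc_id: extracted_text}."""
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    n_docs = len(docs)
    # Smoothed inverse document frequency: common terms keep a small weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = {
        doc_id: {t: tf * idf[t] for t, tf in counts.items()}
        for doc_id, counts in term_counts.items()
    }
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, idf, vectors, top_k=3):
    """Return up to top_k (doc_id, score) pairs with a nonzero score, best first."""
    q_counts = Counter(tokenize(query))
    # Query terms never seen in the collection are ignored.
    q_vec = {t: tf * idf[t] for t, tf in q_counts.items() if t in idf}
    scored = [(cosine(q_vec, vec), doc_id) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, score) for score, doc_id in scored[:top_k] if score > 0.0]


if __name__ == "__main__":
    # Hypothetical stand-ins for text extracted from three PDFs.
    sample = {
        "a.pdf": "annual budget report for the department of energy",
        "b.pdf": "web archiving guidelines for federal agencies",
        "c.pdf": "energy consumption statistics and budget tables",
    }
    idf, vectors = build_index(sample)
    for doc_id, score in search("web archiving", idf, vectors):
        print(doc_id, round(score, 3))
```

A sparse, dictionary-based index like this is cheap to build and easy to inspect, but it only matches exact tokens. A collection of millions of PDFs would more plausibly need a dedicated search engine or learned embeddings, and the visual features mentioned in the abstract are outside the scope of this sketch.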
