While India remains one of the hotspots of the COVID-19 pandemic, data about the pandemic from the country has proved largely inaccessible for use at scale. Much of the data exists in unstructured form on the web, and only limited aspects of it are available through public APIs that volunteers maintain manually. This makes detailed data hard to access and makes manual data-keeping hard to sustain over time. This paper reports on a recently launched project that automates the extraction of such data from public health bulletins using a combination of classical PDF parsers and state-of-the-art ML-based document-extraction APIs. We describe the automated data-extraction technique, the nature of the generated data, and exciting avenues of ongoing work.