While India has been one of the hotspots of COVID-19, data about the pandemic from the country has proved largely inaccessible at scale. Much of the data exists in unstructured form on the web, and only limited aspects of it are available through public APIs that volunteers maintain manually. This arrangement makes detailed data hard to access and manual record-keeping hard to sustain over time. This paper reports on our effort to automate the extraction of such data from public health bulletins using a combination of classical PDF parsers and state-of-the-art machine learning techniques. We describe the automated data-extraction technique, the nature of the generated data, and exciting avenues of ongoing work.