While humans can extract information from unstructured text with high precision and recall, manual extraction is often too time-consuming to be practical. Automated approaches, on the other hand, produce nearly immediate results, but may not be reliable enough for high-stakes applications where precision is essential. In this work, we consider the benefits and drawbacks of various human-only, human-machine, and machine-only information extraction approaches. We argue for the utility of a human-in-the-loop approach in applications where high precision is required but purely manual extraction is infeasible. We present a framework and an accompanying tool for information extraction using weak-supervision labelling with human validation. We demonstrate our approach on three criminal justice datasets. We find that the combination of computer speed and human understanding yields precision comparable to manual annotation while requiring only a fraction of the time, and that it significantly outperforms fully automated baselines in terms of precision.
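As an illustration only, the general idea of weak-supervision labelling followed by human validation can be sketched as below. This is a minimal sketch under stated assumptions, not the framework or tool presented in this work: the labelling functions, the `Candidate` type, the toy documents, and the simulated reviewer are all hypothetical and were chosen purely to show why a validation pass can raise precision over a machine-only pipeline.

```python
import re
from typing import Callable, List, NamedTuple, Set, Tuple


class Candidate(NamedTuple):
    """One proposed extraction (hypothetical type, for illustration)."""
    doc_id: int
    field: str
    value: str


# Hypothetical labelling functions: cheap, noisy heuristics that
# propose (field, value) pairs for a document.
def lf_sentenced_to(text: str) -> List[Tuple[str, str]]:
    m = re.search(r"sentenced to (\d+) years", text)
    return [("sentence_years", m.group(1))] if m else []


def lf_any_years(text: str) -> List[Tuple[str, str]]:
    # Deliberately over-general: matches ages and durations too.
    return [("sentence_years", n) for n in re.findall(r"(\d+) years", text)]


def propose(docs: List[str], lfs: List[Callable]) -> List[Candidate]:
    """Machine step: run every labelling function, deduplicate results."""
    seen: Set[Candidate] = set()
    out: List[Candidate] = []
    for doc_id, text in enumerate(docs):
        for lf in lfs:
            for field, value in lf(text):
                cand = Candidate(doc_id, field, value)
                if cand not in seen:
                    seen.add(cand)
                    out.append(cand)
    return out


def validate(cands: List[Candidate], docs: List[str],
             reviewer: Callable[[str, Candidate], bool]) -> List[Candidate]:
    """Human step: keep only the candidates the reviewer accepts."""
    return [c for c in cands if reviewer(docs[c.doc_id], c)]


def precision(extracted: List[Candidate], gold: Set[Candidate]) -> float:
    if not extracted:
        return 0.0
    return sum(c in gold for c in extracted) / len(extracted)


# Toy data (invented for this sketch, not from any real dataset).
docs = [
    "The defendant, aged 34 years, was sentenced to 5 years.",
    "The case was dismissed after 2 years of litigation.",
]
gold = {Candidate(0, "sentence_years", "5")}

candidates = propose(docs, [lf_sentenced_to, lf_any_years])

# Stand-in for a human reviewer: here simulated by a gold lookup.
validated = validate(candidates, docs, lambda text, c: c in gold)

print("machine-only precision:", precision(candidates, gold))
print("human-validated precision:", precision(validated, gold))
```

In this toy setting the reviewer only confirms or rejects proposals rather than reading each document from scratch, which is the intuition behind pairing machine speed with human judgment. A real reviewer would of course replace the gold lookup used here.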