Weak supervision has been applied to various Natural Language Understanding tasks in recent years. However, applications in document understanding have been limited by the technical challenges of scaling weak supervision to long-form documents that span up to hundreds of pages. At Lexion, we built a weak supervision-based system tailored for long-form (10 to 200 pages) PDF documents. We use this platform to build dozens of language understanding models and have applied it successfully to various domains, from commercial agreements to corporate formation documents. In this paper, we demonstrate the effectiveness of supervised learning with weak supervision when time, workforce, and training data are all limited. We built 8 high-quality machine learning models in the span of one week, with the help of a small team of just 3 annotators working with a dataset of under 300 documents. We share some details about our overall architecture, how we utilize weak supervision, and the results we are able to achieve. We also include the dataset for researchers who would like to experiment with alternative approaches or refine ours. Furthermore, we shed some light on the additional complexities that arise when working with poorly scanned long-form documents in PDF format, and on some of the techniques that help us achieve state-of-the-art performance on such data.