Typically, information extraction (IE) requires a pipeline approach: first, a sequence labeling model is trained on manually annotated documents to extract relevant spans; then, when a new document arrives, the model predicts spans, which are post-processed and standardized to convert the information into a database entry. We replace this labor-intensive workflow with a transformer language model trained on existing database records to directly generate structured JSON. Our solution removes the workload associated with producing token-level annotations and takes advantage of a data source that is generally plentiful: existing database records. Because long documents are common in information extraction tasks, we use gradient checkpointing and chunked encoding to apply our method to sequences of up to 32,000 tokens on a single GPU. Our Doc2Dict approach is competitive with more complex, hand-engineered pipelines and offers a simple but effective baseline for document-level information extraction. We release our Doc2Dict model and code to reproduce our experiments and facilitate future work.
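The chunked-encoding idea mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `chunk_tokens` helper, the chunk size, and the overlap parameter are hypothetical choices. Each chunk would be encoded independently (optionally wrapped in gradient checkpointing, e.g. `torch.utils.checkpoint` in PyTorch) so the full 32,000-token sequence never has to be held in activation memory at once.

```python
def chunk_tokens(token_ids, chunk_size, overlap=0):
    """Split a long token-id sequence into fixed-size chunks.

    Illustrative helper (not from the paper): consecutive chunks may
    share `overlap` tokens so that context is not lost at boundaries.
    The final chunk may be shorter than `chunk_size`.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [token_ids[i:i + chunk_size]
            for i in range(0, len(token_ids), step)]


# A 32,000-token document split into non-overlapping 512-token
# chunks yields 63 chunks (62 full chunks plus a 256-token remainder).
chunks = chunk_tokens(list(range(32_000)), chunk_size=512)
```

In a full system, each chunk's encoder forward pass would be wrapped in a checkpointing call so that intermediate activations are recomputed during the backward pass instead of stored, trading compute for the memory needed to train on very long inputs.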