Adobe's Portable Document Format (PDF) is a popular way of distributing view-only documents with rich visual markup. The format presents a challenge to NLP practitioners who wish to use the information contained in PDF documents for model training or data analysis, because annotating these documents is difficult. In this paper, we present PDF Annotation with Labels and Structure (PAWLS), a new annotation tool designed specifically for the PDF document format. PAWLS is particularly suited to mixed-mode annotation and to scenarios in which annotators require extended context to annotate accurately. PAWLS supports span-based textual annotation, N-ary relations, and freeform, non-textual bounding boxes, all of which can be exported in convenient formats for training multi-modal machine learning models. A read-only PAWLS server is available at https://pawls.apps.allenai.org/ and the source code is available at https://github.com/allenai/pawls.
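To make the three annotation types concrete, the following is a minimal sketch of how span-based annotations, freeform bounding boxes, and N-ary relations could be represented and exported as JSON. It is an illustration only and does not reproduce the actual PAWLS export schema: the class names (`BoundingBox`, `Annotation`, `Relation`), their fields, the labels, and the `export_json` helper are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class BoundingBox:
    """Axis-aligned box on a single PDF page (hypothetical coordinate fields)."""
    page: int
    left: float
    top: float
    width: float
    height: float


@dataclass
class Annotation:
    """A labeled region. token_ids stays empty for freeform, non-textual boxes."""
    id: str
    label: str
    box: BoundingBox
    token_ids: List[int] = field(default_factory=list)

    @property
    def is_textual(self) -> bool:
        # A span-based annotation covers at least one text token.
        return len(self.token_ids) > 0


@dataclass
class Relation:
    """An N-ary relation that links any number of annotations by id."""
    label: str
    member_ids: List[str]


def export_json(annotations: List[Annotation], relations: List[Relation]) -> str:
    """Serialize annotations and relations into one JSON document."""
    payload = {
        "annotations": [asdict(a) for a in annotations],
        "relations": [asdict(r) for r in relations],
    }
    return json.dumps(payload, indent=2)


# Illustrative data: a textual caption span, a freeform figure box,
# and a relation connecting the two.
caption = Annotation("a1", "Caption", BoundingBox(0, 72.0, 500.0, 300.0, 24.0),
                     token_ids=[10, 11, 12])
figure = Annotation("a2", "Figure", BoundingBox(0, 72.0, 200.0, 300.0, 280.0))
rel = Relation("CaptionOf", ["a1", "a2"])

exported = export_json([caption, figure], [rel])
```

Keeping textual and non-textual annotations in one record type, distinguished only by whether any tokens are attached, is one simple way to let a relation mix both kinds, which is the mixed-mode case the abstract describes.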