A Data-Centric Framework for Composable NLP Workflows

Zhengzhong Liu, Guanxiong Ding, Avinash Bukkittu, Mansi Gupta, Pengzhi Gao, Atif Ahmed, Shikun Zhang, Xin Gao, Swapnil Singhavi, Linwei Li, Wei Wei, Zecong Hu, Haoran Shi, Xiaodan Liang, Teruko Mitamura, Eric P. Xing, Zhiting Hu

from arXiv, 8 pages, 4 figures, EMNLP 2020

Empirical natural language processing (NLP) systems in application domains (e.g., healthcare, finance, education) involve interoperation among multiple components, ranging from data ingestion and human annotation to text retrieval, analysis, generation, and visualization. We establish a unified open-source framework to support fast development of such sophisticated NLP workflows in a composable manner. The framework introduces a uniform data representation to encode the heterogeneous results produced by a wide range of NLP tasks. It offers a large repository of processors for NLP tasks, visualization, and annotation, which can be easily assembled with full interoperability under the unified representation. The highly extensible framework allows plugging in custom processors from external off-the-shelf NLP and deep learning libraries. The whole framework is delivered through two modularized yet integrable open-source projects, namely Forte (for workflow infrastructure and NLP function processors) and Stave (for user interaction, visualization, and annotation).
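
The composition idea described in the abstract can be illustrated with a minimal Python sketch. It is not Forte's actual API: the names `Annotation`, `DataPack`, `Pipeline`, `sentence_splitter`, and `tokenizer` are hypothetical and chosen only for illustration. The sketch assumes that a uniform representation amounts to raw text plus typed spans, and that processors interoperate by reading and writing only that shared container.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Annotation:
    """A typed character span; the only thing processors exchange."""
    kind: str   # e.g. "Sentence" or "Token"
    begin: int  # inclusive character offset
    end: int    # exclusive character offset


@dataclass
class DataPack:
    """Hypothetical uniform container: raw text plus all annotations."""
    text: str
    annotations: List[Annotation] = field(default_factory=list)

    def get(self, kind: str) -> List[Annotation]:
        # Returns a new list, so callers may add annotations while iterating.
        return [a for a in self.annotations if a.kind == kind]

    def span_text(self, ann: Annotation) -> str:
        return self.text[ann.begin:ann.end]


# A processor is any callable that enriches a DataPack in place.
Processor = Callable[[DataPack], None]


def sentence_splitter(pack: DataPack) -> None:
    """Naive splitter: a sentence runs up to the next '.', '!' or '?'."""
    for m in re.finditer(r"[^.!?\s][^.!?]*[.!?]?", pack.text):
        pack.annotations.append(Annotation("Sentence", m.start(), m.end()))


def tokenizer(pack: DataPack) -> None:
    """Depends only on "Sentence" annotations, not on who produced them."""
    for sent in pack.get("Sentence"):
        for m in re.finditer(r"\w+|[^\w\s]", pack.span_text(sent)):
            pack.annotations.append(
                Annotation("Token", sent.begin + m.start(), sent.begin + m.end())
            )


class Pipeline:
    """Runs processors in order over one shared DataPack."""

    def __init__(self) -> None:
        self._processors: List[Processor] = []

    def add(self, processor: Processor) -> "Pipeline":
        self._processors.append(processor)
        return self  # allow chained .add(...) calls

    def run(self, text: str) -> DataPack:
        pack = DataPack(text)
        for processor in self._processors:
            processor(pack)
        return pack


if __name__ == "__main__":
    pipeline = Pipeline().add(sentence_splitter).add(tokenizer)
    result = pipeline.run("Forte builds pipelines. Stave shows them.")
    for sentence in result.get("Sentence"):
        print(result.span_text(sentence))
```

Because each processor touches only the shared container, either step could be swapped for a wrapper around an external library without changing the other, provided the wrapper writes the same annotation types.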
