Data annotation is an essential task for NLP applications. Designing and implementing a web-based application that lets many annotators annotate data and enter their input into one central database is not trivial. Such applications require consistent and robust backups of the underlying database, as well as support for making annotation faster and more efficient. They also need to store annotations with minimal redundancy in order to make good use of the available resources (e.g., storage space). In this paper, we introduce WASA, a web-based annotation system for managing large-scale multilingual Code Switching (CS) data annotation. Although WASA can annotate any token sequence with arbitrary tag sets, we focus on how it is used for CS annotation. The system supports concurrent annotation, handles multiple encodings, allows for several levels of management control, and enables quality control measures while seamlessly reporting annotation statistics from various perspectives and at different levels of granularity. Moreover, the system is integrated with a robust language-specific data preprocessing tool to enhance the speed and efficiency of the annotation. We describe the annotation and administration interfaces as well as the backend engine.