Natural language processing researchers develop models of grammar, meaning and human communication based on written text. Due to task and data differences, what is considered text can vary substantially across studies. A conceptual framework for systematically capturing these differences is lacking. We argue that clarity on the notion of text is crucial for reproducible and generalizable NLP. Towards that goal, we propose common terminology to discuss the production and transformation of textual data, and introduce a two-tier taxonomy of linguistic and non-linguistic elements that are available in textual sources and can be used in NLP modeling. We apply this taxonomy to survey existing work that extends the notion of text beyond the conservative language-centered view. We outline key desiderata and challenges of the emerging inclusive approach to text in NLP, and suggest systematic community-level reporting as a crucial next step to consolidate the discussion.