This article emphasizes that NLP as a science seeks to make inferences about the performance effects that result from applying one method rather than another in the processing of natural language. Yet NLP research in practice usually does not achieve this goal: NLP research articles typically compare only a few models. Each model results from a specific procedural pipeline (here named a processing system) composed of a specific collection of methods used in preprocessing, pretraining, hyperparameter tuning, and training on the target task. To make generalizing inferences about the performance effect caused by applying some method A vs. another method B, it is not sufficient to compare a few specific models produced by a few specific (probably incomparable) processing systems. Rather, the following procedure would allow drawing inferences about methods' performance effects: (1) Define the population of processing systems to which researchers seek to generalize. (2) Draw a random sample of processing systems from this population. (The sampled processing systems will vary in the methods they apply along their procedural pipelines and in the composition of the training and test data sets used for training and evaluation.) (3) Apply each processing system once with method A and once with method B. (4) Based on the sample of applied processing systems, approximate the expected generalization errors of method A and method B. (5) Take the difference between the expected generalization errors of method A and method B as the estimated average treatment effect of applying method A compared to method B in the population of processing systems.
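The five-step procedure can be illustrated with a small simulation. The sketch below is only a toy under stated assumptions: the population of processing systems, the pipeline options, the error model in `run_system`, and the built-in effect of method A are all hypothetical placeholders, not anything specified by the article. In a real study, `run_system` would train and evaluate an actual model.

```python
import random
import statistics

def sample_processing_system(rng):
    """Step 2: draw one processing system from a toy, hypothetical population (step 1)."""
    return {
        "preprocessing": rng.choice(["lowercase", "none"]),
        "learning_rate": rng.choice([1e-5, 3e-5, 1e-4]),
        # Stands in for the composition of the training and test data sets.
        "data_seed": rng.randrange(10**6),
    }

def run_system(system, method, rng):
    """Toy stand-in for training and evaluating a model; returns a test error in [0, 1]."""
    base = 0.30 if system["preprocessing"] == "none" else 0.25
    if system["learning_rate"] == 1e-4:
        base += 0.02
    # The data composition shifts the error identically for both methods.
    data_shift = random.Random(system["data_seed"]).gauss(0.0, 0.02)
    # Assumed (hypothetical) true effect: method A lowers the error by 0.03.
    effect = -0.03 if method == "A" else 0.0
    noise = rng.gauss(0.0, 0.01)  # run-to-run variation
    return min(1.0, max(0.0, base + data_shift + effect + noise))

def estimate_ate(n_systems=200, seed=0):
    """Estimate the average treatment effect of method A vs. method B."""
    rng = random.Random(seed)
    systems = [sample_processing_system(rng) for _ in range(n_systems)]
    # Step 3: apply every sampled system once with A and once with B.
    errors_a = [run_system(s, "A", rng) for s in systems]
    errors_b = [run_system(s, "B", rng) for s in systems]
    # Step 4: approximate the expected generalization error of each method.
    mean_a = statistics.mean(errors_a)
    mean_b = statistics.mean(errors_b)
    # Step 5: the difference is the estimated average treatment effect.
    return mean_a, mean_b, mean_a - mean_b

if __name__ == "__main__":
    mean_a, mean_b, ate = estimate_ate()
    print(f"error(A)={mean_a:.4f}  error(B)={mean_b:.4f}  estimated ATE={ate:.4f}")
```

Because each sampled system is run with both methods, system-specific factors such as the data composition affect both error estimates equally, so the estimated difference reflects the method rather than one particular pipeline.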