The prototypical NLP experiment trains a standard architecture on labeled English data and optimizes for accuracy, without accounting for other dimensions such as fairness, interpretability, or computational efficiency. We show through a manual classification of recent NLP research papers that this is indeed the case and refer to it as the square one experimental setup. We observe that NLP research often goes beyond the square one setup, e.g., focusing not only on accuracy, but also on fairness or interpretability, but typically only along a single dimension. Most work targeting multilinguality, for example, considers only accuracy; most work on fairness or interpretability considers only English; and so on. We show this through manual classification of recent NLP research papers and ACL Test-of-Time award recipients. Such one-dimensionality of most research means we are only exploring a fraction of the NLP research search space. We provide historical and recent examples of how the square one bias has led researchers to draw false conclusions or make unwise choices, point to promising yet unexplored directions on the research manifold, and make practical recommendations to enable more multi-dimensional research. We open-source the results of our annotations to enable further analysis at https://github.com/google-research/url-nlp.