Despite its pivotal role in research experiments, code correctness is often presumed only on the basis of the perceived quality of the results. This comes with the risk of erroneous outcomes and potentially misleading findings. To address this issue, we posit that the current focus on result reproducibility should go hand in hand with an emphasis on coding best practices. We bolster our call to the NLP community by presenting a case study in which we identify (and correct) three bugs in widely used open-source implementations of the state-of-the-art Conformer architecture. Through comparative experiments on automatic speech recognition and translation in various language settings, we demonstrate that the existence of bugs does not prevent the achievement of good and reproducible results, yet can lead to incorrect conclusions that potentially misguide future research. In response, this study is a call to action toward the adoption of coding best practices aimed at fostering correctness and improving the quality of the developed software.
Reproducibility and Correctness Go Hand in Hand: The Importance of Testing Code in NLP