In recent years, the need for neutral benchmark studies that focus on the comparison of methods from computational sciences has been increasingly recognised by the scientific community. While general advice on the design and analysis of neutral benchmark studies can be found in recent literature, a certain amount of flexibility always remains. This flexibility concerns the choice of data sets and performance measures, the handling of missing performance values and the way the performance values are aggregated over the data sets. As a consequence, researchers may be concerned about how their choices affect the results or, in the worst case, may be tempted to engage in questionable research practices (e.g. the selective reporting of results or the post-hoc modification of design or analysis components) to fit their expectations or hopes. To raise awareness of this issue, we use an example benchmark study to illustrate how variable benchmark results can be when all possible combinations of a range of design and analysis options are considered. We then demonstrate how the impact of each choice on the results can be assessed using multidimensional unfolding. In conclusion, based on previous literature and on our illustrative example, we claim that the multiplicity of design and analysis options, combined with questionable research practices, leads to biased interpretations of benchmark results and to over-optimistic conclusions. This issue should be considered by computational researchers when designing and analysing their benchmark studies and by the scientific community in general in an effort towards more reliable benchmark results.
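The idea of considering all possible combinations of design and analysis options can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the benchmark study or the multidimensional unfolding analysis described above: the method names, data set names, error values, option labels and the functions `rank_methods` and `multiverse` are all hypothetical.

```python
from itertools import product
from statistics import mean, median

# Made-up error rates (lower is better); None marks a missing performance value.
scores = {
    "method_A": {"d1": 0.10, "d2": 0.30, "d3": None, "d4": 0.20},
    "method_B": {"d1": 0.15, "d2": 0.25, "d3": 0.20, "d4": 0.22},
    "method_C": {"d1": 0.12, "d2": 0.40, "d3": 0.18, "d4": 0.21},
}

# Hypothetical option sets for three of the choices named in the abstract.
DATASET_CHOICES = {
    "all": ["d1", "d2", "d3", "d4"],
    "without_d2": ["d1", "d3", "d4"],
}
MISSING_CHOICES = ["drop_dataset", "impute_worst"]
AGGREGATIONS = {"mean": mean, "median": median}


def rank_methods(scores, datasets, missing, aggregate):
    """Return the methods ordered from best to worst for one combination."""
    datasets = list(datasets)
    if missing == "drop_dataset":
        # Keep only data sets on which every method produced a value.
        datasets = [
            d for d in datasets
            if all(scores[m][d] is not None for m in scores)
        ]
    aggregated = {}
    for method, values in scores.items():
        filled = []
        for d in datasets:
            value = values[d]
            if value is None:
                # impute_worst: use the worst observed value on this data set.
                value = max(
                    scores[other][d]
                    for other in scores
                    if scores[other][d] is not None
                )
            filled.append(value)
        aggregated[method] = aggregate(filled)
    return tuple(sorted(aggregated, key=aggregated.get))


def multiverse(scores):
    """Rank the methods under every combination of the option sets."""
    results = {}
    for ds_name, miss, agg_name in product(
        DATASET_CHOICES, MISSING_CHOICES, AGGREGATIONS
    ):
        results[(ds_name, miss, agg_name)] = rank_methods(
            scores, DATASET_CHOICES[ds_name], miss, AGGREGATIONS[agg_name]
        )
    return results


results = multiverse(scores)
for combination, ranking in results.items():
    print(combination, "->", ranking)
```

Even with this toy input, the ranking of the methods differs between combinations, which is the kind of variability that selective reporting could exploit.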