Systematic Misestimation of Machine Learning Performance in Neuroimaging Studies of Depression

Claas Flint, Micah Cearns, Nils Opel, Ronny Redlich, David M. A. Mehler, Daniel Emden, Nils R. Winter, Ramona Leenings, Simon B. Eickhoff, Tilo Kircher, Axel Krug, Igor Nenadic, Volker Arolt, Scott Clark, Bernhard T. Baune, Xiaoyi Jiang, Udo Dannlowski, Tim Hahn

We currently observe a disconcerting phenomenon in machine learning studies in psychiatry: while we would expect larger samples to yield better results due to the availability of more data, larger machine learning studies consistently show much weaker performance than the numerous small-scale studies. Here, we systematically investigated this effect, focusing on one of the most heavily studied questions in the field, namely the classification of patients suffering from major depressive disorder (MDD) and healthy controls (HC) based on neuroimaging data. Drawing upon structural magnetic resonance imaging (MRI) data from a balanced sample of $N = 1{,}868$ MDD patients and HC from our recent international Predictive Analytics Competition (PAC), we first trained and tested a classification model on the full dataset, which yielded an accuracy of $61\,\%$. Next, we mimicked the process by which researchers would draw samples of various sizes ($N = 4$ to $N = 150$) from the population and showed a strong risk of misestimation. Specifically, for small sample sizes ($N = 20$), we observed accuracies of up to $95\,\%$. For medium sample sizes ($N = 100$), accuracies of up to $75\,\%$ were found. Importantly, further investigation showed that sufficiently large test sets effectively protect against performance misestimation, whereas larger datasets per se do not. While these results question the validity of a substantial part of the current literature, we outline the relatively low-cost remedy of larger test sets, which is readily available in most cases.
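The link between test set size and misestimation can be illustrated with a small simulation. The sketch below is not the authors' resampling procedure on the PAC data. It assumes a deliberately simplified model in which a classifier has a fixed true accuracy (set here to the reported $61\,\%$) and every test prediction is an independent Bernoulli trial. The function name `simulate_test_accuracies`, the number of draws, and the seed are illustrative choices, not taken from the paper.

```python
import numpy as np


def simulate_test_accuracies(true_acc, n_test, n_draws=10_000, seed=0):
    """Simulate observed accuracies on repeated test sets of size n_test.

    Assumption (not from the paper): each test prediction is an independent
    Bernoulli trial that is correct with probability true_acc, so the number
    of correct predictions per test set is binomially distributed.
    """
    rng = np.random.default_rng(seed)
    n_correct = rng.binomial(n=n_test, p=true_acc, size=n_draws)
    return n_correct / n_test


# Compare the spread of observed accuracies across test set sizes.
for n_test in (4, 20, 100, 150, 1868):
    acc = simulate_test_accuracies(true_acc=0.61, n_test=n_test)
    low, high = np.percentile(acc, [2.5, 97.5])
    print(
        f"n_test={n_test:5d}  "
        f"central 95% of draws: [{low:.2f}, {high:.2f}]  "
        f"max observed: {acc.max():.2f}"
    )
```

Under this binomial assumption the standard error of an observed accuracy is $\sqrt{p(1-p)/n}$ for true accuracy $p$ and test set size $n$, so the spread shrinks only with the square root of the test set size. With small test sets, some draws land far above the true accuracy purely by chance, which is the kind of optimistic misestimation the abstract describes.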
