Metrics reloaded: Pitfalls and recommendations for image analysis validation

Lena Maier-Hein, Annika Reinke, Evangelia Christodoulou, Ben Glocker, Patrick Godau, Fabian Isensee, Jens Kleesiek, Michal Kozubek, Mauricio Reyes, Michael A. Riegler, Manuel Wiesenfarth, Michael Baumgartner, Matthias Eisenmann, Doreen Heckmann-Nötzel, A. Emre Kavur, Tim Rädsch, Minu D. Tizabi, Laura Acion, Michela Antonelli, Tal Arbel, Spyridon Bakas, Peter Bankhead, Arriel Benis, M. Jorge Cardoso, Veronika Cheplygina, Beth Cimini, Gary S. Collins, Keyvan Farahani, Bram van Ginneken, Daniel A. Hashimoto, Michael M. Hoffman, Merel Huisman, Pierre Jannin, Charles E. Kahn, Alexandros Karargyris, Alan Karthikesalingam, Hannes Kenngott, Annette Kopp-Schneider, Anna Kreshuk, Tahsin Kurc, Bennett A. Landman, Geert Litjens, Amin Madani, Klaus Maier-Hein, Anne L. Martel, Peter Mattson, Erik Meijering, Bjoern Menze, David Moher, Karel G. M. Moons, Henning Müller, Felix Nickel, Brennan Nichyporuk, Jens Petersen, Nasir Rajpoot, Nicola Rieke, Julio Saez-Rodriguez, Clarisa Sánchez Gutiérrez, Shravya Shetty, Maarten van Smeden, Carole H. Sudre, Ronald M. Summers, Abdel A. Taha, Sotirios A. Tsaftaris, Ben Van Calster, Gaël Varoquaux, Paul F. Jäger

Source: arXiv. Shared first authors: Lena Maier-Hein, Annika Reinke. arXiv admin note: substantial text overlap with arXiv:2104.05642

The field of automatic biomedical image analysis crucially depends on robust and meaningful performance metrics for algorithm validation. Current metric usage, however, is often ill-informed and does not reflect the underlying domain interest. Here, we present a comprehensive framework that guides researchers towards choosing performance metrics in a problem-aware manner. Specifically, we focus on biomedical image analysis problems that can be interpreted as a classification task at image, object or pixel level. The framework first compiles domain interest-, target structure-, data set- and algorithm output-related properties of a given problem into a problem fingerprint, while also mapping it to the appropriate problem category, namely image-level classification, semantic segmentation, instance segmentation, or object detection. It then guides users through the process of selecting and applying a set of appropriate validation metrics while making them aware of potential pitfalls related to individual choices. In this paper, we describe the current status of the Metrics Reloaded recommendation framework, with the goal of obtaining constructive feedback from the image analysis community. The current version has been developed within an international consortium of more than 60 image analysis experts and will be made openly available as a user-friendly toolkit after community-driven optimization.
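The first step described in the abstract, compiling a problem fingerprint and mapping it to one of the four problem categories, could be sketched roughly as follows. This is only an illustration: the class `ProblemFingerprint`, its two fields, and the branching rule in `map_to_category` are hypothetical names and assumptions made for this sketch. They are not the framework's actual fingerprint items or decision logic, which the abstract does not spell out.

```python
from dataclasses import dataclass
from enum import Enum


class ProblemCategory(Enum):
    # The four categories named in the abstract.
    IMAGE_LEVEL_CLASSIFICATION = "image-level classification"
    SEMANTIC_SEGMENTATION = "semantic segmentation"
    INSTANCE_SEGMENTATION = "instance segmentation"
    OBJECT_DETECTION = "object detection"


@dataclass(frozen=True)
class ProblemFingerprint:
    # Hypothetical, heavily reduced fingerprint: the real one also covers
    # domain interest, target structure, data set and algorithm output.
    decision_level: str                  # "image", "object" or "pixel"
    needs_object_outlines: bool = False  # assumption: only used at object level


def map_to_category(fp: ProblemFingerprint) -> ProblemCategory:
    """Map a fingerprint to a problem category (illustrative rule only)."""
    if fp.decision_level == "image":
        return ProblemCategory.IMAGE_LEVEL_CLASSIFICATION
    if fp.decision_level == "pixel":
        return ProblemCategory.SEMANTIC_SEGMENTATION
    if fp.decision_level == "object":
        # Assumed split: outlines required -> instance segmentation,
        # otherwise locating the objects is enough -> object detection.
        if fp.needs_object_outlines:
            return ProblemCategory.INSTANCE_SEGMENTATION
        return ProblemCategory.OBJECT_DETECTION
    raise ValueError(f"unknown decision level: {fp.decision_level!r}")
```

For example, `map_to_category(ProblemFingerprint("object"))` would yield `ProblemCategory.OBJECT_DETECTION` under these assumptions. Metric selection, the second step of the framework, would then be conditioned on both the category and the remaining fingerprint properties.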
