Understanding metric-related pitfalls in image analysis validation

Annika Reinke, Minu D. Tizabi, Michael Baumgartner, Matthias Eisenmann, Doreen Heckmann-Nötzel, A. Emre Kavur, Tim Rädsch, Carole H. Sudre, Laura Acion, Michela Antonelli, Tal Arbel, Spyridon Bakas, Arriel Benis, Matthew Blaschko, Florian Büttner, M. Jorge Cardoso, Veronika Cheplygina, Jianxu Chen, Evangelia Christodoulou, Beth A. Cimini, Gary S. Collins, Keyvan Farahani, Luciana Ferrer, Adrian Galdran, Bram van Ginneken, Ben Glocker, Patrick Godau, Robert Haase, Daniel A. Hashimoto, Michael M. Hoffman, Merel Huisman, Fabian Isensee, Pierre Jannin, Charles E. Kahn, Dagmar Kainmueller, Bernhard Kainz, Alexandros Karargyris, Alan Karthikesalingam, Hannes Kenngott, Jens Kleesiek, Florian Kofler, Thijs Kooi, Annette Kopp-Schneider, Michal Kozubek, Anna Kreshuk, Tahsin Kurc, Bennett A. Landman, Geert Litjens, Amin Madani, Klaus Maier-Hein, Anne L. Martel, Peter Mattson, Erik Meijering, Bjoern Menze, Karel G. M. Moons, Henning Müller, Brennan Nichyporuk, Felix Nickel, Jens Petersen, Susanne M. Rafelski, Nasir Rajpoot, Mauricio Reyes, Michael A. Riegler, Nicola Rieke, Julio Saez-Rodriguez, Clara I. Sánchez, Shravya Shetty, Maarten van Smeden, Ronald M. Summers, Abdel A. Taha, Aleksei Tiulpin, Sotirios A. Tsaftaris, Ben Van Calster, Gaël Varoquaux, Manuel Wiesenfarth, Ziv R. Yaniv, Paul F. Jäger, Lena Maier-Hein

Validation metrics are key for the reliable tracking of scientific progress and for bridging the current chasm between artificial intelligence (AI) research and its translation into practice. However, increasing evidence shows that particularly in image analysis, metrics are often chosen inadequately in relation to the underlying research problem. This could be attributed to a lack of accessibility of metric-related knowledge: While taking into account the individual strengths, weaknesses, and limitations of validation metrics is a critical prerequisite to making educated choices, the relevant knowledge is currently scattered and poorly accessible to individual researchers. Based on a multi-stage Delphi process conducted by a multidisciplinary expert consortium as well as extensive community feedback, the present work provides the first reliable and comprehensive common point of access to information on pitfalls related to validation metrics in image analysis. Focusing on biomedical image analysis but with the potential of transfer to other fields, the addressed pitfalls generalize across application domains and are categorized according to a newly created, domain-agnostic taxonomy. To facilitate comprehension, illustrations and specific examples accompany each pitfall. As a structured body of information accessible to researchers of all levels of expertise, this work enhances global comprehension of a key topic in image analysis validation.
