Common Limitations of Image Processing Metrics: A Picture Story

Annika Reinke, Minu D. Tizabi, Carole H. Sudre, Matthias Eisenmann, Tim Rädsch, Michael Baumgartner, Laura Acion, Michela Antonelli, Tal Arbel, Spyridon Bakas, Peter Bankhead, Arriel Benis, M. Jorge Cardoso, Veronika Cheplygina, Evangelia Christodoulou, Beth Cimini, Gary S. Collins, Keyvan Farahani, Bram van Ginneken, Ben Glocker, Patrick Godau, Fred Hamprecht, Daniel A. Hashimoto, Doreen Heckmann-Nötzel, Michael M. Hoffmann, Merel Huisman, Fabian Isensee, Pierre Jannin, Charles E. Kahn, Alexandros Karargyris, Alan Karthikesalingam, Bernhard Kainz, Emre Kavur, Hannes Kenngott, Jens Kleesiek, Thijs Kooi, Michal Kozubek, Anna Kreshuk, Tahsin Kurc, Bennett A. Landman, Geert Litjens, Amin Madani, Klaus Maier-Hein, Anne L. Martel, Peter Mattson, Erik Meijering, Bjoern Menze, David Moher, Karel G. M. Moons, Henning Müller, Brennan Nichyporuk, Felix Nickel, Jens Petersen, Gorkem Polat, Nasir Rajpoot, Mauricio Reyes, Nicola Rieke, Michael Riegler, Hassan Rivaz, Julio Saez-Rodriguez, Clarisa Sanchez Gutierrez, Julien Schroeter, Anindo Saha, Shravya Shetty, Maarten van Smeden, Bram Stieltjes, Ronald M. Summers, Abdel A. Taha, Sotirios A. Tsaftaris, Ben Van Calster, Gaël Varoquaux, Manuel Wiesenfarth, Ziv R. Yaniv, Annette Kopp-Schneider, Paul Jäger, Lena Maier-Hein

From arXiv. This is a dynamic paper on the limitations of commonly used metrics. The current version discusses metrics for image-level classification, semantic segmentation, object detection, and instance segmentation. For missing use cases, comments, or questions, please contact a.reinke@dkfz.de or l.maier-hein@dkfz.de. Substantial contributions to this document will be acknowledged with a co-authorship.

While the importance of automatic image analysis is continuously increasing, recent meta-research has revealed major flaws in algorithm validation. Performance metrics are key to meaningful, objective, and transparent performance assessment and validation of the automatic algorithms used, but relatively little attention has been given to the practical pitfalls of using specific metrics for a given image analysis task. These pitfalls are typically related to (1) the disregard of inherent metric properties, such as the behaviour in the presence of class imbalance or small target structures, (2) the disregard of inherent data set properties, such as the non-independence of the test cases, and (3) the disregard of the actual biomedical domain interest that the metrics should reflect. This dynamically updated document illustrates important limitations of performance metrics commonly applied in the field of image analysis. It focuses on biomedical image analysis problems that can be phrased as image-level classification, semantic segmentation, instance segmentation, or object detection tasks. The current version is based on a Delphi process on metrics conducted by an international consortium of image analysis experts from more than 60 institutions worldwide.

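To make pitfall (1) concrete, the following is a minimal Python sketch, not taken from the paper. The choice of accuracy and the Dice similarity coefficient as example metrics, the helper names `accuracy`, `sensitivity`, `dice`, and `square`, and all numbers are assumptions made purely for illustration. It shows how a classifier that never predicts the rare class can still reach a high accuracy, and how the same one-pixel localisation error is penalised far more heavily for a small target structure than for a large one.

```python
def accuracy(y_true, y_pred):
    """Fraction of cases whose predicted label equals the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def sensitivity(y_true, y_pred):
    """Fraction of positive reference cases that were predicted as positive."""
    positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(p == 1 for p in positives) / len(positives)


def dice(ref, pred):
    """Dice similarity coefficient between two sets of foreground pixel coordinates."""
    if not ref and not pred:
        return 1.0  # convention assumed here for two empty masks
    return 2 * len(ref & pred) / (len(ref) + len(pred))


def square(side, col_offset=0):
    """Set of (row, col) pixels of a side x side square, shifted by col_offset columns."""
    return {(r, c + col_offset) for r in range(side) for c in range(side)}


# Class imbalance (made-up data): 95 negative cases, 5 positive cases.
y_true = [0] * 95 + [1] * 5
always_negative = [0] * 100  # a "classifier" that never predicts the rare class

acc = accuracy(y_true, always_negative)      # high despite missing every positive
sens = sensitivity(y_true, always_negative)  # reveals that no positive was found

# Small target structures: the prediction is the reference shifted by one pixel.
dice_small = dice(square(2), square(2, col_offset=1))    # 2 x 2 structure
dice_large = dice(square(20), square(20, col_offset=1))  # 20 x 20 structure

print(f"accuracy={acc:.2f}, sensitivity={sens:.2f}")
print(f"Dice small={dice_small:.2f}, Dice large={dice_large:.2f}")
```

In this toy setting, accuracy alone would suggest a well-performing classifier, while sensitivity shows that every positive case was missed. Likewise, an identical one-pixel shift yields a much lower Dice score for the small structure than for the large one, so scores are not directly comparable across structures of different sizes.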