Medical Imaging AI Competitions Lack Fairness

Annika Reinke, Evangelia Christodoulou, Sthuthi Sadananda, A. Emre Kavur, Khrystyna Faryna, Daan Schouten, Bennett A. Landman, Carole Sudre, Olivier Colliot, Nick Heller, Sophie Loizillon, Martin Maška, Maëlys Solal, Arya Yazdan-Panah, Vilma Bozgo, Ömer Sümer, Siem de Jong, Sophie Fischer, Michal Kozubek, Tim Rädsch, Nadim Hammoud, Fruzsina Molnár-Gábor, Steven Hicks, Michael A. Riegler, Anindo Saha, Vajira Thambawita, Pal Halvorsen, Amelia Jiménez-Sánchez, Qingyang Yang, Veronika Cheplygina, Sabrina Bottazzi, Alexander Seitel, Spyridon Bakas, Alexandros Karargyris, Kiran Vaidhya Venkadesh, Bram van Ginneken, Lena Maier-Hein

From arXiv; submitted to Nature BME

Benchmarking competitions are central to the development of artificial intelligence (AI) in medical imaging, defining performance standards and shaping methodological progress. However, it remains unclear whether these benchmarks provide data that are sufficiently representative, accessible, and reusable to support clinically meaningful AI. In this work, we assess fairness along two complementary dimensions: (1) whether challenge datasets are representative of real-world clinical diversity, and (2) whether they are accessible and legally reusable in line with the FAIR principles. To address these questions, we conducted a large-scale systematic study of 241 biomedical image analysis challenges comprising 458 tasks across 19 imaging modalities. Our findings show substantial biases in dataset composition with respect to geographic location, imaging modality, and problem type, indicating that current benchmarks do not adequately reflect real-world clinical diversity. Despite their widespread influence, challenge datasets were frequently constrained by restrictive or ambiguous access conditions, inconsistent or non-compliant licensing practices, and incomplete documentation, limiting reproducibility and long-term reuse. Together, these shortcomings expose foundational fairness limitations in our benchmarking ecosystem and highlight a disconnect between leaderboard success and clinical relevance.
