Background: Development of artificial intelligence (AI) models for lung cancer screening requires large, well-annotated low-dose computed tomography (CT) datasets and rigorous performance benchmarks.

Purpose: To create a reproducible benchmarking resource leveraging the Duke Lung Cancer Screening (DLCS) and multiple public datasets to develop and evaluate models for nodule detection and classification.

Materials & Methods: This retrospective study used the DLCS dataset (1,613 patients; 2,487 nodules) and external datasets including LUNA16, LUNA25, and NLST-3D. For detection, MONAI RetinaNet models were trained on DLCS (DLCS-De) and LUNA16 (LUNA16-De) and evaluated using the Competition Performance Metric (CPM). For nodule-level classification, we compared five strategies: pretrained models (Models Genesis, Med3D), a self-supervised foundation model (FMCB), and ResNet50 with random initialization versus Strategic Warm-Start (ResNet50-SWS) pretrained with detection-derived candidate patches stratified by confidence.

Results: For detection on the DLCS test set, DLCS-De achieved sensitivity 0.82 at 2 false positives/scan (CPM 0.63) versus LUNA16-De (sensitivity 0.62, CPM 0.45). For external validation on NLST-3D, DLCS-De (sensitivity 0.72, CPM 0.58) also outperformed LUNA16-De (sensitivity 0.64, CPM 0.49). For classification across multiple datasets, ResNet50-SWS attained AUCs of 0.71 (DLCS; 95% CI, 0.61-0.81), 0.90 (LUNA16; 0.87-0.93), 0.81 (NLST-3D; 0.79-0.82), and 0.80 (LUNA25; 0.78-0.82), matching or exceeding the pretrained and self-supervised baselines. Performance differences across datasets reflected differences in labeling standards.

Conclusion: This work establishes a standardized benchmarking resource for lung cancer AI research, supporting model development, validation, and translation. All code, models, and data are publicly released to promote reproducibility.
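The CPM used to score the detection models is the LUNA16 challenge metric: the average sensitivity on the FROC curve at seven operating points (1/8, 1/4, 1/2, 1, 2, 4, and 8 false positives per scan). A minimal sketch of that computation, using purely illustrative FROC values (not results from this study):

```python
import numpy as np

# Hypothetical FROC curve: false positives per scan (ascending) and the
# corresponding detection sensitivity. In practice these points come from
# sweeping the detector's confidence threshold over a test set.
fps_per_scan = np.array([0.10, 0.125, 0.25, 0.50, 1.0, 2.0, 4.0, 8.0])
sensitivity = np.array([0.40, 0.45, 0.55, 0.66, 0.75, 0.82, 0.87, 0.90])

def competition_performance_metric(fps, sens):
    """CPM: mean sensitivity at 1/8, 1/4, 1/2, 1, 2, 4, and 8 FPs/scan."""
    operating_points = [1 / 8, 1 / 4, 1 / 2, 1, 2, 4, 8]
    # Linearly interpolate the FROC curve at each operating point,
    # then average the seven interpolated sensitivities.
    return float(np.mean(np.interp(operating_points, fps, sens)))

cpm = competition_performance_metric(fps_per_scan, sensitivity)
print(round(cpm, 3))  # → 0.714
```

Reporting sensitivity at a single point (e.g., 2 FPs/scan, as in the Results) summarizes one clinically relevant trade-off, while CPM summarizes the whole low-false-positive regime of the FROC curve.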