Self-supervised learning (SSL) has advanced visual representation learning, but its value in chest radiography, a high-volume imaging modality with fine-grained findings, remains unclear. Meta's DINOv3 extends earlier SSL models through Gram-anchored self-distillation. Whether these design choices improve transfer learning for chest radiography has not been systematically tested. We benchmarked DINOv3 against DINOv2 and ImageNet initialization across seven datasets (n>814,000). Two representative backbones were evaluated: ViT-B/16 and ConvNeXt-B. Images were analyzed at 224x224, 512x512, and 1024x1024 pixels. We additionally assessed frozen features from a 7B-parameter model. The primary outcome was mean AUROC across labels. At 224x224, DINOv3 and DINOv2 achieved comparable performance on adult datasets. Increasing resolution to 512x512 yielded consistent improvements for DINOv3 over both DINOv2 and ImageNet. In contrast, results in the pediatric cohort showed no differences across initializations. Across all settings, ConvNeXt-B outperformed ViT-B/16. Models using frozen DINOv3-7B features underperformed relative to fully finetuned 86-89M-parameter backbones, highlighting the importance of domain adaptation. Scaling to 1024x1024 did not further improve accuracy. Resolution-related gains were most evident for boundary-dependent and small focal abnormalities. In chest radiography, higher input resolution is critical for leveraging the benefits of modern self-supervised models. 512x512 pixels represents a practical upper limit at which DINOv3-initialized ConvNeXt-B networks provide the strongest performance, while larger inputs offer minimal return on cost. Clinically, these findings support the use of finetuned, mid-sized backbones at 512x512 for chest radiograph interpretation, with the greatest gains expected in detecting subtle or boundary-centered lesions relevant to emergency and critical care settings.
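The primary outcome, mean AUROC across labels, can be illustrated with a minimal sketch. This is not the paper's code; it is a dependency-free Python implementation of per-label AUROC via the Mann-Whitney U rank formulation (with average ranks for ties), averaged over labels as in a multi-label chest radiograph evaluation.

```python
def auroc(y_true, y_score):
    """AUROC for one binary label via the Mann-Whitney U statistic.

    Equivalent to the probability that a randomly chosen positive
    receives a higher score than a randomly chosen negative; ties
    get average ranks, matching standard AUROC implementations.
    """
    order = sorted(range(len(y_score)), key=lambda i: y_score[i])
    ranks = [0.0] * len(y_score)
    i = 0
    while i < len(order):
        # Find the run of tied scores and assign the average 1-based rank.
        j = i
        while j + 1 < len(order) and y_score[order[j + 1]] == y_score[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        return float("nan")  # AUROC undefined for a single-class label
    rank_sum = sum(r for r, t in zip(ranks, y_true) if t == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def mean_auroc(y_true_by_label, y_score_by_label):
    """Macro-average AUROC over labels (the abstract's primary outcome)."""
    scores = [auroc(t, s) for t, s in zip(y_true_by_label, y_score_by_label)]
    return sum(scores) / len(scores)
```

Perfect separation yields 1.0, inverted ranking 0.0, and fully tied scores 0.5; the macro average weights every label equally regardless of prevalence, which is why rare small focal abnormalities contribute as much to the metric as common findings.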


