AI for tumor segmentation is limited by the lack of large, voxel-wise annotated datasets, which are hard to create and require medical experts. In our proprietary JHH dataset of 3,000 annotated pancreatic tumor scans, we found that AI performance stopped improving after 1,500 scans. With synthetic data, we reached the same performance using only 500 real scans. This finding suggests that synthetic data can steepen data scaling laws, enabling more efficient model training than training on real data alone. Motivated by these lessons, we created AbdomenAtlas 2.0, a dataset of 10,135 CT scans with 15,130 tumor instances manually annotated per voxel in six organs (pancreas, liver, kidney, colon, esophagus, and uterus), and 5,893 control scans. Annotated by 23 expert radiologists, it is several orders of magnitude larger than existing public tumor datasets. While we continue to expand the dataset, the current version of AbdomenAtlas 2.0 already provides a strong foundation, based on lessons from the JHH dataset, for training AI to segment tumors in six organs. Models trained on it achieve notable improvements over models trained on public datasets, with a +7% DSC gain on in-distribution tests and +16% on out-of-distribution tests.
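The reported gains are measured in DSC (Dice similarity coefficient), which compares a predicted tumor mask against the annotated one. The following is a minimal sketch of the standard definition, DSC = 2|P ∩ T| / (|P| + |T|), and not the evaluation code used for AbdomenAtlas 2.0. The function name `dice_score`, the flat 0/1 mask representation, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration.

```python
def dice_score(pred, target):
    """Dice similarity coefficient between two binary masks.

    Both masks are flat sequences of 0/1 (or False/True) voxel labels
    of equal length. This flat representation is an illustrative
    assumption; a 3D volume would be flattened before calling this.
    """
    pred = [bool(v) for v in pred]
    target = [bool(v) for v in target]
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of voxels")

    # Voxels labeled as tumor in both the prediction and the annotation.
    intersection = sum(p and t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)

    # Assumed convention: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    annotated = [1, 0, 0, 0]
    print(f"DSC = {dice_score(predicted, annotated):.3f}")
```

A score of 1.0 means the predicted and annotated masks coincide exactly, and 0.0 means they share no tumor voxels.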