Scarcity of labeled histopathology data limits the applicability of deep learning methods to under-profiled cancer types and labels. Transfer learning allows researchers to overcome the limitations of small datasets by pre-training machine learning models on larger datasets \emph{similar} to the small target dataset. However, similarity between datasets is often determined heuristically. In this paper, we propose a principled notion of distance between histopathology datasets based on a hierarchical generalization of optimal transport distances. Our method does not require any training, is agnostic to model type, and preserves much of the hierarchical structure in histopathology datasets imposed by tiling. We apply our method to H\&E-stained slides from The Cancer Genome Atlas spanning six cancer types. We show that our method outperforms a baseline distance in a cancer-type prediction task. Our results also show that our optimal transport distance predicts the difficulty of transfer in a tumor vs.~normal prediction setting.
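To make the idea of a hierarchical optimal transport distance concrete, the following is a minimal sketch and not the implementation used in this paper. It assumes each slide is represented as an array of tile feature vectors, uses squared Euclidean distance between tiles, and swaps in entropic (Sinkhorn) optimal transport as a stand-in solver at both levels. All function names and parameters (`sinkhorn_cost`, `slide_distance`, `dataset_distance`, `reg`, `n_iter`) are hypothetical.

```python
import numpy as np


def sinkhorn_cost(cost, reg=0.1, n_iter=200):
    """Approximate OT cost between two uniform distributions (entropic OT).

    `reg` is relative to the largest entry of `cost` (an assumption made
    here for numerical stability, not something taken from the paper).
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # uniform weights on the rows
    b = np.full(m, 1.0 / m)  # uniform weights on the columns
    scale = cost.max() if cost.max() > 0 else 1.0
    kernel = np.exp(-cost / (reg * scale))
    u = np.ones(n)
    for _ in range(n_iter):
        v = b / (kernel.T @ u)
        u = a / (kernel @ v)
    plan = u[:, None] * kernel * v[None, :]
    return float((plan * cost).sum())


def slide_distance(tiles_a, tiles_b):
    """Inner level: OT between the tile features of two slides."""
    diff = tiles_a[:, None, :] - tiles_b[None, :, :]
    tile_cost = (diff ** 2).sum(axis=-1)  # squared Euclidean tile cost
    return sinkhorn_cost(tile_cost)


def dataset_distance(slides_a, slides_b):
    """Outer level: OT between datasets, with slide-to-slide OT as ground cost."""
    ground = np.array(
        [[slide_distance(sa, sb) for sb in slides_b] for sa in slides_a]
    )
    return sinkhorn_cost(ground)
```

Here `slides_a` and `slides_b` would each be a list of arrays of shape `(n_tiles, n_features)`. The two-level structure mirrors the tiling hierarchy: tiles are matched within pairs of slides, and slides are then matched across datasets. No model is trained at any point.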