Scarcity of labeled histopathology data limits the applicability of deep learning methods to under-profiled cancer types and labels. Transfer learning allows researchers to overcome the limitations of small datasets by pre-training machine learning models on larger datasets that are similar to the small target dataset. However, similarity between datasets is often determined heuristically. In this paper, we propose a principled notion of distance between histopathology datasets based on a hierarchical generalization of optimal transport distances. Our method does not require any training, is agnostic to model type, and preserves much of the hierarchical structure that tiling imposes on histopathology datasets. We apply our method to H&E-stained slides from The Cancer Genome Atlas covering six different cancer types. We show that our method outperforms a baseline distance in a cancer-type prediction task. Our results also show that our optimal transport distance predicts the difficulty of transfer in a tumor vs. normal prediction setting.
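To make the idea of a hierarchical optimal transport distance concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each slide is represented by a matrix of tile embeddings from some fixed feature extractor, that tiles and slides carry uniform weights, and that the ground cost between tiles is squared Euclidean distance. The function names `ot_cost`, `slide_distance`, and `dataset_distance` are hypothetical. An inner optimal transport problem compares the tiles of two slides, and an outer problem uses those slide-to-slide costs to compare two datasets.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist


def ot_cost(a, b, cost):
    """Exact optimal transport cost between weight vectors a and b (solved as an LP)."""
    n, m = cost.shape
    # Marginal constraints on the row-major flattened transport plan:
    # row i of the plan must sum to a[i], column j must sum to b[j].
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        A_eq[n + j, j::m] = 1.0
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.fun


def uniform(k):
    """Uniform weights over k items (an assumption of this sketch)."""
    return np.full(k, 1.0 / k)


def slide_distance(tiles_x, tiles_y):
    """Inner level: OT cost between the tile embeddings of two slides."""
    cost = cdist(tiles_x, tiles_y, metric="sqeuclidean")
    return ot_cost(uniform(len(tiles_x)), uniform(len(tiles_y)), cost)


def dataset_distance(slides_a, slides_b):
    """Outer level: OT cost between datasets, with slide-to-slide OT as ground cost."""
    cost = np.array([[slide_distance(x, y) for y in slides_b] for x in slides_a])
    return ot_cost(uniform(len(slides_a)), uniform(len(slides_b)), cost)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for tile embeddings: 3 slides, 5 tiles each, 4 dimensions.
    source = [rng.normal(size=(5, 4)) for _ in range(3)]
    target = [s + 2.0 for s in source]  # the same slides with shifted features
    print("source vs. source:", dataset_distance(source, source))
    print("source vs. target:", dataset_distance(source, target))
```

Solving each problem as a dense linear program keeps the sketch dependency-light but scales poorly with the number of tiles. A dedicated optimal transport solver or an entropic approximation would be the more practical choice for real slides.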