Gene expression can be used to subtype breast cancer, predicting risk of recurrence and treatment responsiveness better than routine immunohistochemistry (IHC). In the clinic, however, molecular profiling is used primarily for ER+ cancer; it is costly, destroys tissue, requires specialized platforms, and takes several weeks to return a result. Deep learning algorithms can effectively extract morphological patterns from digital histopathology images to predict molecular phenotypes quickly and cost-effectively. We propose a new, computationally efficient approach called hist2RNA, inspired by bulk RNA-sequencing techniques, to predict the expression of 138 genes (drawn from six commercially available molecular profiling tests), including luminal PAM50 subtype, from hematoxylin and eosin (H&E) stained whole slide images (WSIs). During training, features extracted by a pretrained model are aggregated for each patient and used to predict gene expression at the patient level, using annotated H&E images from The Cancer Genome Atlas (TCGA, n=335). We demonstrate successful gene prediction on a held-out test set (n=160, corr=0.82 across patients, corr=0.29 across genes) and perform exploratory analysis on an external tissue microarray (TMA) dataset (n=498) with known IHC and survival information. On the TMA dataset, our model predicts gene expression and luminal PAM50 subtype (Luminal A versus Luminal B) with prognostic significance for overall survival in univariate analysis (c-index=0.56, hazard ratio=2.16, p<0.005), and with independent significance in multivariate analysis incorporating standard clinicopathological variables (c-index=0.65, hazard ratio=1.85, p<0.005).
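The abstract reports two things that a short sketch may make more concrete: per-patient aggregation of extracted features, and correlation summarized "across patients" versus "across genes". The Python below is an illustration only, not the hist2RNA implementation. It assumes mean pooling as the aggregation, Pearson correlation as the metric, and the median as the summary statistic, none of which the abstract specifies. It also assumes that "across patients" means one correlation per patient computed over that patient's genes, and "across genes" means one correlation per gene computed over patients. The function names `aggregate_by_patient` and `correlation_summary` are hypothetical.

```python
import numpy as np


def aggregate_by_patient(tile_features, patient_ids):
    """Mean-pool tile-level feature vectors into one vector per patient.

    Assumption: mean pooling stands in for whatever aggregation the
    authors actually use.
    """
    tile_features = np.asarray(tile_features, dtype=float)
    patient_ids = np.asarray(patient_ids)
    patients = np.unique(patient_ids)  # sorted unique patient labels
    pooled = np.stack(
        [tile_features[patient_ids == p].mean(axis=0) for p in patients]
    )
    return patients, pooled


def _rowwise_pearson(a, b):
    """Pearson correlation between matching rows of a and b.

    Rows with zero variance produce NaN (division by zero).
    """
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    denom = np.sqrt((a ** 2).sum(axis=1) * (b ** 2).sum(axis=1))
    return (a * b).sum(axis=1) / denom


def correlation_summary(y_true, y_pred):
    """Return (median corr across patients, median corr across genes).

    y_true, y_pred: arrays of shape (n_patients, n_genes).
    Assumption: Pearson correlation and a median summary.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    per_patient = _rowwise_pearson(y_true, y_pred)   # one value per patient
    per_gene = _rowwise_pearson(y_true.T, y_pred.T)  # one value per gene
    return float(np.median(per_patient)), float(np.median(per_gene))
```

Under this reading, a large gap between the two numbers is plausible: a per-patient correlation can be high simply because the model recovers each gene's typical expression level, whereas a per-gene correlation rewards only correct ranking of patients for that gene.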