Human medical data can be challenging to obtain because of data privacy concerns, difficulties in conducting certain types of experiments, or prohibitive costs. In many settings, data from animal models or in-vitro cell lines are available to help augment our understanding of human data. However, such data are known to have low etiological validity compared with human data. In this work, we augment small human medical datasets with data from in-vitro cell lines and animal models. We use Invariant Risk Minimisation (IRM) to elucidate invariant features by treating cross-organism data as belonging to different data-generating environments. Our models identify genes of relevance to human cancer development. We observe a degree of consistency in the results as the amounts of human and mouse data are varied; however, further work is required to obtain conclusive insights. As a secondary contribution, we enhance existing open-source datasets and provide two uniformly processed, cross-organism, homologue gene-matched datasets to the community.
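To make the idea of treating organisms as environments concrete, the following is a minimal sketch of the commonly used IRMv1 penalty for a linear predictor with squared loss. It is not taken from this work: the function names (`irm_v1_penalty`, `irm_v1_objective`, `make_env`), the synthetic two-feature data, and the choice of a linear model are all assumptions made for illustration. Each environment contributes its own risk plus the squared gradient of that risk with respect to a dummy scale fixed at 1, so a predictor that is simultaneously optimal in every environment incurs no penalty.

```python
import numpy as np

def irm_v1_penalty(X, y, phi):
    """Squared gradient of the environment risk w.r.t. a dummy scale w, evaluated at w = 1.

    With predictions w * (X @ phi) and squared loss, d/dw mean((w*z - y)^2) at w = 1
    equals mean(2 * (z - y) * z), where z = X @ phi.
    """
    z = X @ phi
    grad_w = np.mean(2.0 * (z - y) * z)
    return grad_w ** 2

def irm_v1_objective(envs, phi, lam=1.0):
    """Sum over environments of (risk + lam * penalty)."""
    total = 0.0
    for X, y in envs:
        risk = np.mean((X @ phi - y) ** 2)
        total += risk + lam * irm_v1_penalty(X, y, phi)
    return total

def make_env(rng, n, spurious_scale):
    """Hypothetical synthetic environment (assumption, not real data).

    Column 0 is an invariant feature that determines y identically in every
    environment; column 1 is a spurious feature whose relation to y changes
    with `spurious_scale`.
    """
    x_inv = rng.normal(size=n)
    y = x_inv.copy()
    x_spu = spurious_scale * y + rng.normal(size=n)
    return np.column_stack([x_inv, x_spu]), y

rng = np.random.default_rng(0)
# Two toy environments standing in for, say, human and mouse data.
envs = [make_env(rng, 500, 2.0), make_env(rng, 500, 0.5)]

invariant_phi = np.array([1.0, 0.0])  # uses only the invariant feature
spurious_phi = np.array([0.0, 1.0])   # uses only the spurious feature

obj_invariant = irm_v1_objective(envs, invariant_phi)
obj_spurious = irm_v1_objective(envs, spurious_phi)
print("invariant predictor objective:", obj_invariant)
print("spurious predictor objective:", obj_spurious)
```

In this toy setting the predictor that relies only on the invariant feature attains a lower objective than the one that relies on the environment-dependent feature. In practice the representation would typically be learned by gradient descent rather than compared by hand.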