Persistent homology (PH) is an approach to topological data analysis (TDA) that computes multi-scale topological invariants of high-dimensional data that are robust to noise. While PH has revealed useful patterns across various applications, its computational requirements have restricted it to small data sets of a few thousand points. We present Dory, an efficient and scalable algorithm that can compute the persistent homology of large data sets. Dory uses significantly less memory than published algorithms and is also significantly faster than most of them. It scales to data sets with millions of points. As an application, we compute the PH of the human genome at high resolution as revealed by a genome-wide Hi-C data set. Results show that the topology of the human genome changes significantly upon treatment with auxin, a molecule that degrades cohesin, corroborating the hypothesis that cohesin plays a crucial role in loop formation in DNA.