We present an unsupervised data processing workflow designed for fast conformational clustering of long molecular dynamics simulation trajectories. The approach combines two dimensionality reduction algorithms (cc\_analysis and encodermap) with a density-based spatial clustering algorithm (HDBSCAN). The proposed scheme benefits from the strengths of all three algorithms while avoiding most of the drawbacks of the individual methods. Here, the cc\_analysis algorithm is applied to molecular simulation data for the first time. Encodermap complements cc\_analysis by providing an efficient way to process large amounts of data and assign them to clusters. The main goal of the procedure is to maximize the number of assigned frames of a given trajectory while keeping a clear conformational identity of the clusters that are found. In practice, we achieve this by using an iterative clustering approach and a tunable root-mean-square-deviation-based criterion in the final cluster assignment. This makes it possible to find clusters of different densities as well as different degrees of structural identity. We illustrate the capability and performance of this clustering workflow with four test systems: the wild-type and a thermostable mutant of the Trp-cage protein (TC5b and TC10b), NTL9, and Protein B. Each of these systems poses individual challenges to the scheme, which together give a good overview of the advantages of the proposed method as well as the potential difficulties that can arise when using it.
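The tunable RMSD-based criterion in the final cluster assignment can be illustrated with a minimal sketch. The abstract does not spell out the procedure, so everything here is an assumption made for illustration: the function names `rmsd` and `assign_by_rmsd` are hypothetical, each cluster is represented by a single reference conformation, and the conformations are assumed to be already superimposed (no structural alignment is performed). The cc\_analysis, encodermap, and HDBSCAN steps are not shown.

```python
import math


def rmsd(conf_a, conf_b):
    """Root-mean-square deviation between two conformations.

    Each conformation is a list of (x, y, z) tuples with the same atom order.
    Assumption: both are already superimposed; no alignment is done here.
    """
    if len(conf_a) != len(conf_b):
        raise ValueError("conformations must have the same number of atoms")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(conf_a, conf_b)
    )
    return math.sqrt(squared / len(conf_a))


def assign_by_rmsd(frames, centroids, cutoff):
    """Label each frame with the index of its closest centroid.

    Frames whose closest centroid is farther than `cutoff` stay
    unassigned and receive the label -1.
    """
    labels = []
    for frame in frames:
        distances = [rmsd(frame, c) for c in centroids]
        best = min(range(len(centroids)), key=distances.__getitem__)
        labels.append(best if distances[best] <= cutoff else -1)
    return labels


# Toy data: two reference conformations and three two-atom frames.
centroids = [[(0, 0, 0), (1, 0, 0)], [(5, 0, 0), (6, 0, 0)]]
frames = [
    [(0.1, 0, 0), (1.1, 0, 0)],  # close to the first centroid
    [(5, 0.2, 0), (6, 0.2, 0)],  # close to the second centroid
    [(2.5, 0, 0), (3.5, 0, 0)],  # far from both
]
strict_labels = assign_by_rmsd(frames, centroids, cutoff=0.5)
loose_labels = assign_by_rmsd(frames, centroids, cutoff=3.0)
```

In this sketch, raising `cutoff` assigns more frames at the cost of a looser structural identity within each cluster, which is the trade-off the tunable criterion is meant to control.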