HiCOPS: High Performance Computing Framework for Tera-Scale Database Search of Mass Spectrometry based Omics Data

Database-search algorithms that deduce peptides from Mass Spectrometry (MS) data have sought to improve computational efficiency in order to support larger and more complex systems biology studies. Existing serial and high-performance computing (HPC) search engines, although otherwise highly successful, are known to scale poorly as the theoretical search space grows to meet the complexity of modern non-model, multi-species MS-based omics analysis. Consequently, the bottleneck for computational techniques is the communication cost of moving data between levels of the memory hierarchy or between processing units, and not the arithmetic operations. This post-Moore change in architecture, together with the demands of modern systems biology experiments, has dampened the overall effectiveness of existing HPC workflows. We present a novel, efficient parallel computational method and its implementation on distributed-memory architectures in a peptide identification tool called HiCOPS, which enables a more than 100-fold improvement in speed over most existing HPC proteome database search tools. HiCOPS empowers the supercomputing database search concept for comprehensive identification of peptides and all their modified forms within a reasonable time frame. We demonstrate this by searching gigabytes of experimental MS data against terabytes of databases, where HiCOPS completes peptide identification in about 100 minutes using 72 parallel nodes (1728 cores), compared with about 5 weeks required by existing state-of-the-art tools using 1 node (24 cores), a roughly 500x speedup. Finally, we formulate a theoretical framework for our overhead-avoiding strategy and report superior performance evaluation results for key metrics including execution time, CPU utilization, speedups, and I/O efficiency. The software will be made available at: hicops.github.io
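The headline speedup can be checked directly from the figures quoted above. The following is only a minimal arithmetic sketch: the variable names are illustrative and are not part of HiCOPS, and the 100-minute and 5-week run times are taken as stated in the abstract.

```python
# Figures quoted in the abstract; variable names are illustrative only.
MINUTES_PER_WEEK = 7 * 24 * 60

hicops_minutes = 100                      # HiCOPS on 72 nodes (1728 cores)
baseline_minutes = 5 * MINUTES_PER_WEEK   # existing tools on 1 node (24 cores)

# Ratio of wall-clock times, i.e. the speedup reported as "500x".
speedup = baseline_minutes / hicops_minutes

# Both configurations use the same number of cores per node.
hicops_nodes, hicops_cores = 72, 1728
cores_per_node = hicops_cores // hicops_nodes

print(f"speedup ~ {speedup:.0f}x, cores per node = {cores_per_node}")
# prints "speedup ~ 504x, cores per node = 24"
```

Five weeks is 50,400 minutes, so the ratio works out to about 504, consistent with the rounded 500x figure.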
