Similarity search is one of the most fundamental computations regularly performed on ever-growing protein datasets. Scalability is of paramount importance for uncovering novel phenomena that occur at very large scales. We harness over 20,000 GPUs on the Summit system to perform an all-vs-all protein similarity search on one of the largest publicly available datasets, containing 405 million proteins, in less than 3.5 hours, cutting the time-to-solution for many use cases from weeks to hours. The variability of protein sequence lengths, together with the sparsity of the space of pairwise comparisons, makes this a challenging problem in distributed memory. Because the application must construct and maintain a data structure holding indices to all other sequences, it has a huge memory footprint that makes it hard to scale to larger problem sizes. We overcome this memory limitation with innovative matrix-based blocking techniques, without introducing additional load imbalance.
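The abstract does not spell out the blocking scheme, so the following is only a minimal single-process Python sketch of the general idea, not the authors' implementation. It assumes that candidate pairs are found by counting shared k-mers, which amounts to a sparse "sequences x k-mers" matrix multiplied by its transpose, and that the output overlap matrix is computed one block at a time so that only one block and one column-block index are held in memory at once. The names `kmers` and `blocked_candidate_pairs`, and the parameters `k`, `block_size`, and `min_shared`, are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of distinct length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def blocked_candidate_pairs(seqs, k=3, block_size=2, min_shared=1):
    """Yield (i, j, shared) for i < j, one output block at a time.

    Illustrative assumption: similarity candidates are pairs of sequences
    that share at least `min_shared` distinct k-mers.
    """
    # Row i of the sparse sequence-by-k-mer matrix, stored as a set.
    rows = [kmers(s, k) for s in seqs]
    n = len(seqs)
    starts = range(0, n, block_size)

    for r0 in starts:
        r1 = min(r0 + block_size, n)
        for c0 in starts:
            if c0 < r0:
                continue  # the overlap matrix is symmetric: upper blocks only
            c1 = min(c0 + block_size, n)

            # Inverted index restricted to the current column block:
            # k-mer -> ids of sequences in [c0, c1) that contain it.
            index = defaultdict(list)
            for j in range(c0, c1):
                for km in rows[j]:
                    index[km].append(j)

            # Accumulate a single block of the overlap matrix.
            block = defaultdict(int)
            for i in range(r0, r1):
                for km in rows[i]:
                    for j in index.get(km, ()):
                        if i < j:
                            block[(i, j)] += 1

            for (i, j), shared in sorted(block.items()):
                if shared >= min_shared:
                    yield i, j, shared
            # `index` and `block` go out of scope here, so peak memory is
            # bounded by one block rather than by the full overlap matrix.
```

For the toy input `["MKVLAA", "KVLAAG", "GGGGGG", "MKVLGG"]` with `k=3`, `sorted(blocked_candidate_pairs(seqs))` gives `[(0, 1, 3), (0, 3, 2), (1, 3, 1)]` regardless of `block_size`; the block size only trades peak memory against the number of passes. A distributed version would additionally have to assign blocks to processes in a way that balances the uneven work per block, which this sketch does not attempt.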