Alignment-free Genomic Analysis via a Big Data Spark Platform

Motivation: Alignment-free distance and similarity functions (AF functions, for short) are a well-established alternative to pairwise and multiple sequence alignment for many genomic, metagenomic and epigenomic tasks. Because these applications are data-intensive, computing AF functions is a Big Data problem, and the recent literature indicates that developing fast and scalable algorithms for computing them is a high-priority task. Somewhat surprisingly, despite the increasing popularity of Big Data technologies in Computational Biology, the development of a Big Data platform for those tasks has not been pursued, possibly due to its complexity.

Results: We fill this important gap by introducing FADE, the first extensible, efficient and scalable Spark platform for alignment-free genomic analysis. It natively supports eighteen of the best-performing AF functions identified in a recent hallmark benchmarking study. The development of FADE and its potential impact comprise several novel aspects of interest: (a) a considerable effort in the design of distributed algorithms, the most tangible result being much faster execution times for reference methods such as MASH and FSWM; (b) a software design that makes FADE user-friendly and easily extensible by Spark non-specialists; (c) its ability to support data- and compute-intensive tasks. Regarding the latter, we provide a novel and much-needed analysis of how informative and robust AF functions are, in terms of the statistical significance of their output. Our findings naturally extend those of the highly regarded benchmarking study, in that the functions that can actually be used reduce to a handful of the eighteen included in FADE.
