With the rapid growth of Next Generation Sequencing (NGS) technologies, large amounts of "omics" data are collected daily and need to be processed. Indexing and compressing large sequence datasets are among the most important tasks in this context. Here we propose algorithms for the computation of the Burrows-Wheeler Transform that rely on Big Data technologies, i.e., Apache Spark and Hadoop. Our algorithms are the first to distribute the index computation itself, and not only the input dataset, allowing the available cloud resources to be fully exploited.
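For readers unfamiliar with the transform, the toy sketch below shows what the Burrows-Wheeler Transform computes, using Spark only to distribute and sort the cyclic rotations of a short string. This is purely an illustrative baseline under assumed names (ToyBWT, the "toy-bwt" app name, the '$' terminator); it is not the distributed index-construction algorithm proposed in this work, which avoids materialising rotations for large genomic inputs.

```scala
import org.apache.spark.sql.SparkSession

// Toy illustration of the Burrows-Wheeler Transform (BWT):
// sort all cyclic rotations of the text and keep the last column.
// NOT the paper's algorithm; for reference only.
object ToyBWT {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("toy-bwt")
      .getOrCreate()
    val sc = spark.sparkContext

    val t = "BANANA" + "$"   // '$' is an assumed unique terminator, smaller than any letter
    val n = t.length

    // Distribute the rotation start positions, build each rotation,
    // sort them lexicographically, and keep the last character of each.
    val bwt = sc.parallelize(0 until n)
      .map(i => t.substring(i) + t.substring(0, i))
      .sortBy(identity)
      .map(_.last)
      .collect()
      .mkString

    println(bwt)             // prints "ANNB$AA"
    spark.stop()
  }
}
```

Note that this sketch only distributes the input rotations; the contribution of the paper is precisely to distribute the index computation itself across the cluster.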