The suffix array is the key to efficient solutions for a myriad of string processing problems in different application domains, such as data compression, data mining, and bioinformatics. With the rapid growth of available data, suffix array construction algorithms have had to be adapted to advanced computational models such as external memory and distributed computing. In this article, we present five suffix array construction algorithms that use the new algorithmic big data batch processing framework Thrill, which allows us to process input sizes of orders of magnitude that have not been considered before.