DNA is considered a promising medium for storing digital information. As an essential step in the DNA-based data storage workflow, coding algorithms are responsible for implementing functions such as bit-to-base transcoding and error correction. In previous studies, these functions were normally realized by combining multiple algorithms. Here, we report a graph-based architecture, named SPIDER-WEB, that provides an all-in-one coding solution by generating customized algorithms automatically. SPIDER-WEB can correct up to 4% edit errors in DNA sequences, including substitutions and insertions/deletions (indels), with only 5.5% redundant symbols. Since the correcting and decoding processes require no DNA sequence pretreatment, SPIDER-WEB supports real-time information retrieval, which is 305.08 times faster than single-molecule sequencing techniques. Our retrieval process is two orders of magnitude faster than the conventional one on megabyte-level data and can scale to exabyte-level data. Therefore, SPIDER-WEB holds the potential to improve the practicability of large-scale DNA-based data storage applications.
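To make the idea of graph-based bit-to-base transcoding concrete, the following is a minimal sketch and not the SPIDER-WEB algorithm itself. It assumes a toy graph in which each vertex is the previously written base and each vertex has three outgoing edges, one to every other base, so that no two adjacent bases are identical. The names `build_graph`, `encode`, and `decode`, the start vertex `"A"`, and the choice of constraint are all hypothetical and chosen only for illustration; error correction is not included.

```python
BASES = "ACGT"

def build_graph():
    # Toy constraint graph (an assumption for illustration): the vertex is the
    # previous base, and edges lead to the three bases that differ from it.
    return {b: [n for n in BASES if n != b] for b in BASES}

def encode(data: bytes, graph, start="A") -> str:
    # Prefix a 0x01 byte so that leading zero bytes survive the integer round trip.
    value = int.from_bytes(b"\x01" + data, "big")
    digits = []
    while value:
        value, d = divmod(value, 3)  # radix 3 = out-degree of every vertex
        digits.append(d)
    state, out = start, []
    for d in reversed(digits):  # most significant digit first
        state = graph[state][d]  # follow the edge selected by the digit
        out.append(state)
    return "".join(out)

def decode(dna: str, graph, start="A") -> bytes:
    state, value = start, 0
    for base in dna:
        # index() raises ValueError if the base is not a legal successor,
        # which flags a sequence that violates the graph constraint.
        value = value * 3 + graph[state].index(base)
        state = base
    raw = value.to_bytes((value.bit_length() + 7) // 8, "big")
    return raw[1:]  # drop the 0x01 prefix byte

graph = build_graph()
message = b"SPIDER-WEB"
sequence = encode(message, graph)
assert decode(sequence, graph) == message
```

Because every step of the walk only follows edges that exist in the graph, the output sequence satisfies the constraint by construction; a decoder that meets a base with no matching edge knows the sequence has been corrupted.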