The yearly global production of data is growing exponentially, outpacing the capacity of existing storage media such as tape and disk, and with it our ability to store the data. DNA storage, the representation of arbitrary information as sequences of nucleotides, offers a promising alternative. DNA is nature's information storage molecule of choice and has a number of key properties: it is extremely dense, offering the theoretical possibility of storing 455 EB/g; it is durable, with a half-life of approximately 520 years that can be extended to thousands of years when it is chilled and stored dry; and it is amenable to automated synthesis and sequencing. Furthermore, biochemical processes that act on DNA could potentially be used to perform highly parallel computations. Whilst biological information is encoded in DNA as codons, i.e., triplets of nucleotides (also referred to as bases or base pairs) drawn from the four-letter alphabet A, T, G and C (base 4), there are many possible encoding schemes that can map data to chemical sequences of nucleotides for synthesis, storage, retrieval and computation. However, encoding schemes need to address several biological constraints, error-correction factors and information retrieval considerations for DNA storage to be viable. This comprehensive review compares existing work on encoding arbitrary data within DNA, in particular the encoding schemes used, the methods employed to address biological constraints, and the measures taken to provide error correction. Furthermore, we compare encoding approaches on the overall information density they achieve and on the data retrieval method they use, i.e., sequential or random access. We also discuss the background and evolution of the encoding schemes.
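To make the idea of mapping data to nucleotides concrete, the following is a minimal sketch of the simplest conceivable base-4 mapping, two bits per nucleotide. It is an illustration only and not one of the schemes compared in this review: the bit-to-base assignment and the function names `encode`, `decode` and `longest_homopolymer` are assumptions chosen for the example. The last function measures the longest run of identical bases, one commonly cited biological constraint, to show why such a naive mapping is generally not sufficient on its own.

```python
# Hypothetical naive mapping: two bits per nucleotide.
# The assignment below is an arbitrary choice for illustration.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Map each byte to four nucleotides, most significant bit pair first."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(strand: str) -> bytes:
    """Invert encode(); assumes an error-free strand whose length is a multiple of 4."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def longest_homopolymer(strand: str) -> int:
    """Return the length of the longest run of identical consecutive bases."""
    longest = run = 0
    previous = None
    for base in strand:
        run = run + 1 if base == previous else 1
        longest = max(longest, run)
        previous = base
    return longest


if __name__ == "__main__":
    strand = encode(b"Hi")
    print(strand, decode(strand), longest_homopolymer(strand))
    # A run of zero bytes becomes a long run of a single base.
    print(longest_homopolymer(encode(b"\x00\x00")))
```

This mapping reaches the upper bound of 2 bits per nucleotide for a four-letter alphabet, but it provides no error correction and places no limit on runs of repeated bases, which is the kind of trade-off the schemes discussed in this review have to manage.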