Due to its longevity and enormous information density, DNA is an attractive medium for archival data storage. Thanks to rapid technological advances, DNA storage is becoming practically feasible, as demonstrated by a number of experimental storage systems, making it a promising solution for our society's increasing need of data storage. While in living things, DNA molecules can consist of millions of nucleotides, due to technological constraints, in practice, data is stored on many short DNA molecules, which are preserved in a DNA pool and cannot be spatially ordered. Moreover, imperfections in sequencing, synthesis, and handling, as well as DNA decay during storage, introduce random noise into the system, making the task of reliably storing and retrieving information in DNA challenging. This unique setup raises a natural information-theoretic question: how much information can be reliably stored on and reconstructed from millions of short noisy sequences? The goal of this monograph is to address this question by discussing the fundamental limits of storing information on DNA. Motivated by current technological constraints on DNA synthesis and sequencing, we propose a probabilistic channel model that captures three key distinctive aspects of the DNA storage systems: (1) the data is written onto many short DNA molecules that are stored in an unordered fashion; (2) the molecules are corrupted by noise and (3) the data is read by randomly sampling from the DNA pool. Our goal is to investigate the impact of each of these key aspects on the capacity of the DNA storage system. Rather than focusing on coding-theoretic considerations and computationally efficient encoding and decoding, we aim to build an information-theoretic foundation for the analysis of these channels, developing tools for achievability and converse arguments.
翻译:由于DNA的寿命长,信息密度巨大,DNA是档案数据存储的诱人媒介。由于技术的迅速进步,DNA储存正在变得实际可行,正如一些实验性储存系统所显示的那样,DNA储存正在变得非常可行,这使我们社会越来越需要数据储存,成为我们社会日益需要数据储存的一个有希望的解决办法。在生命中,DNA分子可以由数百万核素分子组成。在实际中,由于技术限制,DNA分子可以储存成百万核素核素分子。由于在DNA库中保存,无法在空间上排列数据。此外,由于DNA的顺序、合成和处理以及储存过程中的DNA的不完善性,我们提议在系统中引入随机噪音,使DNA储存和检索信息的任务变得艰巨。这一独特的设置提出了一个自然的信息理论问题:由于技术的限制,DNA分子序列中有多少信息可以可靠地储存起来,再从数百万个短的核素分子序列中重建。我们提出的DNA合成和测序中,我们提出的DNA储存和测序中的三个关键特征的解解算法模型,通过DNA储存的每个数据都是从DNA储存的每一个DNA储存和分子的精度分析的精度,这些DNA的精度,数据都是从DNA的精度的精度的精度的精度。