Single-Read Reconstruction for DNA Data Storage Using Transformers

As the global need for large-scale data storage rises exponentially, existing storage technologies are approaching their theoretical and functional limits in density and energy consumption, making DNA-based storage a potential solution for the future of data storage. Several studies have introduced DNA-based storage systems with high information density (petabytes per gram). However, DNA synthesis and sequencing technologies yield erroneous outputs. Algorithmic approaches for correcting these errors depend on reading multiple copies of each sequence, which results in excessive reading costs. The unprecedented success of Transformers as a deep learning architecture for language modeling has led to their repurposing for a variety of tasks across domains. In this work, we propose a novel approach for single-read reconstruction in DNA-based data storage using an encoder-decoder Transformer architecture. We treat error correction as a self-supervised sequence-to-sequence task and use synthetic noise injection to train the model using only the decoded reads. Our approach exploits the inherent redundancy of each decoded file to learn its underlying structure. To demonstrate the approach, we encode text, image, and code-script files to DNA, introduce errors with a high-fidelity error simulator, and reconstruct the original files from the noisy reads. Our model achieves lower error rates when reconstructing the original data from a single read of each DNA strand than state-of-the-art algorithms using 2-3 copies. This is the first demonstration of deep learning models for single-read reconstruction in DNA-based storage, which reduces the overall cost of the process. We show that this approach is applicable to various domains and can be generalized to new domains as well.
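The self-supervised setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two-bits-per-base mapping, the function names (`bytes_to_dna`, `inject_noise`, `make_training_pairs`), and the error probabilities are all assumptions chosen for illustration, and the simple independent substitution/insertion/deletion model stands in for the high-fidelity error simulator mentioned above. The idea shown is one plausible reading of "training using only the decoded reads": each available read is corrupted further with synthetic noise, and the (corrupted, read) pair serves as a (source, target) example for a sequence-to-sequence model.

```python
import random

BASES = "ACGT"  # assumed alphabet order; index = 2-bit value


def bytes_to_dna(data: bytes) -> str:
    """Map each byte to four nucleotides, two bits per base (illustrative mapping)."""
    return "".join(
        BASES[(byte >> shift) & 0b11] for byte in data for shift in (6, 4, 2, 0)
    )


def dna_to_bytes(strand: str) -> bytes:
    """Inverse of bytes_to_dna; trailing bases that do not fill a byte are dropped."""
    out = bytearray()
    usable = len(strand) - len(strand) % 4
    for i in range(0, usable, 4):
        value = 0
        for base in strand[i:i + 4]:
            value = (value << 2) | BASES.index(base)
        out.append(value)
    return bytes(out)


def inject_noise(strand, p_sub=0.01, p_ins=0.01, p_del=0.01, rng=None):
    """Apply independent per-base deletions, substitutions, and insertions.

    The probabilities are placeholder values, not rates from the paper.
    """
    if rng is None:
        rng = random.Random()
    noisy = []
    for base in strand:
        r = rng.random()
        if r < p_del:
            continue  # deletion: drop this base
        if r < p_del + p_sub:
            # substitution: replace with a different base
            noisy.append(rng.choice([b for b in BASES if b != base]))
        else:
            noisy.append(base)
        if rng.random() < p_ins:
            noisy.append(rng.choice(BASES))  # insertion after this base
    return "".join(noisy)


def make_training_pairs(reads, copies=4, seed=0):
    """Build (noisy_input, target) pairs using the reads themselves as targets."""
    rng = random.Random(seed)
    return [(inject_noise(read, rng=rng), read) for read in reads for _ in range(copies)]


if __name__ == "__main__":
    strand = bytes_to_dna(b"DNA storage")
    for source, target in make_training_pairs([strand], copies=2):
        print(source, "->", target)
```

The resulting pairs would then be tokenized and fed to an encoder-decoder model; that training step is omitted here because the abstract does not specify its details.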
