DNA has immense potential as an emerging data storage medium. DNA storage works by converting digital information between binary code streams, quaternary base sequences, and physical DNA fragments. This process inevitably introduces errors, which makes accurate data recovery challenging. Sequence reconstruction consists of inferring the DNA reference from a cluster of erroneous copies. A common assumption in existing methods is that all strands within a cluster are noisy copies of the same reference and therefore contribute equally to the reconstruction. However, this assumption does not always hold, because clusters can contain contaminated sequences caused, for example, by DNA fragmentation and rearrangement during the DNA storage process. This paper proposes a robust multi-read reconstruction model based on a deep neural network (DNN) that is resilient both to contaminated clusters containing outlier sequences and to noisy reads with insertion, deletion, and substitution (IDS) errors. The effectiveness and robustness of the method are validated on three next-generation sequencing datasets through a series of comparative experiments that simulate the varying contamination levels that can occur during DNA storage.