Ever since deoxyribonucleic acid (DNA) was first considered as a next-generation data-storage medium, considerable research effort has gone into using error-correcting codes (ECCs) to correct the errors that occur during synthesis, storage, and sequencing. Previous work on recovering data from an error-containing pool of sequenced DNA has relied on hard decoding algorithms based on a majority decision rule. To improve the correction capability of ECCs and the robustness of the DNA storage system, we propose a new iterative soft decoding algorithm in which soft information is obtained from FASTQ files and channel statistics. In particular, we propose a new formula for calculating the log-likelihood ratio (LLR) from quality scores (Q-scores), together with a redecoding method that may be suited to error correction and detection in DNA sequencing. Building on the widely adopted fountain-code encoding scheme proposed by Erlich et al., we evaluate performance on three different sets of sequenced data to show that the results are consistent. The proposed soft decoding algorithm yields a 2.3% to 7.0% improvement in read-number reduction over the state-of-the-art decoding method, and it is shown to handle erroneous sequenced oligo reads that contain insertion and deletion errors.
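To make the idea of soft information from FASTQ files concrete, the following is a minimal sketch of how per-bit LLRs could be derived from Q-scores. It is not the LLR formula proposed in this work, which the abstract does not state. It relies on the standard Phred definition, in which a quality score Q corresponds to an error probability of 10^(-Q/10), and on the common Phred+33 ASCII encoding. The two-bits-per-nucleotide mapping, the assumption that a wrong base call is equally likely to be any of the other three bases, and the function names `phred_to_error_prob` and `read_to_llrs` are all illustrative assumptions.

```python
import math

# Assumed mapping of nucleotides to bit pairs (illustrative only).
BASE_TO_BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}


def phred_to_error_prob(qual_char, offset=33):
    """Convert one FASTQ quality character to a base-call error probability."""
    q = ord(qual_char) - offset
    return 10.0 ** (-q / 10.0)


def read_to_llrs(seq, qual):
    """Return two LLRs per base, with LLR = log(P(bit = 0) / P(bit = 1)).

    Assumes substitution errors only, spread uniformly over the three
    other bases. Insertions and deletions are not modelled here.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings must have equal length")
    llrs = []
    for base, qual_char in zip(seq, qual):
        bits = BASE_TO_BITS.get(base)
        if bits is None:
            # Unknown call such as 'N': no information about either bit.
            llrs.extend([0.0, 0.0])
            continue
        # Cap at 0.75, the error rate of a uniformly random base call.
        p = min(phred_to_error_prob(qual_char), 0.75)
        # Two of the three wrong bases flip any given bit position.
        p_flip = 2.0 * p / 3.0
        magnitude = math.log((1.0 - p_flip) / p_flip)
        for bit in bits:
            llrs.append(magnitude if bit == 0 else -magnitude)
    return llrs


if __name__ == "__main__":
    # One hypothetical read: high quality for the first bases, low for the last.
    for value in read_to_llrs("ACGT", "II5+"):
        print(f"{value:+.3f}")
```

Under these assumptions, a high Q-score produces an LLR of large magnitude and a low Q-score produces one close to zero, which is what lets a soft decoder weight reliable base calls more heavily than a majority vote over hard decisions would.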