Most DNA sequencing technologies are based on the shotgun paradigm: many short reads are obtained from random, unknown locations in the DNA sequence. A fundamental question, studied in arXiv:1203.6233, is what read length and coverage depth (i.e., the total number of reads) are needed to guarantee reliable sequence reconstruction. Motivated by DNA-based storage, we study the coded version of this problem, i.e., the scenario where the DNA molecule being sequenced is a codeword from a predefined codebook. Our main result is an exact characterization of the capacity of the resulting shotgun sequencing channel as a function of the read length and coverage depth. In particular, our results imply that, while in the uncoded case $O(n)$ reads of length greater than $2\log n$ are needed for reliable reconstruction of a length-$n$ binary sequence, in the coded case only $O(n/\log n)$ reads of length greater than $\log n$ are needed for the capacity to be arbitrarily close to $1$.
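To make the two scaling regimes concrete, the following is a minimal Python sketch of the shotgun sampling model, not the construction analyzed in the paper. It rests on several assumptions that the abstract does not state: logarithms are taken base 2, the constants hidden in the $O(\cdot)$ terms are set to 1, reads start at uniformly random positions without wraparound, and the helper names (`read_budget`, `sample_reads`, `covered_fraction`) are hypothetical. Start positions are kept only to measure coverage; a decoder would not observe them.

```python
import math
import random


def read_budget(n):
    """Illustrative read counts and lengths for the two regimes.

    Assumptions: log base 2, hidden O(.) constants equal to 1, and read
    lengths one symbol above the stated thresholds (2 log n and log n).
    """
    log_n = math.log2(n)
    return {
        "uncoded": {"num_reads": n, "read_length": math.floor(2 * log_n) + 1},
        "coded": {"num_reads": math.ceil(n / log_n),
                  "read_length": math.floor(log_n) + 1},
    }


def sample_reads(sequence, num_reads, read_length, rng):
    """Draw fixed-length reads from uniformly random start positions."""
    last_start = len(sequence) - read_length
    starts = [rng.randrange(last_start + 1) for _ in range(num_reads)]
    return [(s, sequence[s:s + read_length]) for s in starts]


def covered_fraction(n, reads, read_length):
    """Fraction of the n positions that lie inside at least one read."""
    covered = [False] * n
    for start, _ in reads:
        for i in range(start, start + read_length):
            covered[i] = True
    return sum(covered) / n


if __name__ == "__main__":
    rng = random.Random(0)
    n = 1024
    sequence = [rng.randrange(2) for _ in range(n)]  # random binary sequence
    for regime, params in read_budget(n).items():
        reads = sample_reads(sequence, params["num_reads"],
                             params["read_length"], rng)
        frac = covered_fraction(n, reads, params["read_length"])
        print(regime, params, f"covered fraction = {frac:.3f}")
```

Running the script prints the read budget and the empirical covered fraction for each regime. The sketch only illustrates how few reads the coded regime uses relative to the uncoded one; it says nothing about the capacity expression itself.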