通用子串重构算法 (Generalized Unique Reconstruction from Substrings)

from arxiv, Author-submitted, peer-reviewed and accepted version (IEEE Trans. on Inform. Theory). arXiv admin note: text overlap with arXiv:2205.03933

This paper introduces a new family of reconstruction codes which is motivated by applications in DNA data storage and sequencing. In such applications, DNA strands are sequenced by reading some subset of their substrings. While previous works considered two extreme cases in which all substrings of pre-defined lengths are read or substrings are read with no overlap for the single string case, this work studies two extensions of this paradigm. The first extension considers the setup in which consecutive substrings are read with some given minimum overlap. First, an upper bound is provided on the attainable rates of codes that guarantee unique reconstruction. Then, efficient constructions of codes that asymptotically meet that upper bound are presented. In the second extension, we study the setup where multiple strings are reconstructed together. Given the number of strings and their length, we first derive a lower bound on the read substrings' length $\ell$ that is necessary for the existence of multi-strand reconstruction codes with non-vanishing rates. We then present two constructions of such codes and show that their rates approach 1 for values of $\ell$ that asymptotically behave like the lower bound.

翻译：本文介绍了一种新的重构代码家族，该家族的动机来自DNA数据存储和测序应用。在这样的应用中，通过读取某些子串的子集来对DNA链进行测序。虽然之前的研究考虑到了所有预定义长度的子串都被读取或单个字符串情况下没有重叠的子串都被读取的两个极端情况，但本文研究了这种范例的两个扩展。第一个扩展考虑的是连续子串用一定的最小重叠读取的情况。首先，对于保证唯一重构的编码提供了一个上界。然后，提出了有效的编码构造，它们渐近地达到了上界。在第二个扩展中，我们研究了多个字符串一起重构的情况。给定字符串的数量和长度，首先推导了读取子串长度$\ \ell$的下界，该下界是保证多股重构代码存在且具有非零速率所必需的。然后，我们提出了两个这种代码的构造，并证明了它们的速率在$\ \ell $值渐近地接近于下界时趋近于1。