The problem of reconstructing strings from substring information has found many applications due to its importance in genomic data sequencing and DNA- and polymer-based data storage. One practically important and challenging paradigm requires reconstructing mixtures of strings based on the union of compositions of their prefixes and suffixes, generated by mass spectrometry devices. We describe new coding methods that allow for unique joint reconstruction of subsets of strings selected from a code and provide upper and lower bounds on the asymptotic rate of the underlying codebooks. Our code constructions combine properties of binary Bh and Dyck strings and that can be extended to accommodate missing substrings in the pool. As auxiliary results, we obtain the first known bounds on binary Bh sequences for arbitrary even parameters h, and also describe various error models inherent to mass spectrometry analysis. This paper contains a correction of the prior work by the authors, published in [24]. In particular, the bounds on the prefix codes are now corrected.
翻译:从子字符串信息中重建字符串的问题由于在基因组数据测序以及DNA和聚合物数据存储中的重要性而发现许多应用问题。一个实际重要和具有挑战性的范例是,需要根据质量光谱仪设备产生的前缀和后缀的组合组合来重建字符串的混合物。我们描述了新的编码方法,这些方法使得从一个代码中选择的字符串子集能够进行独特的联合重建,并为基础代码簿的无药率提供上下限。我们的代码构造结合了二进制 Bh 和 Dyck 字符串的特性,这些特性可以扩展以适应池中缺失的子字符串。作为辅助结果,我们获得了关于任意甚至参数h的二进制 Bh 序列的第一个已知界限,还描述了质量光谱分析所固有的各种错误模型。本文载有作者先前工作(在[24] 中发表) 的更正。特别是前缀代码的界限现在得到了更正。