We consider the problem of coding for the substring channel, in which information strings are observed only through their (multisets of) substrings. Because of applications to DNA-based data storage, due to DNA sequencing techniques, interest in this channel has renewed in recent years. In contrast to existing literature, we consider a noisy channel model, where information is subject to noise \emph{before} its substrings are sampled, motivated by in-vivo storage. We study two separate noise models, substitutions or deletions. In both cases, we examine families of codes which may be utilized for error-correction and present combinatorial bounds. Through a generalization of the concept of repeat-free strings, we show that the added required redundancy due to this imperfect observation assumption is sublinear, either when the fraction of errors in the observed substring length is sufficiently small, or when that length is sufficiently long. This suggests that no asymptotic cost in rate is incurred by this channel model in these cases.
翻译:我们考虑子字符串频道的编码问题,即信息字符串仅通过其(多重)子字符串观测到。由于DNA数据储存的应用,由于DNA排序技术,近年来对这个频道的兴趣重新抬头。与现有文献不同,我们考虑一个噪音的频道模型,即信息受噪音的干扰,其子字符串是抽样的,由动态存储驱动。我们研究两种不同的噪音模型、替代或删除。在这两种情况下,我们研究可用于错误校正和当前组合界限的代码的组别。我们通过对重复自由字符串概念的概括化,我们表明由于这一不完善的观察假设而增加的冗余是子线性,要么在观察到的子字符串长度的误差部分足够小,要么在时间足够长的情况下。这表明,在这些情况下,这种频道模型不会产生比例上的微量成本。