We consider the problem of coding for the substring channel, in which information strings are observed only through their (multisets of) substrings. Because of applications to DNA-based data storage, due to DNA sequencing techniques, interest in this channel has renewed in recent years. In contrast to existing literature, we consider a noisy channel model, where information is subject to noise \emph{before} its substrings are sampled, motivated by in-vivo storage. We study two separate noise models, substitutions or deletions. In both cases, we examine families of codes which may be utilized for error-correction and present combinatorial bounds. Through a generalization of the concept of repeat-free strings, we show that the added required redundancy due to this imperfect observation assumption is sublinear, either when the fraction of errors in the observed substring length is sufficiently small, or when that length is sufficiently long. This suggests that no asymptotic cost in rate is incurred by this channel model.
翻译:我们考虑子字符串频道的编码问题,即信息字符串仅通过其(多重)子字符串观测到。由于DNA数据储存的应用,由于DNA排序技术,近年来对这个频道的兴趣重新出现。与现有文献不同,我们考虑一个吵闹的频道模型,在这个模型中,信息受到噪音的困扰,其子字符串是由动态存储驱动的。我们研究两种不同的噪音模型、替代或删除。在这两种情况下,我们检查可用于错误校正和当前组合界限的代码的组合。我们通过对重复自由字符串概念的概括化,我们表明由于这一不完善的观察假设而增加的冗余是子线性,要么在所观察到的子字符串长度的误差部分足够小,要么在时间足够长的情况下。这说明这个频道模型不会产生低温的成本。