Storing information in DNA molecules is of great interest because of its longevity, high storage density, and low maintenance cost. A key step in the DNA storage pipeline is to efficiently cluster the retrieved DNA sequences according to their similarity. The Levenshtein distance is the most suitable metric for the similarity between two DNA sequences, but it is computationally expensive and poorly compatible with mature clustering algorithms. In this work, we propose a novel deep squared Euclidean embedding for DNA sequences, built on a Siamese neural network, a squared Euclidean embedding, and chi-squared regression. The Levenshtein distance is approximated by the squared Euclidean distance between the embedding vectors, which is fast to compute and well suited to standard clustering algorithms. The proposed approach is analyzed both theoretically and experimentally. The results show that the proposed embedding is efficient and robust.
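
To make the two distances concrete, the following is a minimal sketch rather than the implementation used in this work. It shows the standard dynamic program for the Levenshtein distance, whose cost grows with the product of the two sequence lengths, next to the squared Euclidean distance between two fixed-length vectors, whose cost grows only with the embedding dimension. The function names `levenshtein` and `squared_euclidean` and the example vectors are illustrative assumptions; the trained Siamese network that would produce the embedding vectors is not shown.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic program.

    Runs in O(len(a) * len(b)) time, which is what makes it costly
    when many pairs of sequences must be compared.
    """
    # prev[j] holds the distance between the empty prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca from a
                curr[j - 1] + 1,           # insert cb into a
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def squared_euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors, O(len(u))."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(u, v))


if __name__ == "__main__":
    # One substitution separates these two short DNA strings.
    print(levenshtein("ACGTAC", "ACTTAC"))
    # Hypothetical embedding vectors, made up for illustration only.
    print(squared_euclidean([1.0, 2.0], [3.0, 5.0]))
```

In the setting described above, the second function would be applied to the embedding vectors of two sequences, so that a clustering algorithm can work with cheap vector distances instead of running the dynamic program on every pair.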