We construct a compact vector representation in $\mathbb{R}^{24}$ of a DNA sequence of arbitrary length. Each component of this vector is obtained from a representative sequence whose elements are the values taken by a function $\Gamma$. The function $\Gamma$ acts on neighborhoods of arbitrary radius located at strategic positions within the DNA sequence. Because the prime factorization of an integer is unique, $\Gamma$ carries complete information about the local multiplicity of the nucleotides. The two parameters characterizing the radius and location of the neighborhoods are fixed by comparing the phylogenetic tree produced by our algorithm with standard results for the $\beta$-globin gene sequences of eleven different species. Remarkably, the time complexity of this similarity analysis turns out to be $\mathcal{O}(n)$. Using the values of the two fitting parameters so obtained, the method is further applied to analyze mitochondrial genome sequences.
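The abstract does not give the explicit form of $\Gamma$, so the following is only a minimal sketch of the underlying idea, under stated assumptions: each nucleotide is labeled with a distinct prime (the labels 2, 3, 5, 7 are hypothetical), and the value on a neighborhood is the product of the labels it contains. Uniqueness of prime factorization then lets the local nucleotide counts be read back from that single integer. The function names `gamma` and `multiplicities` are illustrative and not taken from the paper.

```python
# Hypothetical prime labels for the four nucleotides (an assumption,
# not the assignment used in the paper).
PRIMES = {"A": 2, "C": 3, "G": 5, "T": 7}


def gamma(seq, center, radius):
    """Product of the prime labels in the window
    [center - radius, center + radius], clipped to the sequence ends."""
    lo = max(0, center - radius)
    hi = min(len(seq), center + radius + 1)
    value = 1
    for base in seq[lo:hi]:
        value *= PRIMES[base]
    return value


def multiplicities(value):
    """Recover the nucleotide counts encoded in a gamma value by
    dividing out each prime label as many times as possible."""
    counts = {}
    for base, p in PRIMES.items():
        n = 0
        while value % p == 0:
            value //= p
            n += 1
        counts[base] = n
    return counts


if __name__ == "__main__":
    v = gamma("ACGTAC", center=2, radius=1)  # window is "CGT"
    print(v, multiplicities(v))
```

In this sketch the integer encodes only how many times each nucleotide occurs in the window, not their order, which matches the abstract's statement that $\Gamma$ captures local multiplicity.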