In ongoing work to define a principled method for syntenic block discovery and structuring, work based on homology-derived constraints and a generalization of common intervals, we faced a fundamental computational problem: how to determine quickly, among a set of indeterminate strings (strings whose elements consist of subsets of characters), contiguous intervals that would share a vast majority of their elements, but allow for sharing subsets of characters subsumed by others, and also for certain elements to be missing from certain genomes. An algorithm for this problem in the special case of determinate strings (where each element is a single character of the alphabet, i.e., "normal" strings) was described by Doerr et al., but its running time would explode if generalized to indeterminate strings. In this paper, we describe an algorithm for computing these special common intervals in time close to that of the simpler algorithm of Doerr et al. and show that can compute these intervals in just a couple of hours for large collections (tens to hundreds) of bacterial genomes.
翻译:在定义合成块发现和构造的基础方法中,研究人员通过基于同源导出的约束和公共区间的一般化方法,面临着一个基本的计算问题:如何快速确定一组不确定字符串(元素包含字符子集的字符串)之间的连续区间,这些连续区间将共享绝大多数元素,但允许共享由其他元素包含的字符子集,并且也允许某些元素丢失在特定的基因组中。Doerr等人描述了一个该特殊情况下(其中每个元素是字母表中的一个字符,即“正常”字符串)的算法,但是如果推广到不确定字符串,其运行时间会急剧增加。在本文中,我们描述了一个算法,在时间接近Doerr等人所提出的简单算法的情况下计算这些特殊的公共区间,并展示对于大型集合(上百个)的细菌基因组,我们可以在几个小时内计算这些区间。