In this paper, we consider the problem of identifying patterns of interest in colored strings. A colored string is a string where each position is assigned one of a finite set of colors. Our task is to find substrings of the colored string that always occur followed by the same color at the same distance. The problem is motivated by applications in embedded systems verification, in particular, assertion mining. The goal there is to automatically find properties of the embedded system from the analysis of its simulation traces. We show that, in our setting, the number of patterns of interest is upper-bounded by $\mathcal{O}(n^2)$, where $n$ is the length of the string. We introduce a baseline algorithm, running in $\mathcal{O}(n^2)$ time, which identifies all patterns of interest satisfying certain minimality conditions, for all colors in the string. For the case where one is interested in patterns related to one color only, we also provide a second algorithm which runs in $\mathcal{O}(n^2\log n)$ time in the worst case but is faster than the baseline algorithm in practice. Both solutions use suffix trees, and the second algorithm also uses an appropriately defined priority queue, which allows us to reduce the number of computations. We performed an experimental evaluation of the proposed approaches over both synthetic and real-world datasets, and found that the second algorithm outperforms the first algorithm on all simulated data, while on the real-world data, the performance varies between a slight slowdown (on half of the datasets) and a speedup by a factor of up to 11.
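To make the problem statement concrete, the following is a minimal brute-force sketch in Python. It is not the suffix-tree algorithm described above, and it does not apply the minimality conditions. The function name `color_determined_pairs` and the bounds `max_len` and `max_dist` are hypothetical. The sketch also rests on two assumptions not fixed by the abstract: the distance `d` is counted from the last character of an occurrence, and an occurrence too close to the end of the string to have a color at distance `d` disqualifies the pair.

```python
from collections import defaultdict


def color_determined_pairs(text, colors, max_len, max_dist):
    """Return {(pattern, d): color} for every substring `pattern` of `text`
    (length <= max_len) and distance d (1 <= d <= max_dist) such that every
    occurrence of `pattern` is followed, d positions after its last
    character, by the same color.

    Assumption: an occurrence with no position at distance d (it would run
    past the end of the string) disqualifies the pair (pattern, d).
    """
    if len(text) != len(colors):
        raise ValueError("text and colors must have the same length")
    n = len(text)

    # Collect the start positions of every substring of length <= max_len.
    occurrences = defaultdict(list)
    for i in range(n):
        for length in range(1, min(max_len, n - i) + 1):
            occurrences[text[i:i + length]].append(i)

    result = {}
    for pattern, starts in occurrences.items():
        last = len(pattern) - 1  # offset of the pattern's last character
        for d in range(1, max_dist + 1):
            seen = set()
            for i in starts:
                j = i + last + d
                # None marks an occurrence with no color at distance d.
                seen.add(colors[j] if j < n else None)
            if len(seen) == 1 and None not in seen:
                result[(pattern, d)] = seen.pop()
    return result


if __name__ == "__main__":
    # Each character of the second argument is the color of that position.
    pairs = color_determined_pairs("abcabd", "rrgrrb", max_len=2, max_dist=1)
    for (pattern, d), color in sorted(pairs.items()):
        print(pattern, d, color)
```

In this toy input, `ab` occurs twice and is followed at distance 1 by two different colors, so it is not reported, whereas `a` is followed by the same color at distance 1 in both of its occurrences. The enumeration examines every substring explicitly, so it is only practical for very short strings.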