Merging $T$ sorted, non-redundant lists containing a total of $M$ elements into a single sorted, non-redundant result of size $N \ge M/T$ is a classic problem, typically solved in practice in $O(M \log T)$ time with a priority queue, the most basic of which is the simple *heap*. We revisit this problem for the case where the list elements are *strings* and the lists contain many *identical or nearly identical elements*. By keeping simple auxiliary information with each heap node, we devise an $O(M \log T + S)$ worst-case method that performs no more character comparisons than $S$, the sum of the lengths of all the strings, and an $O(M \log (T/\bar e) + S)$ method that becomes progressively more efficient as $\bar e = M/N$, the average multiplicity of an element across the input lists, increases, reaching linear time when the lists are all identical. Both methods perform favorably in practice against an alternative formulation based on a trie.
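For orientation, the classic heap-based baseline referred to above can be sketched as follows. This is a minimal illustration of the standard $O(M \log T)$ merge only, not the string-aware methods of this work: it compares whole strings on every heap operation and keeps no auxiliary information per node. The function name `merge_unique` and the sample data are assumptions made for the example.

```python
import heapq
from typing import Iterator, List


def merge_unique(lists: List[List[str]]) -> Iterator[str]:
    """Baseline T-way merge of sorted, duplicate-free lists via a binary heap.

    Hypothetical helper for illustration; every comparison is a full
    string comparison, which is the cost the abstract aims to reduce.
    """
    # One heap entry per non-empty list: (current string, list index, position).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)

    last = None  # most recently emitted string, used to drop duplicates
    while heap:
        value, i, pos = heapq.heappop(heap)
        if value != last:
            yield value
            last = value
        # Advance within the list the popped element came from.
        if pos + 1 < len(lists[i]):
            heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))


lists = [["ant", "bee", "cat"], ["bee", "cat", "dog"], ["ant", "dog"]]
merged = list(merge_unique(lists))
```

In this toy input, $T = 3$, $M = 8$, and the merged result has $N = 4$ elements, so $\bar e = M/N = 2$ and $N \ge M/T$ holds.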