Finding the set of the n items most dissimilar from each other out of a larger population becomes increasingly difficult and computationally expensive as either n or the population size grows large. Finding the set of the n most dissimilar items is different than simply sorting an array of numbers because there exists a pairwise relationship between each item and all other items in the population. For instance, if you have a set of the most dissimilar n=4 items, one or more of the items from n=4 might not be in the set n=5. An exact solution would have to search all possible combinations of size n in the population, exhaustively. We present an open-source software called similarity downselection (SDS), written in Python and freely available on GitHub. SDS implements a heuristic algorithm for quickly finding the approximate set(s) of the n most dissimilar items. We benchmark SDS against a Monte Carlo method, which attempts to find the exact solution through repeated random sampling. We show that for SDS to find the set of n most dissimilar conformers, our method is not only orders of magnitude faster, but is also more accurate than running the Monte Carlo for 1,000,000 iterations, each searching for set sizes n=3-7 out of a population of 50,000. We also benchmark SDS against the exact solution for example small populations, showing SDS produces a solution close to the exact solution in these instances.
翻译:在较大人口群体中查找最不同的 n 项集,随着n 或人口规模的扩大而变得日益困难,计算成本也越来越昂贵。 查找最不同的项集, 不同于简单的排序数字阵列, 因为每个项目和人口中所有其他项目之间存在一种对等关系。 例如, 如果您有一组最不同的 n=4 项, 从 n=4 项中的一个或多个项可能不会在设定 n= = 5 中找到确切的解决方案。 准确的解决方案是, 要找到人口中最不同的尺寸组合, 就必须详尽地搜索。 我们展示了一个叫做“ 类似” 的开放源代码软件, 名为“ 缩小选择”, 以 Python 书写, 并在 GitHub 上自由提供。 SDS 使用一种超自然的算法, 快速查找最不相近的项的大约的 n= 4 项。 我们用一个 Monte Carlo 方法对SDS 进行基准, 试图通过重复随机抽样来找到准确的解决方案。 我们的SDS 找到最相异的数据集组组合, 我们的方法不单级, 我们的方法不仅要快得多,, 而且还要显示一个比正在搜索的N- 10万 CL 的答案的精确的大小。