Selecting a small set of informative features from a large number of possibly noisy candidates is a challenging problem with many applications in machine learning and approximate Bayesian computation. In practice, the cost of computing informative features also needs to be considered. This is particularly important for networks because the computational costs of individual features can span several orders of magnitude. We addressed this issue for the network model selection problem using two approaches. First, we adapted nine feature selection methods to account for the cost of features. We showed for two classes of network models that the cost can be reduced by two orders of magnitude without considerably affecting classification accuracy (the proportion of correctly identified models). Second, we selected features using pilot simulations with smaller networks. This approach reduced the computational cost by a factor of 50 without affecting classification accuracy. To demonstrate the utility of our approach, we applied it to three different yeast protein interaction networks and identified the best-fitting duplication divergence model.
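The pilot-simulation idea can be sketched in the same spirit: select features on small, cheap simulated networks and reuse the selection at full size. The duplication divergence simulator below is NetworkX's duplication_divergence_graph; the feature set, network sizes, retention probabilities, and the ANOVA-based selector are all placeholder choices, not the paper's setup:

```python
import networkx as nx
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Candidate summary features; the choice is illustrative only.
FEATURES = {
    "density":    nx.density,                        # cheap
    "max_degree": lambda g: max(d for _, d in g.degree()),
    "clustering": nx.average_clustering,             # more expensive
}

def simulate_features(retention_probs, n_nodes, n_reps=50, seed=0):
    """Simulate duplication divergence networks of a given size and
    return a feature matrix X with one model label per row in y."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for label, p in enumerate(retention_probs):  # one model per retention prob.
        for _ in range(n_reps):
            g = nx.duplication_divergence_graph(
                n_nodes, p, seed=int(rng.integers(2**31)))
            X.append([f(g) for f in FEATURES.values()])
            y.append(label)
    return np.array(X), np.array(y)

# Pilot stage: small networks are cheap to simulate and summarize ...
X_pilot, y_pilot = simulate_features([0.2, 0.5], n_nodes=100)
keep = SelectKBest(f_classif, k=2).fit(X_pilot, y_pilot).get_support(indices=True)
# ... so only the selected features need computing on full-size networks.
print([list(FEATURES)[j] for j in keep])
```

The saving comes from performing selection on networks that are much cheaper to simulate and summarize; the abstract reports a factor of roughly 50 in the paper's setting.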