We study a fundamental problem in Bayesian learning, where the goal is to select a set of data sources with minimum cost while achieving a certain learning performance based on the data streams provided by the selected data sources. First, we show that the data source selection problem for Bayesian learning is NP-hard. We then show that the data source selection problem can be transformed into an instance of the submodular set covering problem studied in the literature, and provide a standard greedy algorithm that solves the data source selection problem with provable performance guarantees. Next, we propose a fast greedy algorithm that improves on the running time of the standard greedy algorithm, while achieving performance guarantees comparable to those of the standard greedy algorithm. The fast greedy algorithm can also be applied to solve the general submodular set covering problem with performance guarantees. Finally, we validate the theoretical results using numerical examples, and show that the greedy algorithms work well in practice.
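The standard greedy algorithm referenced above can be sketched as follows. This is the classic cost-benefit greedy for submodular set cover (pick the element with the best marginal-gain-to-cost ratio until the coverage target is met), shown with an illustrative weighted-coverage objective; the function names, example sets, and costs are assumptions standing in for the paper's Bayesian-learning objective, not the paper's actual formulation.

```python
def greedy_set_cover(ground, f, cost, Q):
    """Greedy for submodular set cover: repeatedly add the element with
    the largest marginal gain per unit cost until f(selected) >= Q."""
    selected = set()
    while f(selected) < Q:
        best, best_ratio = None, 0.0
        for e in ground - selected:
            gain = f(selected | {e}) - f(selected)
            if gain > 0 and gain / cost[e] > best_ratio:
                best, best_ratio = e, gain / cost[e]
        if best is None:  # no element makes progress; Q unreachable
            break
        selected.add(best)
    return selected

# Illustrative instance: weighted coverage (a monotone submodular function).
universe_sets = {
    "a": {1, 2, 3},
    "b": {3, 4},
    "c": {4, 5, 6},
}
cost = {"a": 3.0, "b": 1.0, "c": 2.0}

def f(S):
    # Number of universe elements covered by the selected sets.
    covered = set().union(*(universe_sets[e] for e in S)) if S else set()
    return len(covered)

sel = greedy_set_cover(set(universe_sets), f, cost, Q=6)
```

For this instance the greedy picks "b" first (gain 2 at cost 1), then "c", then "a", reaching full coverage. The fast greedy algorithm proposed in the paper accelerates this loop while retaining comparable guarantees; its specific mechanism is not described in the abstract, so it is not sketched here.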