Kernel methods on discrete domains have shown great promise for many challenging data types, such as biological sequence data and molecular structure data. Scalable kernel methods like support vector machines may offer good predictive performance but do not intrinsically provide uncertainty estimates. In contrast, probabilistic kernel methods like Gaussian processes offer uncertainty estimates in addition to good predictive performance, but fall short in terms of scalability. While the scalability of Gaussian processes can be improved using sparse inducing point approximations, the selection of these inducing points remains challenging. We explore different techniques for selecting inducing points on discrete domains, including greedy selection, determinantal point processes, and simulated annealing. We find that simulated annealing, which can select inducing points that are not in the training set, can perform competitively with support vector machines and full Gaussian processes on synthetic data, as well as on challenging real-world DNA sequence data.
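To make the idea of annealing over inducing points concrete, the following is a minimal sketch and not the authors' implementation. Everything specific in it is an assumption chosen for illustration: the exponentiated Hamming kernel (`hamming_kernel`), the Nyström trace residual used as the cost (`nystrom_residual`), the single-character mutation proposal, and the geometric cooling schedule. The abstract specifies none of these. What the sketch does show is the property the abstract highlights: because proposals mutate sequences freely, the selected inducing points need not belong to the training set.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed alphabet for DNA-like toy sequences


def hamming_kernel(xs, ys, lengthscale=2.0):
    """Assumed kernel: exp(-Hamming distance / lengthscale), equal-length strings."""
    a = np.array([list(x) for x in xs])
    b = np.array([list(y) for y in ys])
    dist = (a[:, None, :] != b[None, :, :]).sum(axis=2)
    return np.exp(-dist / lengthscale)


def nystrom_residual(train, inducing, jitter=1e-6):
    """Assumed cost: tr(K_nn) - tr(K_nm K_mm^{-1} K_mn); lower is better."""
    k_mm = hamming_kernel(inducing, inducing) + jitter * np.eye(len(inducing))
    k_mn = hamming_kernel(inducing, train)
    solved = np.linalg.solve(k_mm, k_mn)
    # k(x, x) = 1 for this kernel, so tr(K_nn) equals the number of training points.
    return len(train) - np.sum(k_mn * solved)


def anneal_inducing_points(train, num_inducing, steps=300,
                           t_start=1.0, t_end=1e-3, seed=0):
    """Simulated annealing over sets of sequences (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Start from a random subset of the training sequences.
    current = [str(s) for s in rng.choice(train, size=num_inducing, replace=False)]
    current_cost = nystrom_residual(train, current)
    best, best_cost = list(current), current_cost
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temperature = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        # Proposal: change one character of one inducing sequence.
        proposal = list(current)
        i = rng.integers(num_inducing)
        pos = rng.integers(len(proposal[i]))
        choices = [c for c in ALPHABET if c != proposal[i][pos]]
        new_char = choices[rng.integers(len(choices))]
        proposal[i] = proposal[i][:pos] + new_char + proposal[i][pos + 1:]
        cost = nystrom_residual(train, proposal)
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if cost < current_cost or rng.random() < np.exp((current_cost - cost) / temperature):
            current, current_cost = proposal, cost
            if cost < best_cost:
                best, best_cost = list(proposal), cost
    return best, best_cost


# Toy usage on random sequences (synthetic data, for illustration only).
rng = np.random.default_rng(1)
train = ["".join(rng.choice(list(ALPHABET), size=8)) for _ in range(30)]
inducing, cost = anneal_inducing_points(train, num_inducing=5)
print(inducing, cost)
```

In a sparse Gaussian process, a cost tied to the model itself (for example a variational bound) would be a more natural objective than this kernel-only residual; the residual is used here only because it keeps the sketch short.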