Within machine learning model evaluation regimes, feature selection is a technique to reduce model complexity and improve model performance with regard to generalization, model fit, and prediction accuracy. However, searching the space of features for the subset of $k$ optimal features is a known NP-hard problem. In this work, we study metrics for encoding the combinatorial search as a binary quadratic model, such as the Generalized Mean Information Coefficient and the Pearson Correlation Coefficient, applied to the underlying regression problem of price prediction. We investigate the trade-offs, in terms of run-times and model performance, of leveraging quantum-assisted vs. classical subroutines for the combinatorial search, using minimum redundancy maximal relevancy (mRMR) as the heuristic for our approach. We achieve accuracy scores of 0.9 (on a $[0,1]$ scale) for finding optimal subsets on synthetic data using a new metric that we define. We test and cross-validate predictive models on a real-world price prediction problem, and show an improvement in mean absolute error for our quantum-assisted method $(1471.02 \pm 135.6)$ over comparable methodologies such as recursive feature elimination $(1678.3 \pm 143.7)$. Our findings show that by leveraging quantum-assisted routines we find solutions that increase the quality of predictive model output while reducing the input dimensionality to the learning algorithm, on both synthetic and real-world data.
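To make the encoding concrete, the following is a minimal sketch of how an mRMR-style feature selection problem can be cast as a binary quadratic model: diagonal terms reward relevance (correlation of a feature with the target) and off-diagonal terms penalize redundancy (correlation between pairs of features). The function names `mrmr_qubo` and `brute_force_solve` are illustrative, not from the paper, and the exhaustive classical solver stands in for the quantum-assisted subroutine on small instances; the paper's actual metrics (e.g. the Generalized Mean Information Coefficient) are replaced here by plain Pearson correlation for brevity.

```python
import numpy as np

def mrmr_qubo(X, y, alpha=1.0):
    """Build an upper-triangular QUBO matrix for feature selection.

    Diagonal Q[i, i] rewards relevance: -|corr(feature_i, target)|.
    Off-diagonal Q[i, j] penalizes redundancy: alpha * |corr(feature_i, feature_j)|.
    This is an illustrative mRMR-style encoding, not the paper's exact one.
    """
    n = X.shape[1]
    Q = np.zeros((n, n))
    for i in range(n):
        Q[i, i] = -abs(np.corrcoef(X[:, i], y)[0, 1])
        for j in range(i + 1, n):
            Q[i, j] = alpha * abs(np.corrcoef(X[:, i], X[:, j])[0, 1])
    return Q

def brute_force_solve(Q):
    """Exhaustively minimize x^T Q x over binary vectors x.

    A classical stand-in for the quantum annealer; only feasible
    for small feature counts (cost grows as 2^n).
    """
    n = Q.shape[0]
    best_x, best_e = None, np.inf
    for m in range(1, 2 ** n):
        x = np.array([(m >> k) & 1 for k in range(n)], dtype=float)
        e = x @ Q @ x
        if e < best_e:
            best_x, best_e = x, e
    return best_x, best_e
```

For example, given one feature strongly correlated with the target, one noise feature, and an exact duplicate of the first, the minimizer selects exactly one of the duplicated pair, since the redundancy penalty outweighs the second copy's relevance reward.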