We consider the variable selection problem for two-sample tests: selecting the most informative features for distinguishing samples drawn from two groups. We propose a framework based on the kernel maximum mean discrepancy (MMD) and derive equivalent mixed-integer programming formulations for linear, quadratic, and Gaussian kernels. The framework combines computational efficiency with favorable statistical properties. (i) For the linear kernel, we provide a closed-form solution. For the quadratic kernel, the problem is NP-hard, but we give an exact mixed-integer semidefinite programming formulation, which motivates both exact and approximation algorithms. For the Gaussian kernel, we propose a convex-concave procedure that finds critical points. (ii) We provide non-asymptotic uncertainty quantification for the proposed formulation under both the null and alternative hypotheses. Experimental results demonstrate the good performance of our framework.
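To give a feel for why the linear kernel case can admit a closed-form solution, here is a minimal sketch under simplifying assumptions that are ours and not necessarily the paper's exact formulation: features are selected by a binary mask with exactly `k` ones, and the biased MMD estimate is used. With a linear kernel restricted to the selected coordinates, the squared MMD reduces to the sum of squared differences of the per-feature sample means, so the maximizing mask simply keeps the `k` features with the largest squared mean difference. The function names `linear_mmd_scores` and `select_top_k` and the synthetic data are hypothetical, introduced only for illustration.

```python
import numpy as np


def linear_mmd_scores(X, Y):
    """Per-feature contribution to the biased linear-kernel MMD^2.

    For k(x, y) = <x, y> on a subset of coordinates, the biased MMD^2
    equals the sum over those coordinates of (mean_x - mean_y)^2.
    """
    return (X.mean(axis=0) - Y.mean(axis=0)) ** 2


def select_top_k(X, Y, k):
    """Pick the k features with the largest score (assumed binary-mask model)."""
    scores = linear_mmd_scores(X, Y)
    top = np.argsort(scores)[::-1][:k]  # indices of the k largest scores
    return np.sort(top), scores


# Synthetic example: the two groups differ only in the mean of features 2 and 7.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
Y = rng.normal(size=(200, 10))
Y[:, [2, 7]] += 2.0

selected, scores = select_top_k(X, Y, k=2)
print("selected features:", selected)
print("MMD^2 on selected features:", scores[selected].sum())
```

This ranking only detects differences in means, which is what a linear kernel can see; differences in higher moments call for the quadratic or Gaussian kernels, where the selection problem no longer decouples across features.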