Selecting a handful of truly relevant variables for supervised machine learning algorithms is a challenging problem: existing methods typically rely on untestable assumptions and offer no theoretical guarantee that selection errors are kept under control. We propose a distribution-free feature selection method, referred to as Data Splitting Selection (DSS), which controls the False Discovery Rate (FDR) of feature selection while achieving high power. A second version of DSS with higher power is also proposed, which "almost" controls the FDR. No assumptions are made on the distribution of the response or on the joint distribution of the features. Extensive simulations compare the performance of the proposed methods with existing ones.
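For reference, the FDR and power being discussed are the usual quantities for a selection rule that returns a set of features: the expected proportion of falsely selected features among all selections, and the expected proportion of truly relevant features that are recovered. The notation below is the standard one, not taken from this abstract:

\[
\mathrm{FDR} \;=\; \mathbb{E}\!\left[\frac{|\widehat{S} \cap S_0|}{|\widehat{S}| \vee 1}\right],
\qquad
\text{power} \;=\; \mathbb{E}\!\left[\frac{|\widehat{S} \cap S_1|}{|S_1|}\right],
\]

where \(\widehat{S}\) is the selected feature set, \(S_0\) the set of truly irrelevant (null) features, and \(S_1\) the set of truly relevant features.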