Feature selection is an important process in machine learning: it builds an interpretable and robust model by selecting the features that contribute most to the prediction target. However, most mature feature selection algorithms, both supervised and semi-supervised, fail to fully exploit the complex latent structure among features. We believe these structures are crucial to the feature selection process, especially when labels are scarce and data are noisy. To this end, we introduce a deep-learning-based self-supervised mechanism into the feature selection problem, namely batch-Attention-based Self-supervision Feature Selection (A-SFS). First, a multi-task self-supervised autoencoder is designed to uncover the hidden structure among features with the support of two pretext tasks. Guided by the integrated information from this multi-task self-supervised learning model, a batch-attention mechanism generates feature weights according to batch-based feature selection patterns, alleviating the impact of a handful of noisy samples. The method is compared against 14 strong benchmarks, including LightGBM and XGBoost. Experimental results show that A-SFS achieves the highest accuracy on most datasets. Furthermore, this design significantly reduces the reliance on labels: only 1/10 of the labeled data is needed to match the performance of state-of-the-art baselines. Results also show that A-SFS is the most robust to noisy and missing data.