In many real-world problems, features act not alone but in combination. For example, in genomics, a disease might be caused not by any single mutation but by the joint presence of several mutations. Prior work on feature selection either seeks to identify individual features or can only select relevant groups from a predefined set of candidates. We investigate the problem of discovering groups of predictive features without predefined grouping. To do so, we define predictive groups in terms of linear and non-linear interactions between features. We introduce a novel deep learning architecture that uses an ensemble of feature selection models to find predictive groups, without requiring candidate groups to be provided. The selected groups are sparse and exhibit minimal overlap. Furthermore, we propose a new metric to measure the similarity between discovered groups and the ground truth. We demonstrate the utility of our model on multiple synthetic tasks and semi-synthetic chemistry datasets, where the ground-truth structure is known, as well as on an image dataset and a real-world cancer dataset.
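To make the idea of features that are predictive only in combination concrete, the following is a minimal sketch on a toy XOR problem. It is an illustrative assumption, not the architecture or the metric described above; the data and the helper names `label_rate` and `group_determines_label` are hypothetical.

```python
from itertools import product

# Hypothetical toy data: two binary features and a label equal to their XOR.
# Each row is (x1, x2, y).
rows = [(x1, x2, x1 ^ x2) for x1, x2 in product([0, 1], repeat=2)]


def label_rate(rows, feature_idx, value):
    """Fraction of positive labels among rows where one feature equals `value`."""
    matched = [r[-1] for r in rows if r[feature_idx] == value]
    return sum(matched) / len(matched)


def group_determines_label(rows, feature_idxs):
    """True if every setting of the feature group maps to a single label."""
    seen = {}
    for r in rows:
        key = tuple(r[i] for i in feature_idxs)
        # setdefault stores the first label seen for this key and returns it.
        if seen.setdefault(key, r[-1]) != r[-1]:
            return False
    return True


# Each feature alone leaves the label at chance level...
single_rates = {(i, v): label_rate(rows, i, v) for i in (0, 1) for v in (0, 1)}
single_ok = [group_determines_label(rows, (i,)) for i in (0, 1)]

# ...while the pair, taken as a group, determines the label exactly.
pair_ok = group_determines_label(rows, (0, 1))
```

In this sketch, a selector that scores features one at a time would see no signal in either feature, whereas a method that evaluates the pair jointly would recover it as a predictive group.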