In this paper, we propose wrappers for feature subset selection (FSS) based on parallel and distributed hybrid evolutionary algorithms, viz., parallel binary differential evolution and threshold accepting (PB-DETA) and parallel binary threshold accepting and differential evolution (PB-TADE), under the Apache Spark environment. Here, FSS is formulated as a combinatorial optimization problem. PB-TADE invokes two optimization algorithms, threshold accepting (TA) and binary differential evolution (BDE), in tandem in every iteration, while in PB-DETA, BDE is invoked first and TA then takes over, again in every iteration. In addition to these hybrids, parallel binary differential evolution (P-BDE) is also developed, both to investigate the role played by TA and to serve as a baseline for comparison. For all three proposed approaches, logistic regression (LR) is used to compute the fitness function, namely the area under the ROC curve (AUC). The effectiveness of the parallel and distributed wrappers is assessed on five large datasets of varying feature-space dimension from the cyber security and biology domains. Notably, PB-TADE turned out to be statistically significantly better than P-BDE and PB-DETA. The speedup is reported with respect to the sequential versions of the three wrappers. The average AUC score, the most frequently repeated feature subsets, and the feature subsets of least cardinality that achieve the best AUC score are also reported. Further, our proposed methods outperformed the state-of-the-art results, wherever such results were reported.
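The abstract does not spell out the update rules, so the following is only a minimal, sequential sketch of how a TA step and a BDE step might be chained in every iteration, in the spirit of PB-TADE. Everything specific here is an assumption made for illustration: the single-bit-flip TA neighbourhood, the XOR-based binary mutation, the parameter values, the threshold decay, and the `toy_fitness` function, which stands in for the LR-based AUC fitness used in the paper. The Spark parallelization is not reproduced.

```python
import random

RELEVANT = {0, 3, 5}  # hypothetical "useful" feature indices for the toy fitness


def toy_fitness(mask):
    """Stand-in for the paper's AUC-of-LR fitness (assumption): reward
    selecting relevant features, lightly penalize subset size."""
    hits = sum(mask[j] for j in RELEVANT)
    return hits / len(RELEVANT) - 0.01 * sum(mask)


def ta_step(mask, fitness, threshold, rng):
    """Threshold accepting: flip one random bit (assumed neighbourhood) and
    keep the neighbour unless it is worse by more than `threshold`."""
    j = rng.randrange(len(mask))
    candidate = mask[:j] + [1 - mask[j]] + mask[j + 1:]
    if fitness(candidate) >= fitness(mask) - threshold:
        return candidate
    return mask


def bde_step(population, fitness, rng, crossover_rate=0.5):
    """One generation of a simple binary DE (assumed variant).
    Requires at least four individuals."""
    new_population = []
    for i, target in enumerate(population):
        others = [p for k, p in enumerate(population) if k != i]
        a, b, c = rng.sample(others, 3)
        # Binary mutation (assumption): flip a's bit wherever b and c disagree.
        mutant = [aj ^ (bj ^ cj) for aj, bj, cj in zip(a, b, c)]
        forced = rng.randrange(len(target))  # at least one bit comes from the mutant
        trial = [m if (rng.random() < crossover_rate or j == forced) else t
                 for j, (t, m) in enumerate(zip(target, mutant))]
        # Greedy selection: the trial replaces the target only if it is no worse.
        new_population.append(trial if fitness(trial) >= fitness(target) else target)
    return new_population


def tade(fitness, n_features, pop_size=10, iterations=30,
         threshold=0.02, crossover_rate=0.5, seed=0):
    """Run TA on every individual, then BDE on the population, each iteration."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(iterations):
        population = [ta_step(m, fitness, threshold, rng) for m in population]
        population = bde_step(population, fitness, rng, crossover_rate)
        threshold *= 0.95  # assumed geometric decay of the TA threshold
    return max(population, key=fitness)


best_mask = tade(toy_fitness, n_features=8)
print(best_mask, toy_fitness(best_mask))
```

Swapping the two calls inside the loop of `tade` would give the BDE-first ordering described for PB-DETA, and dropping the `ta_step` line would leave a plain BDE baseline analogous to P-BDE. Because `fitness` is passed in as a callable, an AUC score computed from a logistic regression model trained on the selected columns could be substituted for `toy_fitness`.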