Scalable mRMR feature selection to handle high dimensional datasets: Vertical partitioning based Iterative MapReduce framework

When building machine learning models, feature selection (FS) is an essential preprocessing step used to handle uncertainty and vagueness in the data. Recently, the minimum Redundancy and Maximum Relevance (mRMR) approach has proven effective in obtaining a non-redundant feature subset. Owing to the generation of voluminous datasets, it is essential to design scalable solutions using distributed/parallel paradigms. MapReduce is regarded as one of the best approaches to designing fault-tolerant and scalable solutions. This work analyses the existing MapReduce approaches for mRMR feature selection and identifies their limitations. In the current study, we propose VMR_mRMR, an efficient vertical partitioning-based approach that uses a memorization approach, thereby overcoming the limitations of the extant approaches. The experimental analysis shows that VMR_mRMR significantly outperforms the extant approaches and achieves a better computational gain (C.G). In addition, we conduct a comparative analysis with the horizontal partitioning approach HMR_mRMR [1] to assess the strengths and limitations of the proposed approach.
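For orientation, the greedy mRMR criterion that such frameworks distribute can be sketched on a single machine. The block below is a minimal illustration of the standard difference-form mRMR (relevance minus mean redundancy), not the VMR_mRMR algorithm itself. The names `mutual_information` and `mrmr_select` are hypothetical, features are assumed to be discrete, and the column-wise input only mimics a vertical layout. Caching a running redundancy sum per candidate is one plausible way to avoid recomputing pairwise terms in each iteration; whether it matches the paper's scheme is an assumption.

```python
from collections import Counter
from math import log


def mutual_information(x, y):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    # Sum p(a, b) * log(p(a, b) / (p(a) * p(b))) over observed value pairs.
    return sum(
        (c / n) * log(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def mrmr_select(columns, labels, k):
    """Greedy mRMR (difference form): return up to k selected feature indices.

    columns: list of feature columns, one list per feature (assumed layout).
    labels:  class labels, same length as each column.
    """
    relevance = [mutual_information(col, labels) for col in columns]
    # Cached running sum of redundancy with the already selected features,
    # so each iteration only adds the term for the newest selection.
    redundancy_sum = [0.0] * len(columns)
    selected = []
    remaining = set(range(len(columns)))

    while remaining and len(selected) < k:
        def score(j):
            mean_red = redundancy_sum[j] / len(selected) if selected else 0.0
            return relevance[j] - mean_red

        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)

        for j in remaining:
            redundancy_sum[j] += mutual_information(columns[j], columns[best])

    return selected


if __name__ == "__main__":
    a = [0, 0, 0, 0, 1, 1, 1, 1]
    b = [0, 0, 1, 1, 0, 0, 1, 1]
    y = [2 * ai + bi for ai, bi in zip(a, b)]
    # Feature 1 duplicates feature 0, so it is skipped in favour of feature 2.
    print(mrmr_select([a, list(a), b], y, 2))
```

In this toy input the duplicated column has the same relevance as the first one but full redundancy with it, so the second pick goes to the independent column instead.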
