We present a new data analysis perspective for determining variable importance independently of the underlying learning task. Traditionally, variable selection has been treated as an important step in supervised learning, for both classification and regression problems. Variable selection also becomes critical when data collection and storage are costly, as in remote sensing. We therefore propose a methodology that selects important variables by first building a dependency network among all variables and then ranking the variables (i.e., the network's nodes) by graph centrality measures. Selecting the top-$n$ variables according to a preferred centrality measure yields a strong candidate subset for subsequent learning tasks such as clustering. We provide our tool as an app built with Shiny, an environment for developing user-friendly interfaces. For comparison, the user interface also covers two well-known unsupervised variable selection methods from the literature.
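The two-step idea (build a dependency network, then rank nodes by centrality) can be sketched in a few lines of Python. This is only an illustrative sketch, not the tool described here: the abstract does not say how dependencies are measured or which centrality measures are offered, so the absolute Pearson correlation, the `threshold` parameter, the choice of degree centrality, and all function names below are assumptions.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    # A constant column has no defined correlation; treat it as independent.
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def dependency_network(columns, threshold=0.5):
    """Assumed dependency rule: link two variables when |r| >= threshold."""
    graph = {name: set() for name in columns}
    for a, b in combinations(columns, 2):
        if abs(pearson(columns[a], columns[b])) >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def degree_centrality(graph):
    """Fraction of the other variables each variable is linked to."""
    denom = max(len(graph) - 1, 1)
    return {node: len(nbrs) / denom for node, nbrs in graph.items()}


def top_n_variables(columns, n, threshold=0.5):
    """Rank variables by centrality (ties broken by name) and keep the top n."""
    scores = degree_centrality(dependency_network(columns, threshold))
    ranked = sorted(scores, key=lambda v: (-scores[v], v))
    return ranked[:n]


# Toy data: a, b, c are linearly related; d is unrelated to the others.
columns = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 4, 3, 2, 1],
    "d": [1, -1, 1, -1, 1],
}
selected = top_n_variables(columns, n=3)
```

No class labels or response variable appear anywhere in the sketch, which is what makes the ranking usable before an unsupervised task such as clustering. Any other dependency measure or centrality measure could be substituted in `dependency_network` and `degree_centrality` without changing the overall flow.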