A key task of data science is to identify the relevant features linked to the output variables that are to be modeled or predicted. To obtain a small but meaningful model, it is important to find stochastically independent variables that capture all the information needed to model or predict the output variables sufficiently well. In this work we therefore introduce a framework to detect linear and non-linear dependencies between features. As we will show, features that are functions of other features carry no further information. Consequently, a model reduction that neglects such features preserves the relevant information, reduces noise, and thus improves the quality of the model. Furthermore, a smaller model is easier to adopt for a given system. In addition, the approach structures the dependencies among all considered features. This provides advantages for classical modeling, ranging from regression to differential equations, and for machine learning. To show the generality and applicability of the presented framework, 2154 features of a data center are measured and a model for classifying faulty and non-faulty states of the data center is set up. The framework automatically reduces this number to 161 features. The prediction accuracy of the reduced model even improves on that of the model trained on all features. A second example is the analysis of a gene expression data set in which 9 genes are extracted from 9513 genes, and two cell clusters of macrophages can be distinguished from their expression levels.
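The idea that a feature which is a function of other features adds no information can be illustrated with a small sketch. The code below is not the framework introduced in this work; it is a simplified stand-in that assumes dependencies can be captured by a least-squares fit on a small, user-chosen basis (here the identity and the square) of the features kept so far. The function name `redundant_features`, the `basis` parameter, and the tolerance `tol` are hypothetical choices made for illustration.

```python
import numpy as np

def redundant_features(X, basis=(lambda x: x, lambda x: x ** 2), tol=1e-8):
    """Greedy illustration (an assumption, not the paper's method).

    Walks through the columns of X in order and drops a column if it is
    reproduced, up to relative squared error `tol`, by a least-squares fit
    on an intercept plus `basis` functions of the columns kept so far.
    """
    n, p = X.shape
    kept, dropped = [], []
    for j in range(p):
        y = X[:, j]
        # Design matrix: intercept plus basis expansions of the kept columns.
        cols = [np.ones(n)]
        for k in kept:
            cols.extend(f(X[:, k]) for f in basis)
        A = np.column_stack(cols)
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        rss = np.sum((y - A @ coef) ** 2)
        tss = np.sum((y - y.mean()) ** 2)
        # A constant column (tss == 0) is a function of the intercept alone.
        if tss == 0 or rss / tss < tol:
            dropped.append(j)
        else:
            kept.append(j)
    return kept, dropped

# Synthetic demo data: two independent features and two dependent ones.
rng = np.random.default_rng(0)
x0 = rng.normal(size=200)
x1 = rng.normal(size=200)
x2 = 2.0 * x0 - 3.0 * x1 + 1.0   # linear function of x0 and x1
x3 = x0 ** 2                     # non-linear function of x0
X = np.column_stack([x0, x1, x2, x3])

kept, dropped = redundant_features(X)
kept_linear, dropped_linear = redundant_features(X, basis=(lambda x: x,))
```

With the quadratic basis both dependent columns are flagged, while a purely linear basis misses `x3`. This is why a method restricted to linear dependencies would leave redundant features in the model, and it is the gap that a framework for non-linear dependencies is meant to close.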