In this paper we introduce a set of novel features for identifying the underlying stochastic behavior of input data using the Karhunen-Loève expansion. These features are constructed by applying a coordinate transformation based on recent Functional Data Analysis theory for anomaly detection. The associated signal decomposition is an exact hierarchical tensor product expansion with known optimality properties for approximating stochastic processes (random fields) with finite-dimensional function spaces. In principle, these low-dimensional spaces can capture most of the stochastic behavior of "underlying signals" in a given nominal class and can reject signals in alternative classes as stochastic anomalies. Using a hierarchical finite-dimensional expansion of the nominal class, a series of orthogonal nested subspaces is constructed for detecting anomalous signal components. Projection coefficients of input data in these subspaces are then used to train a Machine Learning (ML) classifier. Because the signal is split into nominal and anomalous projection components, clearer separation surfaces between the classes arise. In fact, we show that with a sufficiently accurate estimate of the covariance structure of the nominal class, a sharp classification can be obtained. This is particularly advantageous for large unbalanced datasets. We formulate this concept and demonstrate it on a number of high-dimensional datasets in cancer diagnostics. The approach yields significant increases in accuracy over ML methods that use the original feature data, and a significant increase in precision and accuracy over the current top benchmarks for the Global Cancer Map (GCM) gene expression network dataset. Furthermore, tests on unbalanced semi-synthetic datasets created from the GCM data confirm increased accuracy as the dataset becomes more unbalanced.
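The split of a signal into nominal and anomalous projection components can be illustrated with a minimal sketch. It is a single-level simplification, not the hierarchical nested-subspace construction described above: it assumes the nominal covariance is estimated empirically and truncated to a fixed number of Karhunen-Loève (principal) components. The function names `fit_nominal_basis` and `split_features`, the synthetic data, and all sizes are hypothetical choices made for illustration only.

```python
import numpy as np

def fit_nominal_basis(X_nominal, n_components):
    """Estimate a truncated KL basis from nominal samples (one sample per row)."""
    mean = X_nominal.mean(axis=0)
    centered = X_nominal - mean
    # Right singular vectors of the centered data are the eigenvectors of the
    # empirical covariance, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components].T  # shape (n_features, n_components), orthonormal columns
    return mean, basis

def split_features(X, mean, basis):
    """Return nominal-subspace coefficients and the norm of the orthogonal residual."""
    centered = X - mean
    coeffs = centered @ basis                # projection onto the nominal subspace
    residual = centered - coeffs @ basis.T   # component outside the nominal subspace
    return coeffs, np.linalg.norm(residual, axis=1)

# Synthetic illustration (assumed data, not from the paper): nominal signals live
# near a 3-dimensional subspace of R^50; anomalies add an orthogonal direction.
rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.normal(size=(50, 4)))   # 4 orthonormal directions
nominal = rng.normal(size=(200, 3)) @ q[:, :3].T + 0.01 * rng.normal(size=(200, 50))
anomalous = (rng.normal(size=(20, 3)) @ q[:, :3].T + 2.0 * q[:, 3]
             + 0.01 * rng.normal(size=(20, 50)))

mean, basis = fit_nominal_basis(nominal, n_components=3)
coeffs_nom, resid_nom = split_features(nominal, mean, basis)
coeffs_anom, resid_anom = split_features(anomalous, mean, basis)

# Stack coefficients and residual norm into one feature vector per sample;
# these features could then be passed to any classifier.
features_nom = np.column_stack([coeffs_nom, resid_nom])
features_anom = np.column_stack([coeffs_anom, resid_anom])
```

In this toy setting the residual norm alone separates the two groups, because the anomalous direction is orthogonal to the estimated nominal subspace. With real data the separation depends on how accurately the nominal covariance is estimated.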