It is well known that machine learning protocols typically under-utilize information on the probability distributions of feature vectors and related data, and instead directly compute regression or classification functions of the feature vectors. In this paper we introduce a set of novel features for identifying the underlying stochastic behavior of input data using the Karhunen-Lo\'{e}ve (KL) expansion, where classification is treated as the detection of anomalies from a (nominal) signal class. These features are constructed from recent Functional Data Analysis (FDA) theory for anomaly detection. The related signal decomposition is an exact hierarchical tensor product expansion with known optimality properties for approximating stochastic processes (random fields) with finite dimensional function spaces. In principle, these primary low dimensional spaces can capture most of the stochastic behavior of `underlying signals' in a given nominal class, and can reject signals in alternative classes as stochastic anomalies. Using a hierarchical finite dimensional KL expansion of the nominal class, a series of orthogonal nested subspaces is constructed for detecting anomalous signal components. Projection coefficients of input data in these subspaces are then used to train an ML classifier. Because the signal is split into nominal and anomalous projection components, clearer separation surfaces between the classes arise. In fact, we show that with a sufficiently accurate estimate of the covariance structure of the nominal class, a sharp classification can be obtained. We carefully formulate this concept and demonstrate it on a number of high-dimensional datasets in cancer diagnostics. This method leads to a significant increase in precision and accuracy over the current top benchmarks for the Global Cancer Map (GCM) gene expression network dataset.
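As a rough illustration of the idea of splitting a signal into nominal and anomalous projection components, the following minimal sketch uses a plain single-level truncated KL expansion (equivalent to PCA on the nominal class) rather than the hierarchical nested-subspace construction described above. The function names `fit_nominal_kl` and `kl_features`, the synthetic data, and the choice of three components are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def fit_nominal_kl(X_nominal, n_components):
    """Estimate a truncated KL (PCA) basis from nominal-class samples (one per row)."""
    mean = X_nominal.mean(axis=0)
    centered = X_nominal - mean
    # Right singular vectors of the centered data are the eigenvectors
    # of the sample covariance matrix, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components].T  # columns are the leading eigenvectors
    return mean, basis


def kl_features(X, mean, basis):
    """Return nominal-subspace coefficients and the norm of the residual component."""
    centered = X - mean
    coeffs = centered @ basis                # projection onto the nominal subspace
    residual = centered - coeffs @ basis.T   # part in the orthogonal complement
    residual_norm = np.linalg.norm(residual, axis=1)
    return coeffs, residual_norm


# Synthetic data (hypothetical): nominal signals live near a 3-dimensional
# subspace of R^50; anomalous signals carry a large off-subspace component.
rng = np.random.default_rng(0)
dim, latent = 50, 3
mixing = rng.normal(size=(latent, dim))


def sample(n, noise_scale):
    return rng.normal(size=(n, latent)) @ mixing + noise_scale * rng.normal(size=(n, dim))


nominal_train = sample(200, 0.01)
nominal_test = sample(20, 0.01)
anomalous_test = sample(20, 1.0)

mean, basis = fit_nominal_kl(nominal_train, n_components=latent)
coeffs_nom, resid_nom = kl_features(nominal_test, mean, basis)
coeffs_anom, resid_anom = kl_features(anomalous_test, mean, basis)

# Feature matrices that could be passed to any off-the-shelf classifier.
features_nom = np.column_stack([coeffs_nom, resid_nom])
features_anom = np.column_stack([coeffs_anom, resid_anom])
```

In this toy setting the residual norm alone already separates the two groups; with real data, the coefficients and residual features would be handed to a classifier, and the quality of the separation would depend on how well the nominal covariance is estimated.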