When data contain both signals and noise, physicists try to identify the signals by modeling them, whereas statisticians take the opposite approach and model the noise to identify the signals. In this study, we applied the statisticians' approach to signal detection in physics data with small sample sizes and high dimensionality, without modeling the signals. Most data in nature, whether noise or signal, are assumed to be generated by dynamical systems; thus, there is essentially no distinction between their generating processes. We propose that the correlation length of a dynamical system and the number of samples are crucial for a practical definition of the noise variables among the signal variables generated by such a system. Because variables with short-term correlations approach normal distributions faster as the number of samples decreases, they are regarded as ``noise-like'' variables, whereas variables with the opposite property are regarded as ``signal-like'' variables. Normality tests are not effective for high-dimensional data with small sample sizes. Therefore, we modeled noise on the basis of a property of noise variables, namely the uniformity of the histogram of the probability that a variable is noise. We devised a method for detecting signal variables from the structural change in this histogram as the number of samples decreases. We applied our method to data generated by a globally coupled map, which can produce time series with different correlation lengths, and to gene expression data, which are typical static, high-dimensional data with small sample sizes, and we successfully detected signal variables in both. Moreover, we examined the assumption that the gene expression data are also potentially generated by a dynamical system, and found that it is compatible with the results of signal extraction.
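As an illustration of the globally coupled map mentioned above, the following is a minimal sketch and not the authors' implementation. It assumes the standard Kaneko form, $x_i(n+1) = (1-\varepsilon) f(x_i(n)) + \frac{\varepsilon}{N} \sum_j f(x_j(n))$ with $f(x) = 1 - a x^2$, which the abstract does not spell out. The function names and the parameter values (`a`, `eps`, `n_units`, `n_steps`) are hypothetical choices made for illustration only.

```python
import random


def local_map(x, a):
    """Local map f(x) = 1 - a*x^2; maps [-1, 1] into itself for 0 <= a <= 2."""
    return 1.0 - a * x * x


def simulate_gcm(n_units=10, n_steps=200, a=1.8, eps=0.1, seed=0):
    """Iterate an assumed Kaneko-type globally coupled map.

    Returns a list of n_steps states, each a list of n_units values.
    All parameter defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    x = [rng.uniform(-1.0, 1.0) for _ in range(n_units)]
    series = []
    for _ in range(n_steps):
        fx = [local_map(v, a) for v in x]
        mean_field = sum(fx) / n_units  # global coupling term
        x = [(1.0 - eps) * f + eps * mean_field for f in fx]
        series.append(x)
    return series


def autocorrelation(values, lag):
    """Sample autocorrelation of one variable's time series at a given lag."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) * (v - mean) for v in values)
    if var == 0.0:
        return 0.0
    cov = sum((values[t] - mean) * (values[t + lag] - mean)
              for t in range(n - lag))
    return cov / var


if __name__ == "__main__":
    series = simulate_gcm()
    first_unit = [state[0] for state in series]
    for lag in (1, 2, 5, 10):
        print(lag, autocorrelation(first_unit, lag))
```

Varying `a` and `eps` changes how quickly the autocorrelation of each unit decays, which is one simple way to obtain time series with different correlation lengths for experiments of the kind described here.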