Omics technologies are powerful tools for analyzing patterns in the expression of thousands of genes. Because of systematic variation between experiments, raw gene expression data are often obscured by undesirable technical noise. Various normalization techniques have been designed to remove these non-biological errors before any statistical analysis. One reason to normalize data is the need to recover the covariance matrix used in gene network analysis. In this paper, we introduce a novel normalization technique called the covariance shift (C-SHIFT) method. This normalization algorithm combines optimization techniques with the blessing-of-dimensionality philosophy and an energy minimization hypothesis to recover the covariance matrix under additive noise (known in biology as bias). It is therefore well suited to the analysis of logarithmic gene expression data, in which multiplicative technical effects become additive. Numerical experiments on synthetic data demonstrate the method's advantage over classical normalization techniques, namely Rank, Quantile, cyclic LOESS (locally estimated scatterplot smoothing), and MAD (median absolute deviation) normalization. We also evaluate the performance of the C-SHIFT algorithm on real biological data.
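To make the additive-noise setting concrete, the following minimal sketch simulates log-expression data with one additive bias per sample and compares the observed covariance matrix with the true one. It is an illustration of the problem only, not the C-SHIFT algorithm. The model (independent unit-variance genes, Gaussian bias shared by all genes within a sample) and all names (`simulate`, `bias_sd`, and so on) are assumptions made for this example and do not come from the paper.

```python
import numpy as np

def simulate(n_genes=200, n_samples=5000, bias_sd=0.7, seed=0):
    """Simulate log-expression data corrupted by a per-sample additive bias.

    Hypothetical model for illustration: genes are independent with unit
    variance, and each sample receives one random offset shared by all genes.
    """
    rng = np.random.default_rng(seed)
    # True log-expression values: rows are samples, columns are genes.
    true = rng.normal(size=(n_samples, n_genes))
    # One additive bias per sample, broadcast across all genes in that sample.
    bias = rng.normal(scale=bias_sd, size=(n_samples, 1))
    return true, true + bias

true, observed = simulate()

# Gene-by-gene covariance matrices (columns are the variables).
cov_true = np.cov(true, rowvar=False)
cov_obs = np.cov(observed, rowvar=False)

# Entry-wise difference between observed and true covariance.
shift = cov_obs - cov_true
off_diag = shift[~np.eye(shift.shape[0], dtype=bool)]

# Under this model every entry is shifted by roughly Var(bias) = bias_sd**2.
print("mean off-diagonal shift:", off_diag.mean())
print("spread of the shift:   ", off_diag.std())
```

Under these assumptions the bias adds nearly the same constant, about `bias_sd**2`, to every entry of the covariance matrix, so genes that are truly uncorrelated appear positively correlated. This is the kind of distortion that normalization is meant to undo before gene network analysis.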