Normalization and outlier detection are part of the preprocessing of gene expression data. We propose a natural normalization procedure based on statistical data depth, which normalizes to the distribution of gene expressions of the most representative gene expression profile in the group. This differs from the standard method, quantile normalization, which is based on the coordinate-wise median array, an array that lacks the well-known properties of the one-dimensional median. Statistical data depth preserves those properties. Gene expression data are known to contain outliers. Although the detection of outlier genes in a given gene expression dataset has been studied extensively, those methodologies do not carry over to detecting outlier samples, because of the difficulties posed by the high-dimension, low-sample-size structure of the data. The standard procedures for detecting outlier samples are visual and rely on dimension reduction techniques; examples include multidimensional scaling and spectral map plots. For detecting outlier genes in a given gene expression dataset, we propose an analytical procedure based on Tukey's concept of outlier and the notion of statistical depth, as previous methodologies lead to inconclusive and incorrect outlier identifications. We identify the outliers of four datasets as a necessary step for further research.
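To make the two ideas in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes a samples-by-genes matrix, uses the L2 depth (one over one plus the mean Euclidean distance to the other samples) as a stand-in for whichever depth the paper employs, computes depth on the sorted profiles, and applies a one-sided Tukey fence with the conventional factor 1.5 to the depth values. The function names `l2_depth`, `depth_normalize`, and `tukey_low_outliers` are hypothetical.

```python
import numpy as np


def l2_depth(X):
    """L2 depth of each row of X with respect to all rows (assumed depth choice).

    depth(x_i) = 1 / (1 + mean_j ||x_i - x_j||); larger means more central.
    """
    X = np.asarray(X, dtype=float)
    mean_dist = np.array([np.linalg.norm(X - row, axis=1).mean() for row in X])
    return 1.0 / (1.0 + mean_dist)


def depth_normalize(X):
    """Map every sample onto the sorted values of the deepest sample.

    X has shape (n_samples, n_genes). Returns (normalized, depths).
    Within-sample gene ranks are preserved, as in quantile normalization;
    only the reference distribution differs.
    """
    X = np.asarray(X, dtype=float)
    order = np.argsort(X, axis=1)                  # sorting permutation per sample
    sorted_X = np.take_along_axis(X, order, axis=1)
    depths = l2_depth(sorted_X)                    # depth of each sorted profile
    reference = sorted_X[np.argmax(depths)]        # deepest sample's distribution
    ranks = np.argsort(order, axis=1)              # rank of each gene in its sample
    return reference[ranks], depths


def tukey_low_outliers(depths, k=1.5):
    """Flag samples whose depth lies below Q1 - k * IQR (Tukey's lower fence)."""
    depths = np.asarray(depths, dtype=float)
    q1, q3 = np.percentile(depths, [25, 75])
    return depths < q1 - k * (q3 - q1)
```

A typical call would be `normalized, depths = depth_normalize(X)` followed by `tukey_low_outliers(depths)`. Only the lower fence is used because a low depth marks a sample far from the bulk of the data, whereas a high depth marks a central one.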