Computational analysis methods, including machine learning, have had a significant impact on genomics and medicine. High-throughput gene expression analysis methods such as microarray technology and RNA sequencing produce enormous amounts of data. Traditionally, statistical methods are used for comparative analysis of gene expression data. However, more complex analyses, such as classification of sample observations or discovery of feature genes, require more sophisticated computational approaches. In this review, we compile various statistical and computational tools used in the analysis of expression microarray data. Although the methods are discussed in the context of expression microarrays, they can also be applied to the analysis of RNA sequencing and quantitative proteomics datasets. We discuss the types of missing values and the methods and approaches usually employed for their imputation. We also discuss methods of data normalization, feature selection, and feature extraction. Lastly, methods of classification and class discovery, along with their evaluation parameters, are described in detail. We believe that this detailed review will help users select appropriate methods for preprocessing and analyzing their data based on the expected outcome.