Computational analysis methods, including machine learning, have had a significant impact on genomics and medicine. High-throughput gene expression analysis methods such as microarray technology and RNA sequencing produce enormous amounts of data. Traditionally, statistical methods are used for comparative analysis of gene expression data. However, more complex analyses, such as classification and the discovery of feature genes or sample observations, require sophisticated computational approaches. In this review, we compile various statistical and computational tools used in the analysis of expression microarray data. Although the methods are discussed in the context of expression microarray data, they can also be applied to the analysis of RNA sequencing or quantitative proteomics datasets. We specifically discuss methods for missing value (gene expression) imputation, feature gene scaling, selection and extraction of features for dimensionality reduction, and learning and analysis of expression data. We discuss the types of missing values and the methods and approaches usually employed in their imputation. We also discuss methods of data transformation and feature scaling, namely normalization and standardization. Various approaches used in feature selection and extraction are also reviewed. Lastly, learning and analysis methods, including class comparison, class prediction, and class discovery, are described in detail along with their evaluation parameters. We also describe the process of generating microarray gene expression data, along with the advantages and limitations of the above-mentioned techniques. We believe that this detailed review will help users select appropriate methods based on the type of data and the expected outcome.
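As a small illustration of the preprocessing steps named above, the following sketch fills missing expression values for one gene with the mean of its observed values and then applies the two feature-scaling schemes, min-max normalization and z-score standardization. It is a minimal example under stated assumptions, not the procedure of any particular tool covered in the review: the function names and the toy values are hypothetical, and row-mean imputation is only one simple choice among the imputation approaches discussed.

```python
import math


def impute_row_mean(row):
    """Replace missing values (None) with the mean of the observed values.

    Assumption: row-mean imputation is used purely for illustration.
    """
    observed = [v for v in row if v is not None]
    if not observed:
        raise ValueError("cannot impute a row with no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in row]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant feature: no spread to rescale, so map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if sd == 0:
        # Constant feature: every z-score is defined here as 0.
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


# Hypothetical expression values of one gene across four samples;
# None marks a missing measurement.
gene = [2.0, None, 4.0, 6.0]
filled = impute_row_mean(gene)
normalized = min_max_normalize(filled)
z_scores = standardize(filled)
```

Normalization bounds the values to a fixed range, whereas standardization centers them and expresses each value in units of standard deviation; which is appropriate depends on the downstream learning method.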