Variable selection is crucial in high-dimensional omics-based analyses, since it is biologically reasonable to assume that only a subset of non-noisy features contributes to the data structure. However, the task is particularly hard in an unsupervised setting, and a priori ad hoc variable selection is still very common, despite its evident drawbacks and lack of reproducibility. We propose a Bayesian variable selection approach for rank-based transcriptomic analysis. Using data rankings instead of the actual continuous measurements makes conclusions more robust than those of classical statistical methods, and embedding variable selection into the inferential task allows complete reproducibility. Specifically, we develop a novel extension of the Bayesian Mallows model for variable selection that allows for a full probabilistic analysis, leading to coherent quantification of uncertainties. We test our approach on simulated data using several data-generating procedures, demonstrating the versatility and robustness of the method under different scenarios. We then use the novel approach to analyse genome-wide RNAseq gene expression data from ovarian cancer samples: several genes that affect cancer development are correctly detected in a completely unsupervised fashion, showing the method's usefulness for signature discovery in cancer genomics. Moreover, the ability to quantify uncertainty plays a key role in the subsequent biological investigation.
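The robustness argument for rankings can be illustrated with a minimal sketch. It is not the authors' pipeline: the helper `to_ranks` and the expression counts are hypothetical, chosen only to show that the ordering of genes within a sample is unchanged by a monotone transformation (such as a log transform) or by an extreme outlier, even though the continuous values change substantially.

```python
import math


def to_ranks(values):
    """Rank values from 1 (largest) to n (smallest); ties keep input order."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


# Hypothetical expression counts for five genes in one sample (illustrative only).
counts = [1200.0, 15.0, 340.0, 0.5, 98000.0]

# Ranks of the raw counts.
ranks_raw = to_ranks(counts)

# Ranks after a monotone transformation of the same counts.
ranks_log = to_ranks([math.log2(c + 1.0) for c in counts])

# Ranks after inflating the largest count by a factor of 1000.
counts_outlier = counts[:-1] + [9.8e7]
ranks_outlier = to_ranks(counts_outlier)

print(ranks_raw, ranks_log, ranks_outlier)
```

All three rank vectors coincide, so any analysis that depends only on the rankings gives the same result for the raw, log-transformed, and outlier-contaminated measurements.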