Adenocarcinoma and squamous cell carcinoma constitute approximately 40% and 30% of all lung cancer cases, respectively, and display broad heterogeneity in their clinical and molecular responses to therapy. Molecular subtyping has enabled precision medicine to overcome these challenges and has provided significant biological insights for predicting prognosis and improving clinical decision-making. Over the past decade, conventional machine learning (ML) algorithms and deep learning (DL)-based convolutional neural networks (CNNs) have been adopted for classifying cancer subtypes from gene expression datasets. However, these methods are potentially biased toward the identification of cancer biomarkers. Recently proposed transformer-based architectures that leverage the self-attention mechanism encode high-throughput gene expression data and learn representations that are computationally complex and parametrically expensive. Moreover, compared to the datasets used in natural language processing applications, gene expression datasets comprise several hundreds of thousands of genes measured in a limited number of observations, making it difficult to train transformers efficiently for bioinformatics applications. Hence, we propose an end-to-end deep learning approach, Gene Transformer, which addresses the complexity of high-dimensional gene expression with a multi-head self-attention module that identifies relevant biomarkers across multiple cancer subtypes, without the feature selection that current classification algorithms require as a prerequisite. The proposed architecture achieved improved overall performance on all evaluation metrics and made fewer misclassification errors than commonly used traditional classification algorithms. The classification results show that Gene Transformer can be an efficient approach for classifying cancer subtypes, indicating that any improvement in deep learning models in computational biology can also be reflected well in this domain.
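
The abstract does not specify how Gene Transformer tokenizes expression values or how its layers are arranged, so the following is only a minimal NumPy sketch of the general idea: multi-head self-attention applied to a gene expression vector, followed by pooling and a softmax classifier. Every name here (`init_params`, `classify`, `w_embed`, `gene_embed`), the per-gene embedding scheme, the mean pooling, and the random, untrained weights are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(tokens, w_q, w_k, w_v, w_o, num_heads):
    """Scaled dot-product self-attention with several heads.

    tokens: (n_tokens, d_model). Returns the attended tokens, shape
    (n_tokens, d_model), and the attention weights, shape
    (num_heads, n_tokens, n_tokens).
    """
    n, d_model = tokens.shape
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads

    def split_heads(x):
        # (n, d_model) -> (num_heads, n, d_head)
        return x.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(tokens @ w_q)
    k = split_heads(tokens @ w_k)
    v = split_heads(tokens @ w_v)

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    attn = softmax(scores, axis=-1)          # each row sums to 1
    context = attn @ v                       # (num_heads, n, d_head)
    merged = context.transpose(1, 0, 2).reshape(n, d_model)
    return merged @ w_o, attn


def init_params(n_genes, d_model, n_classes, seed=0):
    """Random (untrained) parameters; hypothetical layout for illustration."""
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(d_model)
    return {
        "w_embed": rng.normal(size=d_model) * scale,
        "gene_embed": rng.normal(size=(n_genes, d_model)) * scale,
        "w_q": rng.normal(size=(d_model, d_model)) * scale,
        "w_k": rng.normal(size=(d_model, d_model)) * scale,
        "w_v": rng.normal(size=(d_model, d_model)) * scale,
        "w_o": rng.normal(size=(d_model, d_model)) * scale,
        "w_cls": rng.normal(size=(d_model, n_classes)) * scale,
        "b_cls": np.zeros(n_classes),
    }


def classify(expression, params, num_heads):
    """Map one sample's expression vector (n_genes,) to class probabilities."""
    # Assumption: one token per gene, built from its scalar expression value
    # plus a learned per-gene embedding.
    tokens = expression[:, None] * params["w_embed"][None, :] + params["gene_embed"]
    attended, attn = multi_head_self_attention(
        tokens, params["w_q"], params["w_k"], params["w_v"], params["w_o"], num_heads
    )
    pooled = attended.mean(axis=0)           # average over genes
    logits = pooled @ params["w_cls"] + params["b_cls"]
    return softmax(logits), attn


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    params = init_params(n_genes=16, d_model=8, n_classes=2)
    sample = rng.normal(size=16)             # toy expression profile
    probs, attn = classify(sample, params, num_heads=2)
    # Average attention each gene receives, as a rough relevance score.
    gene_scores = attn.mean(axis=(0, 1))
    print(probs, gene_scores.argsort()[::-1][:5])
```

In this sketch, averaging the attention weights over heads and query positions gives one score per gene, which is one simple way to read candidate biomarkers off a trained model. Full self-attention costs memory quadratic in the number of genes, which reflects the training difficulty the abstract describes for high-dimensional expression data.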