Hundreds of methods exist for analyzing mRNA-sequencing data, but most of them focus on a small number of genes. In this study, we propose an approach that reduces the analysis of several thousand genes to the analysis of several clusters. The gene list is first enriched with information from open databases. The resulting descriptions are then encoded as vectors using a pretrained language model (BERT) and several text-processing approaches. The encoded gene functions then undergo dimensionality reduction and clustering. To find the most efficient pipeline, we analyzed 180 pipeline variants that use different methods in the major pipeline steps. Performance was evaluated with clustering indices and expert review of the results.
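The encode, reduce, cluster, and evaluate steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: a bag-of-words count vector stands in for the BERT encoding, PCA and k-means are assumed as one possible choice of dimensionality reduction and clustering method, and the silhouette coefficient is assumed as one possible clustering index. The gene identifiers and descriptions are invented toy inputs, and all function names are hypothetical.

```python
import numpy as np

# Toy functional descriptions (invented for illustration, not real annotations).
descriptions = {
    "gene_1": "dna repair dna damage response nucleus",
    "gene_2": "dna repair dna damage checkpoint nucleus",
    "gene_3": "dna repair dna replication response nucleus",
    "gene_4": "membrane transport ion channel membrane potential",
    "gene_5": "membrane transport ion pump membrane potential",
    "gene_6": "membrane transport glucose channel membrane potential",
}

def encode(texts):
    """Bag-of-words count vectors (assumed stand-in for BERT embeddings)."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(texts), len(vocab)))
    for row, t in enumerate(texts):
        for w in t.lower().split():
            X[row, index[w]] += 1.0
    return X

def reduce_pca(X, n_components=2):
    """Project mean-centered data onto its top principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def kmeans(X, k, n_iter=50):
    """Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new, centers):
            break
        centers = new
    return labels

def silhouette(X, labels):
    """Mean silhouette coefficient; assumes each cluster has at least two points."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself from its own cluster
        a = dist[i, same].mean()
        b = min(dist[i, labels == c].mean()
                for c in set(labels.tolist()) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

genes = list(descriptions)
embedded = reduce_pca(encode(list(descriptions.values())), n_components=2)
labels = kmeans(embedded, k=2)
score = silhouette(embedded, labels)

for gene, label in zip(genes, labels):
    print(gene, "-> cluster", int(label))
print("mean silhouette:", round(score, 3))
```

On this toy input the DNA-repair descriptions and the membrane-transport descriptions land in separate clusters. Swapping `encode`, `reduce_pca`, or `kmeans` for another method while keeping the rest fixed is one way to enumerate pipeline variants of the kind the abstract compares.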