A large number of inorganic and organic compounds can bind DNA and form complexes, and drug-related molecules are an important class among them. Changes in chromatin accessibility not only directly affect drug-DNA interactions, but also promote or inhibit the expression of critical genes associated with drug resistance by affecting the DNA-binding capacity of transcription factors (TFs) and transcriptional regulators. However, biological experimental techniques for measuring chromatin accessibility are expensive and time-consuming. In recent years, several kinds of computational methods have been proposed to identify accessible regions of the genome, but existing computational models mostly ignore the contextual information of bases in gene sequences. To address these issues, we propose a new solution named SemanticCAP. It introduces a gene language model that models the context of gene sequences and can therefore provide an effective representation of a given site in a gene sequence. We merge the features provided by the gene language model into our chromatin accessibility model, and we design several methods to make this feature fusion smoother. Compared with other systems on public benchmarks, our model shows better performance.