A single gene can encode different protein versions through a process called alternative splicing. Since proteins play major roles in cellular functions, aberrant splicing profiles can result in a variety of diseases, including cancers. Alternative splicing is determined by the gene's primary sequence and by other regulatory factors such as RNA-binding protein levels. With these as input, we formulate the prediction of RNA splicing as a regression task and build a new training dataset (CAPD) to benchmark learned models. We propose the discrete compositional energy network (DCEN), which leverages the hierarchical relationships between splice sites, junctions and transcripts to approach this task. For alternative splicing prediction, DCEN models the probability of each mRNA transcript through the energy values of its constituent splice junctions. These transcript probabilities are subsequently mapped to relative abundance values of key nucleotides, and the model is trained against ground-truth experimental measurements. Through our experiments on CAPD, we show that DCEN outperforms baselines and ablation variants.
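The compositional idea described above can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: a transcript's energy is taken as the sum of its junctions' energies, transcript probabilities come from a softmax over negative energies, and a splice site's relative abundance is the total probability of the transcripts that use it. The function names, junction labels, coordinates and energy values are all hypothetical.

```python
import math

def transcript_probabilities(junction_energies, transcripts):
    """Map junction energies to a probability for each transcript.

    Assumption: a transcript's energy is the sum of its junctions' energies,
    and lower energy means higher probability (softmax over negative energy).
    """
    energies = [sum(junction_energies[j] for j in t) for t in transcripts]
    e_min = min(energies)  # shift by the minimum for numerical stability
    weights = [math.exp(-(e - e_min)) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]

def site_abundance(probs, transcripts, junction_sites, site):
    """Relative abundance of a splice site (assumed definition): the summed
    probability of every transcript with a junction touching that site."""
    return sum(
        p
        for p, t in zip(probs, transcripts)
        if any(site in junction_sites[j] for j in t)
    )

# Hypothetical toy gene with three exons: one transcript includes the middle
# exon (junctions J12 and J23), the other skips it (junction J13).
junction_sites = {"J12": (100, 200), "J23": (300, 400), "J13": (100, 400)}
junction_energies = {"J12": 0.5, "J23": 0.5, "J13": 1.0}  # made-up values
transcripts = [["J12", "J23"], ["J13"]]

probs = transcript_probabilities(junction_energies, transcripts)
abundance_200 = site_abundance(probs, transcripts, junction_sites, 200)
abundance_100 = site_abundance(probs, transcripts, junction_sites, 100)
```

In this toy case both transcripts have the same total energy, so they are equally likely. Site 200 is used only by the inclusion transcript, while site 100 is used by both. In a learned model the junction energies would be predicted from the sequence and regulatory inputs rather than fixed by hand.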