Locating the promoter region in DNA sequences is of paramount importance in bioinformatics. The problem has been widely studied in the literature but is not yet fully resolved. Some researchers have presented remarkable results using convolutional networks, which allow features to be extracted automatically from a DNA chain. However, a universal architecture that generalizes to several organisms has not yet been achieved, which requires researchers to seek new architectures and hyperparameters for each new organism evaluated. In this work, we propose a versatile architecture, based on a capsule network, that can accurately identify promoter sequences in raw DNA data from seven different organisms, both eukaryotic and prokaryotic. Our model, CapsProm, could assist in transferring learning between organisms and expand its applicability. Furthermore, CapsProm showed competitive results, outperforming the baseline method in terms of F1-score on five of the seven tested datasets. The models and source code are available at https://github.com/lauromoraes/CapsNet-promoter.
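As a minimal illustration of two notions mentioned above, the sketch below shows one common way to turn a raw DNA string into numeric input for a neural network, and how an F1-score is computed from binary promoter/non-promoter predictions. The one-hot encoding, the `ACGT` ordering, and the function names `one_hot` and `f1_score` are assumptions made for illustration only; they are not taken from the CapsProm source code.

```python
from typing import List, Sequence

# Assumed nucleotide ordering; the abstract does not specify an encoding.
ALPHABET = "ACGT"


def one_hot(seq: str) -> List[List[int]]:
    """Encode a DNA string as one 4-element indicator vector per base.

    Symbols outside ALPHABET (e.g. 'N') become an all-zero vector.
    """
    return [[1 if base == ref else 0 for ref in ALPHABET] for base in seq.upper()]


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(one_hot("TATA"))
    print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
```

F1 is often preferred over plain accuracy when the two classes are imbalanced, because it ignores true negatives and penalizes both missed promoters and false alarms.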