With the rapid development of gene sequencing technology, gene data have grown explosively, and their storage has become an important issue. Traditional gene data compression relies on general-purpose software such as Gzip, which fails to exploit the dependencies within nucleotide sequences. Recently, many researchers have begun to investigate deep-learning-based gene data compression methods. In this paper, we propose a transformer-based gene compression method named GeneFormer. Specifically, we first introduce a modified transformer structure to fully exploit nucleotide sequence dependencies. Then, we propose fixed-length parallel grouping to accelerate the decoding of our autoregressive model. Experimental results on real-world datasets show that our method saves 29.7% in bit rate compared with the state-of-the-art method, while decoding significantly faster than all existing learning-based gene compression methods.
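To make the two ideas named above concrete, the following is a minimal sketch (not the authors' implementation) of an autoregressive transformer over nucleotide tokens combined with fixed-length parallel grouping: the sequence is cut into equal-length groups that are modeled, and can later be entropy-decoded, as a batch rather than as one long symbol-by-symbol chain. All model sizes, the group length, and the class and function names below are illustrative assumptions.

```python
# Sketch of fixed-length parallel grouping with a small causal transformer.
# Per-position probabilities would feed an arithmetic coder in a real codec.
import torch
import torch.nn as nn

VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3}
BOS = len(VOCAB)  # start token so every position has a left context

class TinyGeneModel(nn.Module):
    def __init__(self, d_model=64, n_heads=4, n_layers=2, group_len=128):
        super().__init__()
        self.group_len = group_len
        self.embed = nn.Embedding(len(VOCAB) + 1, d_model)
        self.pos = nn.Embedding(group_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, len(VOCAB))

    def forward(self, tokens):
        # tokens: (num_groups, L) with BOS prepended inside each group
        L = tokens.size(1)
        x = self.embed(tokens) + self.pos(torch.arange(L, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(L).to(tokens.device)
        h = self.encoder(x, mask=mask)  # causal mask keeps the model autoregressive
        return self.head(h)             # logits for the next nucleotide at each step

def group_sequence(seq, group_len):
    """Split a nucleotide string into fixed-length groups, each starting with BOS."""
    ids = [VOCAB[c] for c in seq if c in VOCAB]
    groups = []
    for i in range(0, len(ids), group_len - 1):
        chunk = ids[i : i + group_len - 1]
        chunk = [BOS] + chunk + [0] * (group_len - 1 - len(chunk))  # pad last group
        groups.append(chunk)
    return torch.tensor(groups)

# Usage: probabilities for all groups come from one batched pass, so decoding
# proceeds group-by-group in parallel instead of over the whole genome serially.
model = TinyGeneModel(group_len=16)
batch = group_sequence("ACGTACGTACGGTTAC" * 8, group_len=16)
with torch.no_grad():
    probs = model(batch).softmax(dim=-1)  # (num_groups, 16, 4)
print(probs.shape)
```

The grouping trades a small amount of cross-group context (and hence bit rate) for the ability to decode all groups concurrently, which is where the claimed decoding speedup of the autoregressive model comes from.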