Lossy Compressor preserving variant calling through Extended BWT - Zhuanzhi Papers

April 17, 2023

Veronica Guerrini,Felipe A. Louza,Giovanna Rosone

From arXiv; Proceedings of the 15th International Joint Conference on Biomedical Engineering Systems and Technologies

FASTQ is a standard format for storing the output of high-throughput sequencing experiments. Each record comprises three main components: (i) headers, (ii) bases (nucleotide sequences), and (iii) quality scores. FASTQ files are widely used for variant calling, where sequencing data are mapped to a reference genome to discover variants that may be used for further analysis. Many specialized compressors exploit redundancy in FASTQ data, but they focus on only one component, either the bases or the quality scores. In this paper we consider the novel problem of lossy, reference-free compression of FASTQ data that modifies both components at the same time while preserving the important information of the original FASTQ. We introduce a general strategy, based on the Extended Burrows-Wheeler Transform (EBWT) and positional clustering, and we present implementations in both internal memory and external memory. Experimental results show that the lossy compression performed by our tool achieves good compression while preserving information relevant to variant calling better than competing tools. Availability: the software is freely available at https://github.com/veronicaguerrini/BFQzip.
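To make the two ingredients named in the abstract concrete, the following is a minimal Python sketch. It is not the BFQzip implementation. It splits FASTQ text into the three components and then computes a naive EBWT of the bases. Several things here are assumptions for illustration: the names `FastqRecord`, `parse_fastq` and `naive_ebwt`, the Phred+33 quality encoding, unwrapped four-line records, and the end-marker variant of the EBWT, in which each read gets its own terminator that sorts before every base, with ties broken by read index. Positional clustering is not shown.

```python
from typing import List, NamedTuple


class FastqRecord(NamedTuple):
    header: str       # (i) read identifier, without the leading '@'
    bases: str        # (ii) nucleotide sequence
    quals: List[int]  # (iii) quality scores decoded to integers


def parse_fastq(text: str, offset: int = 33) -> List[FastqRecord]:
    """Parse four-line FASTQ records (assumes Phred+33, no line wrapping)."""
    lines = [ln for ln in text.splitlines() if ln]
    records = []
    for i in range(0, len(lines), 4):
        header, bases, _plus, qual = lines[i:i + 4]
        quals = [ord(c) - offset for c in qual]
        records.append(FastqRecord(header[1:], bases, quals))
    return records


def naive_ebwt(reads: List[str]) -> str:
    """Naive EBWT of a read collection (end-marker variant, for illustration).

    Sorts every suffix of every read explicitly, so it only suits tiny inputs.
    """
    suffixes = []
    for idx, read in enumerate(reads):
        # The end marker of read idx is encoded as (0, idx): it sorts before
        # every base (1, ch), and markers are ordered by read index.
        symbols = [(1, ch) for ch in read] + [(0, idx)]
        for pos in range(len(symbols)):
            # The output symbol is the character preceding the suffix; the
            # suffix that is the whole read is preceded by its end marker.
            preceding = read[pos - 1] if pos > 0 else "$"
            suffixes.append((symbols[pos:], preceding))
    suffixes.sort(key=lambda item: item[0])
    return "".join(preceding for _, preceding in suffixes)


if __name__ == "__main__":
    sample = "@r1\nACG\n+\nIII\n@r2\nACT\n+\nI5!\n"
    records = parse_fastq(sample)
    print(records[1].quals)                        # [40, 20, 0]
    print(naive_ebwt([r.bases for r in records]))  # GT$$AACC
```

In the output `GT$$AACC`, symbols that precede similar contexts end up adjacent (`AA`, `CC`). This grouping is the property that BWT-based methods rely on. A practical tool would build the transform with a dedicated internal-memory or external-memory algorithm instead of sorting suffixes directly.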
