Benchmarking database performance for genomic data

Genomic regions represent features such as gene annotations, transcription factor binding sites and epigenetic modifications. Genomic operations such as identifying overlapping or non-overlapping regions, or finding the nearest gene annotation, are common research needs. The data can be stored in a database system for easy management; however, no database currently provides a comprehensive built-in algorithm for identifying overlapping regions. I have therefore developed a region-mapping (RegMap) SQL-based algorithm to perform genomic operations, and have used it to benchmark the performance of different databases. The benchmarks showed that PostgreSQL extracts overlapping regions much faster than MySQL. PostgreSQL also performed better on insertion and data upload, although the general search capability of the two databases was almost equivalent. In addition, the algorithm was used to compute pairwise overlaps among >1000 datasets of transcription factor binding sites and histone marks collected from previous publications, and HNF4G was found to co-locate significantly with the cohesin subunit STAG1 (SA1).
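To make the overlap operation concrete, the following is a minimal sketch of a generic SQL interval-overlap join. It is not the RegMap algorithm itself, whose details are not given here. The table names (`peaks_a`, `peaks_b`), the column names, the helper `find_overlaps`, and the use of closed, inclusive coordinates are all assumptions made for illustration, and SQLite stands in for PostgreSQL or MySQL only so that the example is self-contained.

```python
import sqlite3


def find_overlaps(conn, table_a, table_b):
    """Return (name_a, name_b) pairs of regions sharing at least one base.

    Assumes closed (inclusive) start/end coordinates. Two regions on the
    same chromosome overlap when each one starts no later than the other ends.
    Table names are interpolated directly, so pass trusted names only.
    """
    query = f"""
        SELECT a.name, b.name
        FROM {table_a} AS a
        JOIN {table_b} AS b
          ON a.chrom = b.chrom
         AND a.start_pos <= b.end_pos
         AND b.start_pos <= a.end_pos
        ORDER BY a.name, b.name
    """
    return conn.execute(query).fetchall()


# Hypothetical toy data: two small region sets in an in-memory database.
conn = sqlite3.connect(":memory:")
for table in ("peaks_a", "peaks_b"):
    conn.execute(
        f"CREATE TABLE {table} "
        "(name TEXT, chrom TEXT, start_pos INTEGER, end_pos INTEGER)"
    )
conn.executemany(
    "INSERT INTO peaks_a VALUES (?, ?, ?, ?)",
    [("a1", "chr1", 100, 200), ("a2", "chr1", 500, 600), ("a3", "chr2", 100, 200)],
)
conn.executemany(
    "INSERT INTO peaks_b VALUES (?, ?, ?, ?)",
    [("b1", "chr1", 150, 250), ("b2", "chr1", 700, 800), ("b3", "chr2", 201, 300)],
)

overlaps = find_overlaps(conn, "peaks_a", "peaks_b")
print(overlaps)  # [('a1', 'b1')]
```

Only `a1` and `b1` share bases: `a2` ends before `b2` starts, and `a3` ends at 200 while `b3` starts at 201. A plain join like this compares every pair of regions on a chromosome, so on large datasets its speed depends heavily on how each database indexes and plans the range conditions, which is the kind of difference a benchmark across database systems would expose.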
