The rapid growth of genetic databases means that improvements in their data compression translate into large savings, which calls for better yet inexpensive statistical models. This article proposes automated optimization of such models, e.g. Markov-like ones, focusing on context binning and model clustering. While it is common to reduce a context by cutting its low bits, the proposed context binning optimizes this reduction as a lookup table: state = bin[context], where the state determines the probability distribution. This way nearly all useful information can be extracted, even from very large contexts, into a small number of states. Model clustering applies k-means clustering in the space of general statistical models, which allows a few models (the cluster centroids) to be optimized and then chosen e.g. separately for each read. Some adaptivity techniques for handling data non-stationarity are also briefly discussed. This article is a work in progress, to be expanded in the future.
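To make the idea of a tabled reduction state = bin[context] concrete, here is a minimal sketch in Python. It is not the article's optimization procedure: the Lloyd-style alternation below, the additive smoothing constant `eps`, and all function names (`context_counts`, `fit_bins`, `total_bits`) are assumptions chosen for illustration. Each context is assigned to the state whose distribution encodes its observed next symbols most cheaply, and each state's distribution is then refitted from the pooled counts of its contexts.

```python
import math
import random
from collections import defaultdict

ALPHABET = "ACGT"

def context_counts(seq, order):
    """Count next-symbol occurrences for each context of `order` preceding symbols."""
    counts = defaultdict(lambda: [0] * len(ALPHABET))
    for i in range(order, len(seq)):
        counts[seq[i - order:i]][ALPHABET.index(seq[i])] += 1
    return counts

def normalize(c, eps=0.5):
    """Turn counts into a probability distribution (additive smoothing, an assumption)."""
    total = sum(c) + eps * len(c)
    return [(x + eps) / total for x in c]

def cost_bits(c, p):
    """Bits needed to encode symbols with counts `c` using distribution `p`."""
    return -sum(x * math.log2(q) for x, q in zip(c, p))

def fit_bins(counts, n_states, iters=20, seed=0):
    """Illustrative Lloyd-style fit of the table bin_of[context] -> state."""
    rng = random.Random(seed)
    contexts = sorted(counts)
    # Initialise the states from the distributions of randomly chosen contexts.
    states = [normalize(counts[c]) for c in rng.sample(contexts, n_states)]
    bin_of = {}
    for _ in range(iters):
        # Assignment step: each context goes to the state with the lowest coding cost.
        bin_of = {c: min(range(n_states),
                         key=lambda s: cost_bits(counts[c], states[s]))
                  for c in contexts}
        # Update step: refit each state from the pooled counts of its contexts.
        pooled = [[0] * len(ALPHABET) for _ in range(n_states)]
        for c, s in bin_of.items():
            for j, x in enumerate(counts[c]):
                pooled[s][j] += x
        states = [normalize(p) for p in pooled]
    return bin_of, states

def total_bits(counts, bin_of, states):
    """Total coding cost of the data when every context uses its state's distribution."""
    return sum(cost_bits(counts[c], states[bin_of[c]]) for c in counts)

# Toy usage on a periodic sequence: four order-2 contexts reduced to two states.
seq = "ACGT" * 200
counts = context_counts(seq, order=2)
bin_of, states = fit_bins(counts, n_states=2)
state = bin_of["AC"]  # state = bin[context]
print(state, states[state])
```

In a real coder the table would be fitted once on training data and then used as a plain array lookup during encoding and decoding, so the per-symbol cost stays close to that of a low-order model even when the context is large.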