广义覆盖模型：结构、单调性与鸟枪法测序 (General Coverage Models: Structure, Monotonicity, and Shotgun Sequencing)

We study coverage processes in which each draw reveals a subset of $[n]$, and the goal is to determine the expected number of draws until all items are seen at least once. A classical example is the Coupon Collector's Problem, where each draw reveals exactly one item. Motivated by shotgun DNA sequencing, we introduce a model where each draw is a contiguous window of fixed length, in both cyclic and non-cyclic variants. We develop a unifying combinatorial tool that shifts the task of finding coverage time from probability, to a counting problem over families of subsets of $[n]$ that together contain all items, enabling exact calculation. Using this result, we obtain exact expressions for the window models. We then leverage past results on a continuous analogue of the cyclic window model to analyze the asymptotic behavior of both models. We further study what we call uniform $\ell$-regular models, where every draw has size $\ell$ and every item appears in the same number of admissible draws. We compare these to the batch sampling model, in which all $\ell$-subsets are drawn uniformly at random and present upper and lower bounds, which were also obtained independently by Berend and Sher. We conjecture, and prove for special cases, that this model maximizes the coverage time among all uniform $\ell$-regular models. Finally, we prove a universal upper bound on the entire class of uniform $\ell$-regular models, which illuminates the fact that many sampling models share the same leading asymptotic order, while potentially differing significantly in lower-order terms.

翻译：我们研究覆盖过程，其中每次抽样揭示集合$[n]$的一个子集，目标是确定所有元素至少出现一次所需的期望抽样次数。经典案例是优惠券收集问题，其中每次抽样恰好揭示一个元素。受鸟枪法DNA测序启发，我们引入一种模型，其中每次抽样为固定长度的连续窗口，包含循环与非循环两种变体。我们开发了一种统一的组合工具，将覆盖时间的计算任务从概率问题转化为对$[n]$子集族的计数问题，这些子集族共同包含所有元素，从而实现精确计算。利用该结果，我们获得了窗口模型的精确表达式。随后，我们借助循环窗口模型的连续类比已有结果，分析两种模型的渐近行为。我们进一步研究称为均匀$\ell$-正则模型，其中每次抽样规模为$\ell$且每个元素在可容许抽样中出现的次数相同。我们将这些模型与批量抽样模型进行比较，后者中所有$\ell$-子集均匀随机抽取，并给出上下界（该结果亦由Berend和Sher独立获得）。我们猜想并针对特殊情形证明，该模型在所有均匀$\ell$-正则模型中具有最大覆盖时间。最后，我们证明了均匀$\ell$-正则模型整个类别的普适上界，这揭示了众多抽样模型具有相同的主导渐近阶，而低阶项可能存在显著差异。

相关内容

MoDELS

关注 44

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

36+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日