Recent advances in Bayesian models for random partitions have led to the formulation and exploration of Exchangeable Sequences of Clusters (ESC) models. Under ESC models, it is the cluster sizes that are exchangeable, rather than the observations themselves. This property is particularly useful for obtaining microclustering behavior, whereby cluster sizes grow sublinearly in the number of observations, as is common in applications such as record linkage, sparse networks, and genomics. Unfortunately, the exchangeable clusters property comes at the cost of projectivity. As a consequence, in contrast to more traditional Dirichlet process or Pitman-Yor process mixture models, samples from the prior of an ESC model cannot easily be generated sequentially and instead require rejection or importance sampling. In this work, drawing on connections between ESC models and discrete renewal theory, we obtain closed-form expressions for certain ESC models and develop methods for sampling from the prior of these models that are faster than the existing state of the art. In the process, we establish analytical expressions for the distribution of the number of clusters under ESC models, which was unknown prior to this work.
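To make the sampling difficulty concrete, the following is a minimal sketch of the baseline rejection idea mentioned above, not the faster method developed in this work. It assumes a fixed, known cluster-size distribution `mu` on the positive integers (ESC models generally place a prior on it), and the function names `renewal_hit_probabilities` and `sample_cluster_sizes` are hypothetical. Sizes are drawn i.i.d. from `mu` until their running total reaches or exceeds `n`, and the draw is kept only if the total equals `n` exactly. The renewal recursion gives the probability that a single proposal is accepted.

```python
import random


def renewal_hit_probabilities(mu, n):
    """Return u[0..n], where u[m] is the probability that the partial sums
    of i.i.d. cluster sizes drawn from mu ever equal m (renewal recursion).

    mu: dict mapping positive integer size -> probability (assumed fixed).
    """
    u = [0.0] * (n + 1)
    u[0] = 1.0
    for m in range(1, n + 1):
        # Condition on the size s of the first cluster.
        u[m] = sum(p * u[m - s] for s, p in mu.items() if 1 <= s <= m)
    return u


def sample_cluster_sizes(mu, n, rng, max_tries=100_000):
    """Rejection sampler: propose i.i.d. sizes until the total reaches n,
    and accept only if the total is exactly n."""
    sizes, probs = zip(*sorted(mu.items()))
    for _ in range(max_tries):
        total, draw = 0, []
        while total < n:
            s = rng.choices(sizes, weights=probs)[0]
            draw.append(s)
            total += s
        if total == n:  # the renewal process hit n exactly
            return draw
    raise RuntimeError("no proposal accepted within max_tries")


if __name__ == "__main__":
    mu = {1: 0.5, 2: 0.5}  # illustrative size distribution, not from the paper
    n = 10
    print("acceptance probability:", renewal_hit_probabilities(mu, n)[n])
    print("cluster sizes:", sample_cluster_sizes(mu, n, random.Random(0)))
```

Each proposal is accepted with probability `u[n]`, so the expected number of proposals is `1 / u[n]`. This cost is what motivates looking for more direct samplers.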