Discrete Determinantal Point Processes (DPPs) have a wide array of potential applications for subsampling datasets. In some cases, however, they are held back by the high cost of sampling. In the worst case, the sampling cost scales as O(n^3), where n is the number of elements of the ground set. A popular workaround to this prohibitive cost is to sample DPPs defined by low-rank kernels. In such cases, the cost of standard sampling algorithms scales as O(np^2 + nm^2), where m is the (average) size of a sample of the DPP (usually m << n) and p is the rank of the kernel used to define the DPP (m \leq p \leq n). The first term, O(np^2), comes from an SVD-like step. We focus here on the second term of this cost, O(nm^2), and show that it can be brought down to O(nm + m^3 log m) without any loss of exactness in the sampling. In practice, we observe very substantial speedups over the classical algorithm as soon as n > 1000. The algorithm described here is a close variant of the standard algorithm for sampling continuous DPPs, and uses rejection sampling. In the specific case of projection DPPs, we also show that any additional sample can be drawn in time O(m^3 log m). Finally, an interesting by-product of the analysis is that a realisation from a DPP is typically contained in a subset of size O(m log m) formed by i.i.d. sampling according to leverage scores.
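To make the two cost regimes concrete, here is a minimal Python/NumPy sketch for the projection case, where the kernel is K = U U^T and U is an n x m matrix with orthonormal columns. It is an illustration under stated assumptions, not the paper's implementation. The function names (`leverage_scores`, `sample_projection_dpp_classical`, `sample_projection_dpp_rejection`) are hypothetical. The first sampler is the usual sequential scheme, which updates all n rows at each of the m steps and hence costs O(nm^2). The second is one plausible way to use rejection sampling with leverage-score proposals; whether it matches the paper's procedure step for step is an assumption.

```python
import numpy as np

def leverage_scores(U):
    # Squared row norms of U; for orthonormal columns they sum to m.
    return np.sum(U**2, axis=1)

def sample_projection_dpp_classical(U, rng):
    """Sequential sampler: O(nm) work per step, m steps, so O(nm^2)."""
    n, m = U.shape
    V = U.copy()
    sample = []
    for _ in range(m):
        # Conditional inclusion weights are the residual squared row norms.
        p = np.clip(np.sum(V**2, axis=1), 0.0, None)
        p /= p.sum()
        i = int(rng.choice(n, p=p))
        sample.append(i)
        # Project every row onto the orthogonal complement of row i.
        v = V[i] / np.linalg.norm(V[i])
        V = V - np.outer(V @ v, v)
    return sample

def sample_projection_dpp_rejection(U, rng):
    """Rejection sampler (assumed variant): propose from leverage scores,
    accept with probability residual / leverage. Only the proposed row
    is touched, so no step needs an O(n) update of all residuals."""
    n, m = U.shape
    lev = leverage_scores(U)
    q = lev / lev.sum()
    Q = np.zeros((0, m))  # orthonormal basis of the span of selected rows
    sample = []
    while len(sample) < m:
        # A real implementation would draw proposals in batches;
        # a single draw per iteration keeps the sketch short.
        j = int(rng.choice(n, p=q))
        u = U[j]
        r = u - Q.T @ (Q @ u)  # residual of row j after projection
        if rng.random() * lev[j] < r @ r:
            sample.append(j)
            Q = np.vstack([Q, r / np.linalg.norm(r)])
    return sample

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical test kernel: orthonormalise a random n x m matrix.
    U, _ = np.linalg.qr(rng.standard_normal((2000, 10)))
    print(sorted(sample_projection_dpp_classical(U, rng)))
    print(sorted(sample_projection_dpp_rejection(U, rng)))
```

In this sketch, once k items have been selected a proposal is accepted with probability (m - k)/m, so the expected number of proposals is m(1 + 1/2 + ... + 1/m), which grows like m log m. Each proposal costs at most O(m^2), and computing the leverage scores costs O(nm).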