Discrete Determinantal Point Processes (DPPs) have a wide array of potential applications for subsampling datasets. They are, however, held back in some cases by the high cost of sampling. In the worst case, the sampling cost scales as $O(n^3)$, where $n$ is the number of elements of the ground set. A popular workaround to this prohibitive cost is to sample DPPs defined by low-rank kernels. In such cases, the cost of standard sampling algorithms scales as $O(np^2 + nm^2)$, where $m$ is the (average) number of samples of the DPP (usually $m \ll n$) and $p$ ($m \leq p \leq n$) is the rank of the kernel used to define the DPP. The first term, $O(np^2)$, comes from an SVD-like step. We focus here on the second term of this cost, $O(nm^2)$, and show that it can be brought down to $O(nm + m^3 \log m)$ without any loss of exactness in the sampling. In practice, we observe substantial speedups compared to the classical algorithm as soon as $n > 1{,}000$. The algorithm described here is a close variant of the standard algorithm for sampling continuous DPPs, and uses rejection sampling. In the specific case of projection DPPs, we also show that any additional sample can be drawn in time $O(m^3 \log m)$. Finally, an interesting by-product of the analysis is that a realisation from a DPP is typically contained in a subset of size $O(m \log m)$ formed using i.i.d. leverage-score sampling.
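To make the rejection idea concrete for the projection case, here is a minimal sketch, not the paper's exact algorithm. It assumes the kernel is $K = UU^\top$ with $U$ an $n \times m$ matrix with orthonormal columns; the function name `sample_projection_dpp_rejection` is hypothetical. Candidates are proposed i.i.d. from the leverage scores $\|u_j\|^2$ and accepted with probability equal to the fraction of $\|u_j\|^2$ left after projecting out the rows already selected. Computing the leverage scores costs $O(nm)$, and each proposal costs at most $O(m^2)$.

```python
import numpy as np

def sample_projection_dpp_rejection(U, rng):
    """Draw one sample of the projection DPP with kernel K = U @ U.T.

    Assumes U is (n, m) with orthonormal columns (illustrative sketch).
    """
    n, m = U.shape
    lev = np.einsum("ij,ij->i", U, U)      # leverage scores, O(nm)
    cdf = np.cumsum(lev)
    cdf /= cdf[-1]                         # proposal CDF: item j w.p. lev[j] / m
    Q = np.zeros((m, m))                   # rows: orthonormal basis of selected rows
    sample, chosen = [], set()
    while len(sample) < m:
        k = len(sample)
        # Propose j i.i.d. from the leverage-score distribution, O(log n).
        j = min(int(np.searchsorted(cdf, rng.random(), side="right")), n - 1)
        if j in chosen:                    # conditional probability is exactly zero
            continue
        # Residual of row j after projecting out the selected rows, O(mk).
        r = U[j] - Q[:k].T @ (Q[:k] @ U[j])
        # Accept with probability ||r||^2 / ||u_j||^2, which is at most 1.
        if rng.random() * lev[j] < r @ r:
            sample.append(j)
            chosen.add(j)
            Q[k] = r / np.linalg.norm(r)   # one Gram-Schmidt step
    return sample

# Usage: an orthonormal U obtained from a QR factorisation of a random matrix.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((2000, 5)))
print(sample_projection_dpp_rejection(U, rng))
```

With $k$ items already selected, a proposal is accepted with probability $(m-k)/m$, so the expected total number of proposals is $m(1 + 1/2 + \dots + 1/m) \approx m \log m$, which is where a cost of order $m^3 \log m$ beyond the leverage-score computation would come from in this sketch. A single Gram-Schmidt pass is used for brevity; a more careful implementation might re-orthogonalise for numerical stability.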