Towards Better Bounds for Finding Quasi-Identifiers

We revisit the problem of finding small $\epsilon$-separation keys introduced by Motwani and Xu (2008). In this problem, the input is a collection of $m$-dimensional tuples $x_1,x_2,\ldots,x_n$. The goal is to find a small subset of coordinates that separates at least $(1-\epsilon){n \choose 2}$ pairs of tuples. They provided a fast algorithm that runs on $\Theta(m/\epsilon)$ tuples sampled uniformly at random. We show that the sample size can be reduced to $\Theta(m/\sqrt{\epsilon})$. Our algorithm also enjoys a faster running time. To obtain this result, we provide upper and lower bounds on the sample size needed to solve the following decision problem. Given a subset of coordinates $A$, reject if $A$ separates fewer than $(1-\epsilon){n \choose 2}$ pairs, and accept if $A$ separates all pairs. The algorithm must be correct with probability at least $1-\delta$ for all $A$. We show that for algorithms based on sampling:

- $\Theta(m/\sqrt{\epsilon})$ samples are sufficient and necessary so that $\delta \leq e^{-m}$, and
- $\Omega(\sqrt{\frac{\log m}{\epsilon}})$ samples are necessary so that $\delta$ is a constant.

Our analysis is based on a constrained version of the balls-into-bins problem. We believe our analysis may be of independent interest. We also study a related problem that asks for the following sketching algorithm: given parameters $\alpha$, $k$ and $\epsilon$, the algorithm takes a subset of coordinates $A$ of size at most $k$ and returns an estimate of the number of pairs left unseparated by $A$, accurate up to a $(1\pm\epsilon)$ factor whenever that number is at least $\alpha {n \choose 2}$. We show that even for constant $\alpha$ and constant success probability, such a sketching algorithm must use $\Omega(mk \log \epsilon^{-1})$ bits of space; on the other hand, uniform sampling yields a sketch of size $\Theta(\frac{mk \log m}{\alpha \epsilon^2})$ for this purpose.
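To make the decision problem concrete, the following is a minimal sketch of a sampling-based tester, not the authors' implementation. It assumes tuples are Python tuples, that $A$ is given as a list of coordinate indices, and that the sample is drawn without replacement; the function names and the constant `c` in the sample size are hypothetical choices made for illustration only. The tester accepts exactly when $A$ separates every pair inside the sample.

```python
import random
from collections import Counter
from math import ceil, comb, sqrt


def project(t, A):
    """Restrict tuple t to the coordinates listed in A."""
    return tuple(t[i] for i in A)


def count_unseparated_pairs(tuples, A):
    """Exact number of pairs of tuples that agree on every coordinate in A."""
    groups = Counter(project(t, A) for t in tuples)
    return sum(comb(size, 2) for size in groups.values())


def sample_size(m, eps, c=1.0):
    """Sample size of order m / sqrt(eps); the constant c is an assumption."""
    return ceil(c * m / sqrt(eps))


def sample_accepts(tuples, A, k, rng):
    """Accept iff A separates all pairs in a uniform sample of k tuples."""
    sample = rng.sample(tuples, min(k, len(tuples)))
    return count_unseparated_pairs(sample, A) == 0


# Toy data: coordinate 0 is a unique id, coordinates 1 and 2 are coarse.
data = [(i, i % 2, i % 5) for i in range(200)]
rng = random.Random(0)
k = sample_size(m=3, eps=0.01)

print(sample_accepts(data, [0], k, rng))  # [0] separates all pairs: accepted
print(sample_accepts(data, [1], k, rng))  # [1] leaves many pairs unseparated
```

In this toy data set, coordinate 1 takes only two values, so any sample of three or more tuples contains an unseparated pair and `[1]` is always rejected, while `[0]` separates every pair and is always accepted. A set $A$ that separates all pairs is never rejected by this tester, so the only possible error is accepting a set that leaves more than $\epsilon {n \choose 2}$ pairs unseparated.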
