In this paper, we consider the problem of partitioning a small data sample of size $n$ drawn from a mixture of $2$ sub-gaussian distributions. Our work is motivated by the application of clustering individuals according to their population of origin using markers, when the divergence between the two populations is small. We are interested in the case where individual features are of low average quality $\gamma$, and we want to use as few of them as possible to correctly partition the sample. We consider a semidefinite relaxation of an integer quadratic program that is formulated essentially as finding the maximum cut on a graph, where the edge weights in the cut represent dissimilarity scores between two nodes based on their features. A small simulation result in Blum, Coja-Oghlan, Frieze and Zhou (2007, 2009) shows that even when the sample size $n$ is small, by increasing the number of features $p$ so that $np = \Omega(1/\gamma^2)$, one can classify a mixture of two product populations using the spectral method therein with a success rate reaching an ``oracle'' curve. There, the success rate is the ratio between the number of correctly classified individuals and the sample size $n$, and the ``oracle'' was computed assuming that the distributions were known. In this work, we show the theoretical underpinning of this observed concentration of measure phenomenon in high dimensions, simultaneously for the semidefinite optimization goal and the spectral method, where the input is based on the Gram matrix computed from centered data. We allow a full range of tradeoffs between the sample size and the number of features such that the product of these two is lower bounded by $1/\gamma^2$, so long as the number of features $p$ is lower bounded by $1/\gamma$.
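The spectral approach mentioned in the abstract can be illustrated with a minimal sketch. It assumes a balanced mixture of two spherical Gaussians and partitions the sample by the sign of the leading eigenvector of the Gram matrix of the centered data. The function names `spectral_partition` and `success_rate`, the per-feature mean gap `delta`, and the chosen values of `n` and `p` are hypothetical choices for illustration. They are not taken from the paper, whose exact estimator may differ.

```python
import numpy as np


def spectral_partition(X):
    """Split the rows of X into two groups (illustrative sketch only).

    Uses the sign of the leading eigenvector of the Gram matrix of the
    centered data; this is an assumed simplification, not the paper's
    exact procedure.
    """
    Y = X - X.mean(axis=0, keepdims=True)  # center each feature over the sample
    G = Y @ Y.T                            # n x n Gram matrix of centered data
    _, eigvecs = np.linalg.eigh(G)         # eigenvalues returned in ascending order
    leading = eigvecs[:, -1]               # eigenvector of the largest eigenvalue
    return (leading > 0).astype(int)


def success_rate(labels, guess):
    """Fraction of correctly classified individuals, up to swapping the two labels."""
    agree = np.mean(labels == guess)
    return max(agree, 1.0 - agree)


# Hypothetical setup: few individuals, many weakly informative features.
rng = np.random.default_rng(0)
n, p, delta = 40, 2000, 0.3
labels = np.repeat([0, 1], n // 2)                          # balanced populations
means = np.where(labels[:, None] == 1, delta / 2, -delta / 2)  # shape (n, 1)
X = means + rng.standard_normal((n, p))                     # unit-variance Gaussian noise

guess = spectral_partition(X)
rate = success_rate(labels, guess)
print(f"success rate: {rate:.2f}")
```

Each individual feature here separates the two populations only weakly, since the mean gap is small relative to the unit noise, yet aggregating many features through the Gram matrix makes the partition recoverable from a small sample. Shrinking `p` or `delta` in this toy setup is one way to observe the success rate degrade.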