We consider the problem of identifying the subset $\mathcal{S}^{\gamma}_{\mathcal{P}}$ of elements in the support of an underlying distribution $\mathcal{P}$ whose probability is larger than a given threshold $\gamma$, by actively querying an oracle to gain information about a sequence $X_1, X_2, \ldots$ of i.i.d. samples drawn from $\mathcal{P}$. We consider two query models: (a) each query is an index $i$, and the oracle returns the value $X_i$; and (b) each query is a pair $(i,j)$, and the oracle gives a binary answer indicating whether $X_i = X_j$. For each of these query models, we design sequential estimation algorithms that, in each round, either decide which query to send to the oracle based on the entire history of responses, or decide to stop and output an estimate of $\mathcal{S}^{\gamma}_{\mathcal{P}}$, which is required to be correct with a pre-specified high probability. We provide upper bounds on the query complexity of the algorithms for any distribution $\mathcal{P}$, and also derive lower bounds on the optimal query complexity under the two query models. We also consider noisy versions of the two query models and propose robust estimators that can effectively counter the noise in the oracle responses.
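To make query model (a) concrete, the following is a minimal Python sketch of a sequential estimator with a confidence-based stopping rule. It is an illustration under stated assumptions, not the algorithm analyzed in the paper: the names `make_oracle` and `estimate_heavy_set`, the known upper bound `k` on the support size, and the Hoeffding-style radius with a union bound over elements and rounds are all assumptions introduced here.

```python
import math
import random
from collections import Counter


def make_oracle(support, probs, seed=0):
    """Hypothetical oracle for query model (a): query(i) returns X_i."""
    rng = random.Random(seed)
    samples = []  # lazily generated i.i.d. sequence X_1, X_2, ...

    def query(i):
        # Indices are 1-based; extend the sequence until X_i exists.
        while len(samples) < i:
            samples.append(rng.choices(support, weights=probs)[0])
        return samples[i - 1]

    return query


def estimate_heavy_set(query, gamma, delta, k, max_queries=200_000):
    """Query X_1, X_2, ... in order; stop once every element is separated
    from gamma by the confidence radius.

    Assumption (not from the source): the support has at most k elements,
    which is what lets the union bound below cover all of them.
    """
    counts = Counter()
    for t in range(1, max_queries + 1):
        counts[query(t)] += 1
        # Hoeffding radius, union-bounded over k elements and all rounds t,
        # so all empirical frequencies stay within it w.p. at least 1 - delta.
        radius = math.sqrt(math.log(4 * k * t * t / delta) / (2 * t))
        if radius >= gamma:
            continue  # an unseen element could still exceed gamma
        if all(abs(c / t - gamma) > radius for c in counts.values()):
            heavy = {x for x, c in counts.items() if c / t > gamma}
            return heavy, t
    raise RuntimeError("query budget exhausted before the stopping rule fired")
```

For example, `estimate_heavy_set(make_oracle(["a", "b", "c", "d"], [0.5, 0.3, 0.15, 0.05]), gamma=0.2, delta=0.1, k=4)` returns the estimated set together with the number of queries used. The number of queries grows as the smallest gap between any element's probability and $\gamma$ shrinks, which is the kind of dependence a query-complexity bound would quantify.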