We consider a query-based data acquisition problem for binary classification of unknown labels, which has diverse applications in communications, crowdsourcing, recommender systems and active learning. To ensure reliable recovery of unknown labels with as few number of queries as possible, we consider an effective query type that asks "group attribute" of a chosen subset of objects. In particular, we consider the problem of classifying $m$ binary labels with XOR queries that ask whether the number of objects having a given attribute in the chosen subset of size $d$ is even or odd. The subset size $d$, which we call query degree, can be varying over queries. We consider a general noise model where the accuracy of answers on queries changes depending both on the worker (the data provider) and query degree $d$. For this general model, we characterize the information-theoretic limit on the optimal number of queries to reliably recover $m$ labels in terms of a given combination of degree-$d$ queries and noise parameters. Further, we propose an efficient inference algorithm that achieves this limit even when the noise parameters are unknown.
翻译:我们考虑对未知标签进行二进制分类的基于查询的数据获取问题,这种分类在通信、众包、推荐系统以及积极学习方面有着不同的应用。为了确保可靠地恢复未知标签,并尽可能少地提出查询,我们考虑一种有效的查询类型,要求特定对象子集的“群属性”。我们特别考虑对二进制标签进行分类的问题,并使用XOR查询,询问在所选择的美元大小子集中具有特定属性的物体数量是偶数还是奇数。我们称之为查询度的子组规模$d,在查询时可以是不同的。我们考虑一般的噪音模式,根据工人(数据提供者)和查询度(美元)对查询的准确性进行查询。对于这一通用模式,我们用最理想的查询次数来说明信息理论限度,以便可靠地回收某个程度-美元查询和噪音参数的组合中的1百万美元标签。我们建议一种有效的推算法,即使不知道噪音参数,也能够达到这一限度。