We study the problem of clustering a set of items from binary user feedback. Such a problem arises in crowdsourcing platforms solving large-scale labeling tasks with minimal effort put on the users. For example, in some of the recent reCAPTCHA systems, users clicks (binary answers) can be used to efficiently label images. In our inference problem, items are grouped into initially unknown non-overlapping clusters. To recover these clusters, the learner sequentially presents to users a finite list of items together with a question with a binary answer selected from a fixed finite set. For each of these items, the user provides a noisy answer whose expectation is determined by the item cluster and the question and by an item-specific parameter characterizing the {\it hardness} of classifying the item. The objective is to devise an algorithm with a minimal cluster recovery error rate. We derive problem-specific information-theoretical lower bounds on the error rate satisfied by any algorithm, for both uniform and adaptive (list, question) selection strategies. For uniform selection, we present a simple algorithm built upon the K-means algorithm and whose performance almost matches the fundamental limits. For adaptive selection, we develop an adaptive algorithm that is inspired by the derivation of the information-theoretical error lower bounds, and in turn allocates the budget in an efficient way. The algorithm learns to select items hard to cluster and relevant questions more often. We compare the performance of our algorithms with or without the adaptive selection strategy numerically and illustrate the gain achieved by being adaptive.
翻译:我们研究从二进制用户反馈中分组一组项目的问题。 这个问题出现在解决大型标签任务的众包平台中, 用户只付出微小努力, 解决大型标签任务的众包平台中。 例如, 在最近的一些 CAPTCHA 系统中, 用户点击( 二进制答案) 可用于高效标签图像。 在我们的推论问题中, 将项目分组分组归为初始未知的非重叠分组。 为了恢复这些分组, 学习者会依次向用户提交一个有限的项目清单, 以及一个从固定的有限数据集中选择的二进制答案。 对于其中的每一个项目, 用户会给出一个响亮的答案, 其期待是由项目组群和问题决定的, 以及问题和项目分类的某个特定项目具体参数, 显示 显示 ~ 硬硬硬硬硬硬度 选项, 以及 以不精确度的算法 排序 。 我们通过适应性选择一个简单的算法, 将一个简单的算法, 排序到一个更精确的算法 。