Gathering training data is a key step of any supervised learning task, and it is both critical and expensive. Critical, because the quantity and quality of the training data has a high impact on the performance of the learned function. Expensive, because most practical cases rely on humans-in-the-loop to label the data. The process of determining the correct labels is much more expensive than comparing two items to see whether they belong to the same class. Thus motivated, we propose a setting for training data gathering where the human experts perform the comparatively cheap task of answering pairwise queries, and the computer groups the items into classes (which can be labeled cheaply at the very end of the process). Given the items, we consider two random models for the classes: one where the set partition they form is drawn uniformly, the other one where each item chooses its class independently following a fixed distribution. In the first model, we characterize the algorithms that minimize the average number of queries required to cluster the items and analyze their complexity. In the second model, we analyze a specific algorithm family, propose as a conjecture that they reach the minimum average number of queries and compare their performance to a random approach. We also propose solutions to handle errors or inconsistencies in the experts' answers.
翻译:收集培训数据是任何受监督的学习任务的关键一步,它既重要又昂贵。 关键是, 培训数据的数量和质量对学习功能的履行有很大影响。 费用昂贵, 因为大多数实际案例都依赖于在环中的人来给数据贴标签。 确定正确标签的过程比比较两个项目的成本要高得多。 因此, 我们提出一个培训数据收集的设置, 让人类专家执行相对廉价的对口问询任务, 以及计算机将项目分组为类( 可以在过程的末尾以廉价方式标出标签 ) 。 鉴于这两个项目, 我们考虑两种随机模式: 一个是它们构成的定置分区是统一的, 另一个是每个项目按照固定的分布独立选择其类别。 在第一个模型中, 我们描述将项目分类所需的平均查询次数减少到最小, 并分析其复杂性。 在第二个模型中, 我们分析一个具体的算法家庭, 提出一个解决方案, 以达到最低的平均查询次数, 或者将它们的表现与一个随机的方法进行比较。