We consider the basic problem of querying an expert oracle to label a dataset in machine learning. This is typically an expensive and time-consuming process, so we seek ways to do it efficiently. The conventional approach compares each sample with (the representative of) each class to find a match. In a setting with $N$ equally likely classes, this requires $N/2$ pairwise comparisons (queries per sample) on average. We consider a $k$-ary query scheme, with $k\ge 2$ samples per query, that identifies (dis)similar items in the set while effectively exploiting the associated transitive relations. We present a randomized batch algorithm that operates round by round to label the samples and achieves a query rate of $O(\frac{N}{k^2})$. In addition, we present an adaptive greedy query scheme that achieves an average rate of $\approx 0.2N$ queries per sample with triplet queries. For the proposed algorithms, we investigate the query rate both analytically and through simulations. Empirical studies suggest that each triplet query takes an expert at most 50\% more time than a pairwise query, indicating the effectiveness of the proposed $k$-ary query schemes. We generalize the analyses to nonuniform class distributions when possible.
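As a rough illustration of the pairwise baseline quoted above, the following sketch estimates the average number of comparisons per sample. It is not the paper's algorithm. It assumes that one representative per class is already known, that representatives are tested one at a time in a fixed order, and that the last class is never inferred by elimination. The function names are hypothetical. Under these assumptions the exact mean is $(N+1)/2$, which approaches the stated $N/2$ for large $N$.

```python
import random
from fractions import Fraction


def expected_pairwise_queries(num_classes):
    """Exact mean number of pairwise comparisons per sample.

    Assumption: the true class is uniform over `num_classes`, and the
    sample is compared with class representatives in a fixed order, so a
    sample of the i-th class (1-based) is matched on the i-th comparison.
    """
    total = sum(range(1, num_classes + 1))
    return Fraction(total, num_classes)


def simulate_pairwise_queries(num_classes, num_samples, seed=0):
    """Monte Carlo estimate of the same quantity (illustrative only)."""
    rng = random.Random(seed)
    total_queries = 0
    for _ in range(num_samples):
        true_class = rng.randrange(num_classes)  # uniform class index, 0-based
        # Representatives 0, 1, ..., true_class are queried before a match.
        total_queries += true_class + 1
    return total_queries / num_samples


if __name__ == "__main__":
    n = 10
    print("exact mean:", expected_pairwise_queries(n))
    print("simulated mean:", simulate_pairwise_queries(n, 20000))
```

The exact and simulated values should agree closely. A $k$-ary scheme aims to reduce this per-sample cost by comparing several samples in a single query.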