The expense of acquiring labels in large-scale statistical machine learning makes partially and weakly labeled data attractive, though it is not always apparent how to leverage such data for model fitting or validation. We present a methodology that bridges the gap between partial supervision and validation, developing a conformal prediction framework that provides valid predictive confidence sets (sets that cover a true label with a prescribed probability, regardless of the underlying distribution) using weakly labeled data. To do so, we introduce a new, and necessary, notion of coverage and predictive validity, then develop several application scenarios, providing efficient algorithms for classification and several large-scale structured prediction problems. Through several experiments, we corroborate the hypothesis that the new coverage definition allows for tighter and more informative, yet still valid, confidence sets.
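To make the notion of a valid confidence set concrete, the following is a minimal sketch of standard split conformal prediction for classification with fully observed labels (the baseline the paper generalizes to weak labels). The toy classifier, the nonconformity score, and all numbers here are illustrative assumptions, not the paper's method: we score each calibration point by one minus the probability assigned to its true label, take a finite-sample-corrected quantile of those scores, and include in the prediction set every label whose score falls below that threshold.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "classifier": softmax probabilities over 3 classes, loosely tied to
# the true label (purely illustrative, not the paper's model).
def predict_proba(n, y):
    logits = rng.standard_normal((n, 3))
    logits[np.arange(n), y] += 2.0  # make the true class more likely
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

n_cal, alpha = 500, 0.1  # calibration size, target miscoverage level
y_cal = rng.integers(0, 3, n_cal)
p_cal = predict_proba(n_cal, y_cal)

# Nonconformity score: one minus the probability of the true label.
scores = 1.0 - p_cal[np.arange(n_cal), y_cal]

# Conformal quantile with the finite-sample (n + 1) correction.
level = np.ceil((n_cal + 1) * (1 - alpha)) / n_cal
q = np.quantile(scores, level, method="higher")

# Prediction set for new points: every label whose score is below q.
n_test = 200
y_new = rng.integers(0, 3, n_test)
p_new = predict_proba(n_test, y_new)
pred_sets = (1.0 - p_new) <= q  # boolean (n_test, 3) membership matrix

coverage = pred_sets[np.arange(n_test), y_new].mean()
```

By exchangeability of the calibration and test points, the sets cover the true label with probability at least 1 - alpha, marginally over the data; the abstract's contribution is extending this guarantee to settings where the calibration labels themselves are only partially or weakly observed.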