Labeling data for modern machine learning is expensive and time-consuming. Latent variable models can be used to infer labels from weaker, easier-to-acquire sources that operate on unlabeled data. Such models can also be trained using labeled data, presenting a key question: should a user invest in a few labeled or many unlabeled points? We answer this via a framework centered on model misspecification in method-of-moments latent variable estimation. Our core result is a bias-variance decomposition of the generalization error, which shows that the unlabeled-only approach incurs additional bias under misspecification. We then introduce a correction that provably removes this bias in certain cases. We apply our decomposition framework to three scenarios -- well-specified, misspecified, and corrected models -- to 1) choose between labeled and unlabeled data and 2) learn from their combination. We observe, both theoretically and in synthetic experiments, that for well-specified models labeled points are worth a constant factor more than unlabeled points. Under misspecification, however, their relative value is higher due to the additional bias, but it can be reduced with our correction. We also apply our approach to study real-world weak supervision techniques for dataset construction.
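As a rough sketch of the core result (with illustrative notation not defined in the abstract: $\hat{\theta}_n$ for the method-of-moments estimate from $n$ points and $\theta^{*}$ for the target parameter of the true label model), the decomposition takes the familiar form

\[
\mathbb{E}\!\left[\|\hat{\theta}_n - \theta^{*}\|^2\right]
  \;=\; \underbrace{\|\mathbb{E}[\hat{\theta}_n] - \theta^{*}\|^2}_{\text{bias}^2}
  \;+\; \underbrace{\mathbb{E}\!\left[\|\hat{\theta}_n - \mathbb{E}[\hat{\theta}_n]\|^2\right]}_{\text{variance},\ O(1/n)}.
\]

Under a well-specified model both the labeled and unlabeled estimators are asymptotically unbiased, so only the $O(1/n)$ variance terms differ and labeled points are worth a constant factor more; under misspecification the unlabeled-only estimator retains a bias term that does not shrink as $n$ grows, which is what the proposed correction removes in certain cases.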