For many learning problems one may not have access to fine-grained label information; e.g., an image can be labeled as husky, dog, or even animal depending on the expertise of the annotator. In this work, we formalize these settings and study the problem of learning from such coarse data. Instead of observing the actual labels from a set $\mathcal{Z}$, we observe coarse labels corresponding to a partition of $\mathcal{Z}$ (or a mixture of partitions). Our main algorithmic result is that essentially any problem learnable from fine-grained labels can also be learned efficiently when the coarse data are sufficiently informative. We obtain our result through a generic reduction for answering Statistical Queries (SQ) over fine-grained labels given only coarse labels. The number of coarse labels required depends polynomially on the information distortion due to coarsening and on the number of fine labels $|\mathcal{Z}|$. We also investigate the case of (infinitely many) real-valued labels, focusing on a central problem in censored and truncated statistics: Gaussian mean estimation from coarse data. We provide an efficient algorithm when the sets in the partition are convex and establish that the problem is NP-hard even for very simple non-convex sets.
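A minimal sketch of the convex-partition setting in one dimension: the real line is split into intervals (which are exactly the convex sets in 1-D), only the interval containing each Gaussian sample is observed, and the mean is recovered by maximizing the coarse log-likelihood. The specific cut points, sample size, and grid search below are illustrative choices for this toy instance, not the paper's algorithm.

```python
import math
import random
from collections import Counter

# Hypothetical 1-D instance of Gaussian mean estimation from coarse data:
# the real line is partitioned into intervals (convex sets).
CUTS = [-1.0, 0.5, 2.0]  # cells: (-inf,-1], (-1,0.5], (0.5,2], (2,inf)

def coarsen(x):
    """Index of the partition cell containing x (the coarse label)."""
    for i, c in enumerate(CUTS):
        if x <= c:
            return i
    return len(CUTS)

def cell_prob(mu, i, sigma=1.0):
    """P[N(mu, sigma^2) falls in cell i], via the Gaussian CDF."""
    def cdf(t):
        return 0.5 * (1.0 + math.erf((t - mu) / (sigma * math.sqrt(2.0))))
    lo = cdf(CUTS[i - 1]) if i > 0 else 0.0
    hi = cdf(CUTS[i]) if i < len(CUTS) else 1.0
    return hi - lo

def estimate_mean(coarse_labels):
    """Maximum-likelihood estimate of mu from coarse observations only.
    When the cells are convex, the coarse log-likelihood is well behaved
    in mu; a simple grid search keeps this sketch short."""
    counts = Counter(coarse_labels)
    def log_lik(mu):
        return sum(n * math.log(max(cell_prob(mu, i), 1e-300))
                   for i, n in counts.items())
    grid = [g / 100.0 for g in range(-300, 301)]  # candidate mu in [-3, 3]
    return max(grid, key=log_lik)

random.seed(0)
true_mu = 0.7
coarse = [coarsen(random.gauss(true_mu, 1.0)) for _ in range(4000)]
mu_hat = estimate_mean(coarse)  # close to true_mu despite seeing no real values
```

The convexity assumption is what makes this likelihood approach tractable; for non-convex cells the objective can have multiple local optima, consistent with the NP-hardness result stated above.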