What sorts of structure might enable a learner to discover classes from unlabeled data? Traditional approaches rely on feature-space similarity and heroic assumptions about the data. In this paper, we introduce unsupervised learning under Latent Label Shift (LLS), where we have access to unlabeled data from multiple domains such that the label marginals $p_d(y)$ can shift across domains but the class conditionals $p(\mathbf{x}|y)$ do not. This work instantiates a new principle for identifying classes: elements that shift together group together. For finite input spaces, we establish an isomorphism between LLS and topic modeling: inputs correspond to words, domains to documents, and labels to topics. Addressing continuous data, we prove that when each label's support contains a separable region, analogous to an anchor word, oracle access to $p(d|\mathbf{x})$ suffices to identify $p_d(y)$ and $p_d(y|\mathbf{x})$ up to permutation. Thus motivated, we introduce a practical algorithm that leverages domain-discriminative models as follows: (i) push examples through a domain discriminator $p(d|\mathbf{x})$; (ii) discretize the data by clustering examples in $p(d|\mathbf{x})$ space; (iii) perform non-negative matrix factorization on the discrete data; (iv) combine the recovered $p(y|d)$ with the discriminator outputs $p(d|\mathbf{x})$ to compute $p_d(y|\mathbf{x}) \; \forall d$. With semi-synthetic experiments, we show that our algorithm can leverage domain information to improve state-of-the-art unsupervised classification methods. We reveal a failure mode of standard unsupervised classification methods when feature-space similarity does not indicate true groupings, and show empirically that our method better handles this case. Our results establish a deep connection between distribution shift and topic modeling, opening promising lines for future work.
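To make the principle "elements that shift together group together" concrete, here is a minimal sketch on a toy finite-input instance. All numbers, the uniform domain prior, and the helper names (`input_marginals`, `domain_posteriors`, `group_by_posterior`) are illustrative assumptions, not taken from the paper. It covers only the intuition behind steps (i) and (ii): the exact posterior $p(d|\mathbf{x})$ is computed by Bayes' rule in place of a learned discriminator, and exact matching of posterior vectors stands in for clustering. The matrix factorization of step (iii) is not shown.

```python
# Toy finite-input LLS instance; every number here is made up for illustration.
# Labels y in {0, 1}; inputs x in {0, 1, 2, 3}; domains d in {0, 1, 2}.

# p(x | y), shared across domains (the LLS assumption). Rows: y, columns: x.
# Inputs 0 and 1 occur only under label 0 and input 3 only under label 1
# (anchor-like inputs); input 2 occurs under both labels.
P_X_GIVEN_Y = [
    [0.5, 0.3, 0.2, 0.0],
    [0.0, 0.0, 0.4, 0.6],
]

# p_d(y): label marginals that shift across domains. Rows: d, columns: y.
P_Y_GIVEN_D = [
    [0.9, 0.1],
    [0.5, 0.5],
    [0.2, 0.8],
]

# Assumed uniform prior over domains.
P_D = [1 / 3, 1 / 3, 1 / 3]


def input_marginals(p_x_given_y, p_y_given_d):
    """p_d(x) = sum_y p(x|y) p_d(y); plays the role of a document-word matrix."""
    n_inputs = len(p_x_given_y[0])
    return [
        [sum(p_y[y] * p_x_given_y[y][x] for y in range(len(p_y)))
         for x in range(n_inputs)]
        for p_y in p_y_given_d
    ]


def domain_posteriors(p_x_given_d, p_d):
    """p(d|x) by Bayes' rule; returns one row of domain probabilities per input."""
    rows = []
    for x in range(len(p_x_given_d[0])):
        joint = [p_d[d] * p_x_given_d[d][x] for d in range(len(p_d))]
        total = sum(joint)
        rows.append([j / total for j in joint])
    return rows


def group_by_posterior(posteriors, ndigits=9):
    """Group inputs whose p(d|x) vectors coincide (a stand-in for clustering)."""
    groups = {}
    for x, row in enumerate(posteriors):
        key = tuple(round(v, ndigits) for v in row)
        groups.setdefault(key, []).append(x)
    return sorted(groups.values())


p_x_given_d = input_marginals(P_X_GIVEN_Y, P_Y_GIVEN_D)
posteriors = domain_posteriors(p_x_given_d, P_D)
groups = group_by_posterior(posteriors)
print(groups)  # [[0, 1], [2], [3]]
```

Inputs 0 and 1 never co-occur in feature terms, yet they end up in one group because both arise only under label 0, so their domain posteriors both equal $p(d|y=0)$. Input 2 is shared by both labels, so its posterior is a mixture and it forms its own group.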