Self-supervised learning has significantly improved the performance of many NLP tasks. However, how self-supervised learning discovers useful representations, and why it is better than traditional approaches such as probabilistic models, remain largely unknown. In this paper, we focus on the context of topic modeling and highlight a key advantage of self-supervised learning: when applied to data generated by topic models, self-supervised learning can be oblivious to the specific model, and hence is less susceptible to model misspecification. In particular, we prove that commonly used self-supervised objectives based on reconstruction or contrastive samples can both recover useful posterior information for general topic models. Empirically, we show that the same objectives can perform on par with posterior inference using the correct model, while outperforming posterior inference using misspecified models.