Self-supervised learning is an increasingly popular approach to unsupervised learning, achieving state-of-the-art results. A prevalent approach consists in contrasting data points and noise points within a classification task: this requires a good noise distribution, which is notoriously hard to specify. While a comprehensive theory is missing, it is widely assumed that the optimal noise distribution should in practice be made equal to the data distribution, as in Generative Adversarial Networks (GANs). We here empirically and theoretically challenge this assumption. We turn to Noise-Contrastive Estimation (NCE), which grounds this self-supervised task as an estimation problem of an energy-based model of the data. This ties the optimality of the noise distribution to the sample efficiency of the estimator, which is rigorously defined as its asymptotic variance, or mean-squared error. In the special case where only the normalization constant is unknown, we show that NCE recovers a family of Importance Sampling estimators for which the optimal noise is indeed equal to the data distribution. However, in the general case where the energy is also unknown, we prove that the optimal noise density is the data density multiplied by a correction term based on the Fisher score. In particular, the optimal noise distribution is different from the data distribution, and is even from a different family. Nevertheless, we soberly conclude that the optimal noise may be hard to sample from, and the gain in efficiency can be modest compared to choosing the noise distribution equal to the data's.
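For readers unfamiliar with the setup, the sketch below recalls the standard NCE objective underlying this contrastive task, and the importance-sampling estimator of the normalization constant that the special case reduces to; the notation \(p_d\), \(p_n\), \(\nu\), and \(p_\theta\) is the usual NCE notation and is not taken from this abstract.
\[
h_\theta(x) = \frac{p_\theta(x)}{p_\theta(x) + \nu\, p_n(x)},
\qquad
\mathcal{J}(\theta) = \mathbb{E}_{x \sim p_d}\bigl[\log h_\theta(x)\bigr]
+ \nu\, \mathbb{E}_{x \sim p_n}\bigl[\log\bigl(1 - h_\theta(x)\bigr)\bigr],
\]
where \(p_d\) is the data density, \(p_n\) the noise density from which \(\nu\) times as many samples are drawn, and \(h_\theta\) the posterior probability that a point was drawn from the model \(p_\theta\) rather than the noise. When only the normalization constant \(Z\) is unknown, i.e. \(p_\theta = f/Z\) with \(f\) fixed, the task reduces to estimating \(Z = \int f(x)\,dx\), for which the importance-sampling estimator
\[
\hat{Z} = \frac{1}{N} \sum_{i=1}^{N} \frac{f(x_i)}{p_n(x_i)},
\qquad x_i \sim p_n,
\]
has minimal variance precisely when \(p_n \propto f\), that is, when the noise distribution equals the data distribution.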