Self-supervised metric learning has been a successful approach for learning a distance metric from an unlabeled dataset. The resulting distance is broadly useful for improving various distance-based downstream tasks, even when no information from the downstream tasks is utilized in the metric learning stage. To gain insight into this approach, we develop a statistical framework to theoretically study how self-supervised metric learning can benefit downstream tasks in the context of multi-view data. Under this framework, we show that the target distance of metric learning satisfies several desired properties for the downstream tasks. On the other hand, our investigation suggests that the target distance can be further improved by moderating each direction's weights. In addition, our analysis precisely characterizes the improvement by self-supervised metric learning on four commonly used downstream tasks: sample identification, two-sample testing, $k$-means clustering, and $k$-nearest neighbor classification. As a by-product, we propose a simple spectral method for self-supervised metric learning, which is computationally efficient and minimax optimal for estimating the target distance. Finally, numerical experiments are presented to support the theoretical results in the paper.
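To make the spectral idea concrete, the following is a minimal illustrative sketch, not the paper's exact algorithm: the multi-view model, dimensions, noise level, and weighting scheme below are all hypothetical assumptions. The key intuition it demonstrates is that, when two views of a sample share a latent signal but have independent noise, the cross-view covariance suppresses the noise, so its top eigenvectors recover the signal directions, which can then weight a Mahalanobis-type distance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical multi-view model: each sample has two views sharing a latent signal.
n, d, k = 2000, 10, 3
A = rng.normal(size=(d, k))                    # signal loading matrix (assumed)
z = rng.normal(size=(n, k))                    # shared latent factors
x1 = z @ A.T + 0.5 * rng.normal(size=(n, d))  # view 1 = signal + independent noise
x2 = z @ A.T + 0.5 * rng.normal(size=(n, d))  # view 2 = signal + independent noise

# Spectral step: the cross-view covariance keeps only the shared signal,
# because the view-specific noises are independent and average out.
C = (x1.T @ x2) / n
C = (C + C.T) / 2                              # symmetrize before eigendecomposition
eigvals, eigvecs = np.linalg.eigh(C)
top = np.argsort(eigvals)[::-1][:k]            # top-k estimated signal directions
V, w = eigvecs[:, top], np.clip(eigvals[top], 0.0, None)

# Learned Mahalanobis-type metric: weight each signal direction by its
# estimated signal strength (one possible choice of direction weights).
M = V @ np.diag(w) @ V.T

def learned_dist(a, b):
    """Distance under the learned metric M."""
    diff = a - b
    return float(np.sqrt(diff @ M @ diff))
```

In this sketch the eigenvalues serve as per-direction weights; the paper's point that the target distance "can be further improved by moderating each direction's weights" corresponds to replacing `w` with a moderated weighting.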