We consider a setting in which a model must adapt to a new domain under distribution shift, given that only unlabeled test samples from the new domain are accessible at test time. A common idea in most related works is to construct pseudo-labels for the unlabeled test samples and apply gradient descent (GD) to a loss function defined with those pseudo-labels. Recently, \cite{GSRK22} proposed conjugate labels, a new kind of pseudo-label for self-training at test time, and showed empirically that conjugate labels outperform other pseudo-labeling schemes on many domain adaptation benchmarks. However, provably showing that GD with conjugate labels learns a good classifier for test-time adaptation remains open. In this work, we aim to theoretically understand GD with hard and conjugate labels for a binary classification problem. We show that for the square loss, GD with conjugate labels converges to an $\epsilon$-optimal predictor under a Gaussian model for any arbitrarily small $\epsilon$, while GD with hard pseudo-labels fails at this task. We also analyze both schemes under different loss functions for the update. Our results shed light on when and why GD with hard or conjugate labels works in test-time adaptation.
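To make the contrast between the two pseudo-labeling schemes concrete, here is a minimal sketch (our own illustration, not the paper's construction) of one GD step of test-time self-training for a linear binary classifier with the logistic loss $\ell(h, y) = \log(1 + e^h) - yh$, where $h$ is the logit and $y \in \{0, 1\}$. Hard pseudo-labels threshold the logit and are held fixed in the gradient. Following \cite{GSRK22}, the conjugate label is $\nabla f(h) = \sigma(h)$ for $f(h) = \log(1 + e^h)$; substituting it back into $\ell$ gives the binary entropy of $\sigma(h)$, so the gradient also flows through the label. The function name `tta_step`, the linear model, and the logistic loss (rather than the square loss analyzed in the paper) are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tta_step(w, X, lr=0.1, labels="conjugate"):
    """One GD step of test-time self-training for a linear binary
    classifier with logistic loss l(h, y) = log(1 + e^h) - y*h.
    Minimal sketch for illustration; X holds unlabeled test samples."""
    h = X @ w                       # logits on the unlabeled test batch
    p = sigmoid(h)
    if labels == "hard":
        y = (h > 0).astype(float)   # hard pseudo-labels, held fixed
        dldh = p - y                # d/dh of log(1 + e^h) - y*h
    else:
        # conjugate label y = sigmoid(h); plugging it into the loss gives
        # the binary entropy of sigmoid(h), and the gradient flows through
        # the label as well: d/dh H_b(sigmoid(h)) = -h * p * (1 - p)
        dldh = -h * p * (1.0 - p)
    grad = X.T @ dldh / len(X)      # average gradient over the batch
    return w - lr * grad
```

With conjugate labels this step decreases the entropy of the predictions (pushing logits away from zero), whereas the hard-label step moves each logit toward its own thresholded sign.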