Cross-domain text classification aims to adapt a model to a target domain that lacks labeled data by leveraging rich labeled data from one or more different but related source domains together with unlabeled data from the target domain. To this end, previous work focuses on extracting either domain-invariant or task-agnostic features, ignoring domain-aware features that may be present in the target domain and useful for the downstream task. In this paper, we propose a two-stage framework for cross-domain text classification. In the first stage, we fine-tune the model with masked language modeling (MLM) and labeled data from the source domain. In the second stage, we further fine-tune the model with self-supervised distillation (SSD) and unlabeled data from the target domain. We evaluate our framework on a public cross-domain text classification benchmark, and the experimental results show that it achieves new state-of-the-art results for both single-source domain adaptation (94.17%, $\uparrow$1.03%) and multi-source domain adaptation (95.09%, $\uparrow$1.34%).
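As a rough illustration of the two-stage recipe described above, the following is a minimal sketch, not the authors' implementation. It assumes a BERT-style encoder from the Hugging Face transformers library, equal weighting of the MLM and classification losses in stage 1, and a simple teacher-student KL distillation on target-domain text as a stand-in for the paper's SSD objective; the model name, loss weights, temperature, and helper structure are all illustrative assumptions.

```python
# Sketch of the two-stage cross-domain fine-tuning recipe (assumptions noted above).
import copy
import torch
import torch.nn.functional as F
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm_model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
classifier_head = torch.nn.Linear(mlm_model.config.hidden_size, 2)  # e.g. binary sentiment
mlm_collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
optimizer = torch.optim.AdamW(
    list(mlm_model.parameters()) + list(classifier_head.parameters()), lr=2e-5)


def stage1_step(src_texts, src_labels):
    """Stage 1: joint MLM + supervised loss on labeled source-domain data."""
    enc = tokenizer(src_texts, padding=True, truncation=True, return_tensors="pt")
    # MLM loss on randomly masked copies of the source sentences
    masked = mlm_collator([{"input_ids": ids} for ids in enc["input_ids"]])
    mlm_loss = mlm_model(input_ids=masked["input_ids"],
                         attention_mask=enc["attention_mask"],
                         labels=masked["labels"]).loss
    # Classification loss on the [CLS] representation of the unmasked input
    hidden = mlm_model.base_model(**enc).last_hidden_state[:, 0]
    cls_loss = F.cross_entropy(classifier_head(hidden), torch.tensor(src_labels))
    loss = mlm_loss + cls_loss  # equal weighting is an assumption
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


def stage2_step(tgt_texts, teacher, temperature=2.0):
    """Stage 2: self-supervised distillation on unlabeled target-domain data.
    A frozen copy of the stage-1 model (teacher) provides soft targets that the
    student matches; this KL objective is an illustrative stand-in for SSD."""
    enc = tokenizer(tgt_texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        t_hidden = teacher["encoder"](**enc).last_hidden_state[:, 0]
        t_logits = teacher["head"](t_hidden)
    s_hidden = mlm_model.base_model(**enc).last_hidden_state[:, 0]
    s_logits = classifier_head(s_hidden)
    kd_loss = F.kl_div(F.log_softmax(s_logits / temperature, dim=-1),
                       F.softmax(t_logits / temperature, dim=-1),
                       reduction="batchmean") * temperature ** 2
    optimizer.zero_grad()
    kd_loss.backward()
    optimizer.step()
    return kd_loss.item()


# Usage: after stage 1 converges, freeze a copy as the teacher, then run stage 2:
# teacher = {"encoder": copy.deepcopy(mlm_model.base_model).eval(),
#            "head": copy.deepcopy(classifier_head).eval()}
```

The key design point the sketch tries to capture is that the second stage never uses target-domain labels: the only supervision comes from the stage-1 model's own predictions on target-domain text, which is what allows domain-aware features of the target domain to be exploited.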