We present a neural semi-supervised learning model termed Self-Pretraining. Our model is inspired by the classic self-training algorithm. However, as opposed to self-training, Self-Pretraining is threshold-free, can potentially update its belief about previously labeled documents, and can cope with the semantic drift problem. Self-Pretraining is iterative and consists of two classifiers. In each iteration, one classifier draws a random set of unlabeled documents and labels them. This set is used to initialize the second classifier, which is then further trained on the set of labeled documents. The algorithm proceeds to the next iteration, in which the classifiers' roles are reversed. To improve the flow of information across iterations and to cope with the semantic drift problem, Self-Pretraining employs an iterative distillation process, transfers hypotheses across iterations, utilizes a two-stage training model, uses an efficient learning rate schedule, and employs a pseudo-label transformation heuristic. We have evaluated our model on three publicly available social media datasets. Our experiments show that Self-Pretraining outperforms existing state-of-the-art semi-supervised classifiers across multiple settings. Our code is available at https://github.com/p-karisani/self_pretraining.
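The following is a minimal, illustrative sketch of the alternating two-classifier loop described above, under several assumptions: the helper `make_classifier`, the `SGDClassifier` stand-in for the paper's neural models, and the constants `n_iterations` and `sample_size` are hypothetical. The actual method additionally uses iterative distillation, hypothesis transfer, the two-stage training model, the learning rate schedule, and the pseudo-label transformation heuristic, none of which are shown here.

```python
# Illustrative sketch only; not the authors' implementation.
import numpy as np
from sklearn.linear_model import SGDClassifier


def make_classifier():
    # Placeholder for the paper's neural classifier (hypothetical choice).
    return SGDClassifier(loss="log_loss", random_state=0)


def self_pretraining(X_labeled, y_labeled, X_unlabeled,
                     n_iterations=10, sample_size=100, seed=0):
    rng = np.random.default_rng(seed)
    classes = np.unique(y_labeled)

    # Initial classifier trained on the labeled documents.
    teacher = make_classifier()
    teacher.partial_fit(X_labeled, y_labeled, classes=classes)

    for _ in range(n_iterations):
        # 1) One classifier draws a random set of unlabeled documents and labels them.
        idx = rng.choice(len(X_unlabeled),
                         size=min(sample_size, len(X_unlabeled)),
                         replace=False)
        pseudo_labels = teacher.predict(X_unlabeled[idx])

        # 2) The pseudo-labeled set initializes ("pretrains") the second classifier ...
        student = make_classifier()
        student.partial_fit(X_unlabeled[idx], pseudo_labels, classes=classes)

        # 3) ... which is then further trained on the labeled documents.
        student.partial_fit(X_labeled, y_labeled)

        # 4) The classifiers' roles are reversed for the next iteration.
        teacher = student

    return teacher
```

Note that no confidence threshold is applied when selecting unlabeled documents, and each iteration starts from a fresh classifier, which is what allows earlier pseudo-labels to be revised.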