In the era of pre-trained language models, Transformers are the de facto choice of model architecture. While recent research has shown promise in entirely convolutional (CNN) architectures, they have not been explored under the pre-train-fine-tune paradigm. In the context of language models, are convolutional models competitive with Transformers when pre-trained? This paper investigates this research question and presents several interesting findings. Across an extensive set of experiments on 8 datasets/tasks, we find that CNN-based pre-trained models are competitive and outperform their Transformer counterparts in certain scenarios, albeit with caveats. Overall, the findings outlined in this paper suggest that conflating pre-training and architectural advances is misguided and that the two should be considered independently. We believe our research paves the way for a healthy amount of optimism toward alternative architectures.