In recent years, pretrained language models have revolutionized the NLP world, achieving state-of-the-art performance on various downstream tasks. However, in many cases these models do not perform well when labeled data is scarce and the model is expected to operate in the zero- or few-shot setting. Recently, several works have shown that continual pretraining, or performing a second phase of pretraining (inter-training) that is better aligned with the downstream task, can lead to improved results, especially in the scarce-data setting. Here, we propose to leverage sentiment-carrying discourse markers to generate large-scale weakly-labeled data, which in turn can be used to adapt language models for sentiment analysis. Extensive experimental results show the value of our approach on various benchmark datasets, including datasets from the finance domain. Code, models, and data are available at https://github.com/ibm/tslm-discourse-markers.
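To make the core idea concrete, below is a minimal sketch of marker-based weak labeling, assuming hypothetical marker lists ("fortunately", "sadly", etc.) and a simple sentence-initial heuristic; it is illustrative only and does not reproduce the paper's actual marker set or filtering pipeline.

```python
import re

# Illustrative sketch only: hypothetical marker lists and a simple rule,
# not the paper's actual marker set or data-generation pipeline.
POSITIVE_MARKERS = {"fortunately", "luckily", "thankfully"}
NEGATIVE_MARKERS = {"unfortunately", "sadly", "regrettably"}


def weak_label(sentence: str):
    """Assign a weak sentiment label when the sentence opens with a
    sentiment-carrying discourse marker; return None otherwise."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    if not tokens:
        return None
    head = tokens[0]
    if head in POSITIVE_MARKERS:
        return "positive"
    if head in NEGATIVE_MARKERS:
        return "negative"
    return None


corpus = [
    "Fortunately, the earnings beat expectations.",
    "Sadly, the merger fell through.",
    "The report was released on Tuesday.",
]
# Keep only sentences that receive a weak label; these could then be used
# for a second phase of sentiment-oriented pretraining.
weakly_labeled = [(s, weak_label(s)) for s in corpus if weak_label(s) is not None]
print(weakly_labeled)
```

Such weakly-labeled examples can then serve as supervision for the inter-training phase before fine-tuning on the target sentiment task.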