Despite recent advances in deep learning-based language modelling, many natural language processing (NLP) tasks in the financial domain remain challenging due to the paucity of appropriately labelled data. Task performance can also be limited by differences in word distribution between the general corpora typically used to pre-train language models and financial corpora, which often exhibit specialised language and symbology. Here, we investigate two approaches that may help to mitigate these issues. First, we experiment with further language model pre-training on large amounts of in-domain data from business and financial news. We then apply data augmentation to increase the size of the dataset used for model fine-tuning. We report our findings on an Environmental, Social and Governance (ESG) controversies dataset and demonstrate that both approaches improve accuracy on classification tasks.
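To make the first approach concrete, below is a minimal sketch of further (domain-adaptive) pre-training with a masked-language-modelling objective, assuming the HuggingFace Transformers and Datasets libraries. The base checkpoint, corpus, and hyperparameters are illustrative placeholders, not the paper's actual setup.

```python
# Continued MLM pre-training on in-domain text (a sketch, not the paper's exact recipe).
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Hypothetical in-domain corpus: one business/financial-news sentence per entry.
corpus = [
    "The regulator fined the bank over governance failures.",
    "Shares fell sharply after the ESG controversy was reported.",
]
dataset = Dataset.from_dict({"text": corpus})

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens: the standard BERT masked-language-model objective.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="finance-mlm",
        num_train_epochs=1,
        per_device_train_batch_size=8,
    ),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
# The adapted checkpoint can then be fine-tuned for ESG-controversy classification.
```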
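The second approach, data augmentation, can take many forms; the abstract does not specify which one is used, so the sketch below shows one common, purely illustrative option (EDA-style random word deletion and swap) that generates extra training variants of a labelled sentence.

```python
# EDA-style text augmentation (an illustrative placeholder, not the paper's method).
import random

def random_deletion(words, p=0.1):
    """Drop each word with probability p, keeping at least one word."""
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]

def random_swap(words, n_swaps=1):
    """Swap two randomly chosen positions n_swaps times."""
    words = words.copy()
    for _ in range(n_swaps):
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def augment(sentence, n_copies=2):
    """Produce n_copies perturbed variants of a sentence; each keeps the original label."""
    words = sentence.split()
    return [" ".join(random_swap(random_deletion(words))) for _ in range(n_copies)]

print(augment("The regulator fined the bank over governance failures."))
```

Because the perturbations are small, each variant is assumed to keep the original sentence's class label, which is what allows the fine-tuning set to grow without additional annotation.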