To build a French national electronic injury surveillance system based on emergency room visits, we aim to develop a coding system that classifies their causes from free-text clinical notes. Supervised learning techniques have shown good results in this area but require large expert-annotated datasets, which are time-consuming and costly to obtain. We hypothesize that a Transformer-based natural language processing model incorporating a generative self-supervised pre-training step can significantly reduce the number of annotated samples required for supervised fine-tuning. In this preliminary study, we test our hypothesis on the simplified problem of predicting, from free-text clinical notes, whether a visit is the consequence of a traumatic event. Using fully re-trained GPT-2 models (without the OpenAI pre-trained weights), we assess the gain of applying a self-supervised pre-training phase on unlabeled notes prior to the supervised learning task. Results show that the amount of labeled data required to achieve a given level of performance (AUC > 0.95) was reduced by a factor of 10 when pre-training was applied. Namely, with 16 times more labeled data, the fully supervised model improved AUC by less than 1%. To conclude, it is possible to adapt a multi-purpose neural language model such as GPT-2 into a powerful tool for classifying free-text notes with only a small number of labeled samples.
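The two-phase approach described above (generative self-supervised pre-training on unlabeled notes, followed by supervised fine-tuning on a small labeled set) can be sketched as follows. This is a minimal illustration, not the authors' code: the file names (unlabeled_notes.txt, labeled_notes.csv), hyperparameters, and the use of the standard English gpt2 tokenizer as a stand-in for a tokenizer retrained on French clinical notes are all assumptions made for the example.

```python
# Sketch of the pre-train-then-fine-tune pipeline with Hugging Face transformers.
# All paths, model sizes, and training settings are illustrative assumptions.
from transformers import (
    GPT2Config, GPT2LMHeadModel, GPT2ForSequenceClassification,
    GPT2TokenizerFast, Trainer, TrainingArguments,
    DataCollatorForLanguageModeling,
)
from datasets import load_dataset

# Placeholder tokenizer; in practice it would be retrained on French clinical text.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

# Phase 1: self-supervised pre-training (causal language modeling) on
# unlabeled notes, starting from randomly initialized weights (no OpenAI weights).
config = GPT2Config(vocab_size=tokenizer.vocab_size)
lm_model = GPT2LMHeadModel(config)

unlabeled = load_dataset("text", data_files={"train": "unlabeled_notes.txt"})
unlabeled = unlabeled.map(tokenize, batched=True, remove_columns=["text"])

Trainer(
    model=lm_model,
    args=TrainingArguments(output_dir="pretrained_gpt2", num_train_epochs=3),
    train_dataset=unlabeled["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
lm_model.save_pretrained("pretrained_gpt2")

# Phase 2: supervised fine-tuning on a small labeled set
# (label 1 = visit caused by a traumatic event, 0 = other cause).
clf = GPT2ForSequenceClassification.from_pretrained("pretrained_gpt2", num_labels=2)
clf.config.pad_token_id = tokenizer.pad_token_id

labeled = load_dataset("csv", data_files={"train": "labeled_notes.csv"})
labeled = labeled.map(tokenize, batched=True)

Trainer(
    model=clf,
    args=TrainingArguments(output_dir="finetuned_gpt2", num_train_epochs=3),
    train_dataset=labeled["train"],
).train()
```

The fully supervised baseline in the study corresponds to running only the second phase from randomly initialized weights, which is what the comparison of required sample sizes is measured against.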