There is growing interest in the automated extraction of relevant information from clinical dialogues. However, it is difficult to collect and construct large annotated resources for clinical dialogue tasks. Recent developments in natural language processing suggest that large-scale pre-trained language backbones could be leveraged for such machine comprehension and information extraction tasks. Yet, due to the gap between the pre-training and downstream clinical domains, it remains challenging to exploit generic backbones for domain-specific applications. Therefore, in this work, we propose domain-specific language pre-training to improve performance on downstream tasks such as dialogue comprehension. In addition to the common token-level masking pre-training method, and motivated by the nature of human conversations and the interactive flow of multi-topic inquiry-answering dialogues, we further propose sample generation strategies based on speaker and utterance manipulation. The conversational pre-training guides the language backbone to reconstruct utterances coherently from the remaining context, thus bridging the gap between the general and specific domains. Experiments are conducted on a clinical conversation dataset for symptom checking, where nurses inquire about and discuss symptom information with patients. We empirically show that a neural model with our proposed approach improves on the dialogue comprehension task and achieves favorable results in low-resource training scenarios.
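To make the sample generation strategies concrete, below is a minimal, self-contained Python sketch of the three corruption operations the abstract names: token-level masking, speaker manipulation, and utterance manipulation. The function names, masking probabilities, and the toy nurse-patient exchange are all hypothetical illustrations, not the paper's actual implementation or data.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, prob=0.15, rng=random):
    """Token-level masking: replace a fraction of tokens with [MASK];
    the model learns to reconstruct them from context."""
    return [MASK if rng.random() < prob else t for t in tokens]

def mask_speakers(dialogue, prob=0.5, rng=random):
    """Speaker manipulation: hide some speaker roles so the model must
    infer who is talking (nurse vs. patient) from the utterance itself."""
    return [(MASK if rng.random() < prob else spk, utt)
            for spk, utt in dialogue]

def mask_utterance(dialogue, rng=random):
    """Utterance manipulation: blank out one whole turn; the model is
    trained to reconstruct it coherently from the remaining context."""
    i = rng.randrange(len(dialogue))
    corrupted = list(dialogue)
    spk, utt = corrupted[i]
    corrupted[i] = (spk, [MASK])
    return corrupted, (i, utt)

# Toy symptom-checking exchange (hypothetical data for illustration).
dialogue = [
    ("nurse",   "do you have any chest pain".split()),
    ("patient", "yes it started two days ago".split()),
    ("nurse",   "any shortness of breath".split()),
    ("patient", "no just the pain".split()),
]

# Compose corruptions to build one conversational pre-training sample.
corrupted, (turn, target) = mask_utterance(mask_speakers(dialogue))
print(corrupted)
print("reconstruct turn", turn, ":", " ".join(target))
```

Each corrupted dialogue, paired with the original turn it hides, forms one pre-training example; the backbone is then optimized to recover the masked tokens, speaker roles, or whole utterances from the surrounding context.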