Information extraction from conversational data is particularly challenging: because conversation is task-centric, humans communicate much of the relevant information implicitly, which works well for them but is difficult for machines to recover. The challenges may differ between utterances depending on the speaker's role in the conversation, especially when relevant expertise is distributed asymmetrically across roles. They may also grow over the course of the conversation, as more shared context is built up through information communicated implicitly earlier in the dialogue. In this paper, we propose MedFilter, a novel modeling approach that addresses these insights to improve the identification and categorization of task-relevant utterances and, in doing so, improves performance on a downstream information extraction task. We evaluate this approach on a corpus of nearly 7,000 doctor-patient conversations, where MedFilter is used to identify medically relevant contributions to the discussion, achieving a 10% improvement over state-of-the-art (SOTA) baselines in terms of area under the precision-recall (PR) curve. Identifying task-relevant utterances benefits downstream medical processing, yielding improvements of 15%, 105%, and 23% for the extraction of symptoms, medications, and complaints, respectively.