Conversational AI systems can engage in unsafe behaviour when handling users' medical queries, which can have severe consequences and could even lead to death. Systems therefore need to be capable of both recognising the seriousness of medical inputs and producing responses with appropriate levels of risk. We create a corpus of human-written English-language medical queries and the responses of different types of systems. We label these with both crowdsourced and expert annotations. While individual crowdworkers may be unreliable at grading the seriousness of the prompts, their aggregated labels tend to agree more closely with professional opinion in identifying the medical queries and recognising the risk types posed by the responses. The results of classification experiments suggest that, while these tasks can be automated, caution should be exercised, as errors can potentially be very serious.