Several deep learning architectures have been proposed in recent years to address the problem of generating a written report from an imaging exam given as input. Most works evaluate the generated reports using standard Natural Language Processing (NLP) metrics (e.g., BLEU, ROUGE) and report significant progress. In this article, we put this progress in perspective by comparing state-of-the-art (SOTA) models against weak baselines. We show that simple and even naive approaches achieve near-SOTA performance on most traditional NLP metrics. We conclude that evaluation methods for this task should be studied further so that they correctly measure clinical accuracy, ideally with physicians contributing to this end.
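As a hypothetical illustration of why n-gram metrics can reward naive baselines (the toy reports and the "constant report" baseline below are our own assumptions, not taken from the paper's experiments): radiology reports are highly templated, so always emitting one frequent template already overlaps heavily with many references. The sketch uses NLTK's sentence-level BLEU.

```python
# Minimal sketch (assumed setup, not the paper's code): score a naive
# "constant report" baseline with BLEU. Requires NLTK.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Toy ground-truth reports, tokenized. Real chest X-ray reports are
# highly templated, which is what lets constant outputs score well.
references = [
    "no acute cardiopulmonary abnormality .".split(),
    "heart size is normal . lungs are clear .".split(),
    "no evidence of acute disease .".split(),
]

# Naive baseline: always emit the same frequent template,
# regardless of the input image.
constant_report = "no acute cardiopulmonary abnormality .".split()

smooth = SmoothingFunction().method1  # avoids zero scores on short texts
scores = [
    sentence_bleu([ref], constant_report, smoothing_function=smooth)
    for ref in references
]
print(f"mean BLEU of constant baseline: {sum(scores) / len(scores):.3f}")
```

A baseline like this ignores the image entirely, yet its n-gram overlap with typical reports can approach that of learned models, which is the kind of gap between NLP-metric performance and clinical accuracy the article highlights.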