Structured (tabular) data in the preclinical and clinical domains contains valuable information about individuals, and an efficient table-to-text summarization system can drastically reduce the manual effort needed to condense this data into reports. In practice, however, the task is heavily impeded by data paucity, data sparsity, and the inability of state-of-the-art natural language generation models (including T5, PEGASUS and GPT-Neo) to produce accurate and reliable outputs. In this paper, we propose a novel table-to-text approach that tackles these problems with a two-step architecture enhanced by auto-correction, a copy mechanism, and synthetic data augmentation. The study shows that the proposed approach selects salient biomedical entities and values from structured data and copies tabular values with improved precision (up to a 0.13 absolute increase), generating coherent and accurate text for assay validation reports and toxicology reports. Moreover, we demonstrate a lightweight adaptation of the proposed system to new datasets by fine-tuning with as little as 40\% of the training examples. The outputs of our model are validated by human experts in a human-in-the-loop scenario.