In this paper, we propose DEXTER, an end-to-end system for extracting information from tables in medical health documents, such as electronic health records (EHR) and explanation of benefits (EOB). DEXTER consists of four sub-system stages: i) table detection, ii) table type classification, iii) cell detection, and iv) cell content extraction. We propose a two-stage transfer-learning-based approach using the CDeC-Net architecture together with non-maximal suppression for table detection. We design a conventional computer-vision-based approach for table type classification and cell detection that uses kernels parameterized by image size to detect rows and columns. Finally, we extract the text from the detected cells using the pre-existing OCR engine Tesseract. To evaluate our system, we manually annotated a sample of a real-world medical dataset (referred to as Meddata) containing documents with wide variations in appearance and covering different table structures, such as bordered, partially bordered, borderless, and coloured tables. We show experimentally that DEXTER outperforms the commercially available Amazon Textract and Microsoft Azure Form Recognizer systems on this annotated real-world medical dataset.
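To illustrate the cell detection and cell content extraction stages, the sketch below shows one common way to realize image-size-parameterized morphological kernels for row/column line detection, followed by Tesseract-based text extraction from a detected cell. It is a minimal sketch assuming OpenCV and pytesseract; the kernel divisor, threshold settings, and helper names are illustrative assumptions, not the actual parameters or code of DEXTER.

```python
import cv2
import numpy as np
import pytesseract


def detect_row_column_lines(image_bgr, scale=40):
    """Extract horizontal and vertical table lines using morphological
    kernels whose lengths scale with the image dimensions.
    The divisor `scale` is an illustrative choice, not DEXTER's value."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    # Invert and binarize so that table lines become white foreground.
    binary = cv2.adaptiveThreshold(
        ~gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 15, -2)

    h, w = binary.shape
    # Kernel sizes are parameterized by image width/height.
    horiz_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (max(1, w // scale), 1))
    vert_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, max(1, h // scale)))

    # Erode then dilate so that only long horizontal / vertical strokes survive.
    horizontal = cv2.dilate(cv2.erode(binary, horiz_kernel), horiz_kernel)
    vertical = cv2.dilate(cv2.erode(binary, vert_kernel), vert_kernel)
    return horizontal, vertical


def extract_cell_text(image_bgr, cell_box):
    """Hypothetical helper: crop a detected cell and run Tesseract OCR on it."""
    x, y, cw, ch = cell_box
    crop = image_bgr[y:y + ch, x:x + cw]
    # --psm 6 treats the crop as a single uniform block of text.
    return pytesseract.image_to_string(crop, config="--psm 6").strip()
```

In practice, the horizontal and vertical line masks can be intersected to locate grid intersections and derive cell bounding boxes, which are then passed to the OCR step; how DEXTER combines these masks is not specified here.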