The use of Electronic Health Records (EHRs) has increased dramatically in the past 15 years, as, it is considered an important source of managing data od patients. The EHRs are primary sources of disease diagnosis and demographic data of patients worldwide. Therefore, the data can be utilized for secondary tasks such as research. This paper aims to make such data usable for research activities such as monitoring disease statistics for a specific population. As a result, the researchers can detect the disease causes for the behavior and lifestyle of the target group. One of the limitations of EHRs systems is that the data is not available in the standard format but in various forms. Therefore, it is required to first convert the names of the diseases and demographics data into one standardized form to make it usable for research activities. There is a large amount of EHRs available, and solving the standardizing issues requires some optimized techniques. We used a first-hand EHR dataset extracted from EHR systems. Our application uploads the dataset from the EHRs and converts it to the ICD-10 coding system to solve the standardization problem. So, we first apply the steps of pre-processing, annotation, and transforming the data to convert it into the standard form. The data pre-processing is applied to normalize demographic formats. In the annotation step, a machine learning model is used to recognize the diseases from the text. Furthermore, the transforming step converts the disease name to the ICD-10 coding format. The model was evaluated manually by comparing its performance in terms of disease recognition with an available dictionary-based system (MetaMap). The accuracy of the proposed machine learning model is 81%, that outperformed MetaMap accuracy of 67%. This paper contributed to system modelling for EHR data extraction to support research activities.
翻译:暂无翻译