Rapid progress in natural language processing has led to its adoption in a variety of industrial and enterprise settings, including information extraction, specifically named entity recognition and relation extraction, from documents such as engineering manuals and field maintenance reports. While named entity recognition is a well-studied problem, existing state-of-the-art approaches require large labelled datasets, which are hard to acquire for sensitive data such as maintenance records. Further, industrial domain experts tend to distrust results from black-box machine learning models, especially when the extracted information is used in downstream predictive maintenance analytics. We overcome these challenges by developing three approaches built on domain expert knowledge captured in dictionaries and ontologies. On top of a base dictionary lookup, we develop a syntactic and semantic rules-based approach and an approach that leverages a pre-trained language model fine-tuned for a question-answering task, to extract entities of interest from maintenance records. We also develop a preliminary ontology to represent and capture the semantics of maintenance records. Our evaluations on a real-world aviation maintenance records dataset show promising results and help identify challenges specific to named entity recognition in noisy industrial data.
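As an illustration of the base dictionary lookup mentioned above, the following is a minimal sketch, not the authors' implementation. The dictionary entries, the entity type labels, and the function names `build_pattern` and `extract_entities` are all hypothetical, and matching longest terms first with a case-insensitive regular expression is an assumed design choice.

```python
import re

# Hypothetical domain dictionary: lowercase surface form -> entity type.
# Real entries would come from domain experts; these are placeholders.
DICTIONARY = {
    "fuel pump": "COMPONENT",
    "hydraulic line": "COMPONENT",
    "leak": "PROBLEM",
    "replaced": "ACTION",
}


def build_pattern(dictionary):
    """Compile one case-insensitive pattern covering every dictionary term."""
    # Sort longest first so multi-word terms are tried before shorter ones.
    terms = sorted(dictionary, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in terms)
    return re.compile(r"\b(" + alternation + r")\b", re.IGNORECASE)


def extract_entities(text, dictionary=DICTIONARY):
    """Return (matched text, entity type, start, end) for each dictionary hit."""
    pattern = build_pattern(dictionary)
    return [
        (m.group(1), dictionary[m.group(1).lower()], m.start(), m.end())
        for m in pattern.finditer(text)
    ]


if __name__ == "__main__":
    # Maintenance records are often terse and upper-case, hence IGNORECASE.
    for entity in extract_entities("REPLACED FUEL PUMP DUE TO LEAK"):
        print(entity)
```

A lookup of this kind is transparent to domain experts, since every extracted entity traces back to a dictionary entry, but it misses misspellings and abbreviations that are absent from the dictionary, which is where rules or a language model could add coverage.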