Although several open-source language processing pipelines are available for Hungarian, none of them satisfies the requirements of today's NLP applications. A language processing pipeline should provide close to state-of-the-art lemmatization, morphosyntactic analysis, entity recognition, and word embeddings. Industrial text processing applications also have to satisfy non-functional software quality requirements; moreover, frameworks supporting multiple languages are increasingly favored. This paper introduces HuSpaCy, an industry-ready Hungarian language processing pipeline. The presented tool provides components for the most important basic linguistic analysis tasks. It is open-source and is available under a permissive license. Our system is built upon spaCy's NLP components, which means that it is fast, has a rich ecosystem of NLP applications and extensions, and comes with extensive documentation and a well-known API. Besides an overview of the underlying models, we also present a rigorous evaluation on common benchmark datasets. Our experiments confirm that HuSpaCy has high accuracy in all subtasks while maintaining resource-efficient prediction capabilities.