Understanding how job requirements evolve is increasingly important for workers, companies, and public organizations that need to keep pace with the rapid transformation of the employment market. Fortunately, recent natural language processing (NLP) approaches make it possible to automatically extract information from job ads and to recognize skills more precisely. However, these approaches require a large amount of annotated data from the studied domain, which is difficult to access, mainly because of intellectual property restrictions. This article proposes a new public dataset, FIJO, containing insurance job offers with many soft skill annotations. To convey the potential of this dataset, we detail some of its characteristics and limitations. We then present the results of skill detection algorithms using a named entity recognition approach and show that transformer-based models achieve good token-wise performance on this dataset. Lastly, we analyze some errors made by our best model to highlight the difficulties that may arise when applying NLP approaches.
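To make the notion of token-wise performance concrete, the following is a minimal sketch of how token-level precision, recall, and F1 could be computed when skill detection is framed as named entity recognition. It is an illustration under stated assumptions, not the evaluation code used for FIJO: the label names `SOFT` and `HARD`, the function name `token_wise_scores`, and the choice of micro-averaging are all hypothetical.

```python
from collections import Counter


def token_wise_scores(gold, pred, outside="O"):
    """Micro-averaged token-level precision, recall and F1.

    gold, pred: lists of tag sequences, one list of tags per sentence.
    Tokens tagged `outside` in both sequences are ignored (true negatives).
    """
    counts = Counter()
    for gold_seq, pred_seq in zip(gold, pred):
        if len(gold_seq) != len(pred_seq):
            raise ValueError("gold and predicted sequences must be aligned")
        for g, p in zip(gold_seq, pred_seq):
            if g == outside and p == outside:
                continue
            if g == p:
                counts["tp"] += 1  # skill token labeled correctly
            else:
                if p != outside:
                    counts["fp"] += 1  # predicted a skill label that is wrong
                if g != outside:
                    counts["fn"] += 1  # missed or mislabeled a gold skill token

    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: "SOFT" and "HARD" are placeholders, not the FIJO scheme.
gold_tags = [["O", "SOFT", "SOFT", "O"], ["HARD", "O"]]
pred_tags = [["O", "SOFT", "O", "O"], ["HARD", "HARD"]]
scores = token_wise_scores(gold_tags, pred_tags)
print(scores)
```

Scoring each token independently rewards partially recovered skill spans, which is one reason token-wise figures can look better than span-level (exact match) figures on the same predictions.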