Job titles are the most fundamental building blocks for occupational data mining tasks, such as Career Modelling and Job Recommendation. However, there is no publicly available dataset to support such efforts. In this work, we present the Industrial and Professional Occupations Dataset (IPOD), a comprehensive corpus of over 190,000 job titles crawled from over 56,000 LinkedIn profiles. To the best of our knowledge, IPOD is the first dataset released for industrial occupations mining. We use a knowledge-based approach for sequence tagging, creating a gazetteer of domain-specific named entities tagged by 3 experts. The named-entity (NE) tags for all titles are populated from the gazetteer using the BIOES scheme. Finally, we develop 4 baseline models for the NER task on this dataset: Linear Regression, CRF, LSTM, and the state-of-the-art bi-directional LSTM-CRF. Both CRF and LSTM-CRF outperform humans in both exact-match accuracy and F1 score.
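To illustrate how a gazetteer can populate BIOES tags for a job title, the following is a minimal sketch, not the procedure actually used to build IPOD. It assumes greedy longest-match lookup over lowercased, whitespace-separated tokens. The gazetteer entries, the labels `RES` and `FUN`, and the function name `bioes_tag` are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical gazetteer: lowercased token tuples mapped to illustrative labels.
# Neither the entries nor the label names are taken from the IPOD release.
GAZETTEER: Dict[Tuple[str, ...], str] = {
    ("vice", "president"): "RES",
    ("manager",): "RES",
    ("software", "engineering"): "FUN",
    ("sales",): "FUN",
}


def bioes_tag(tokens: List[str], gazetteer: Dict[Tuple[str, ...], str]) -> List[str]:
    """Tag tokens with BIOES labels using greedy longest-match gazetteer lookup."""
    lowered = [t.lower() for t in tokens]
    max_len = max((len(key) for key in gazetteer), default=0)
    tags = ["O"] * len(tokens)  # tokens not covered by any entry stay Outside
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shrink it.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = gazetteer.get(tuple(lowered[i:i + n]))
            if label is None:
                continue
            if n == 1:
                tags[i] = f"S-{label}"  # Single-token entity
            else:
                tags[i] = f"B-{label}"  # Begin
                for j in range(i + 1, i + n - 1):
                    tags[j] = f"I-{label}"  # Inside
                tags[i + n - 1] = f"E-{label}"  # End
            i += n
            matched = True
            break
        if not matched:
            i += 1
    return tags


if __name__ == "__main__":
    title = "Vice President of Software Engineering".split()
    print(list(zip(title, bioes_tag(title, GAZETTEER))))
```

Under these assumptions, a multi-token entity receives `B-`/`I-`/`E-` tags, a single-token entity receives an `S-` tag, and any token absent from the gazetteer is tagged `O`.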