The ability to automatically identify industry sector coverage in articles on legal developments, or in news articles of any kind, can bring plenty of benefits to both readers and content creators. With articles tagged by industry coverage, readers around the world could find legal news that is specific to their region and professional industry. At the same time, writers would learn which industries lack coverage and which industries readers are currently most interested in, and could therefore focus their writing efforts on more inclusive and relevant legal news coverage. This paper investigates a Machine Learning-powered industry analysis approach that combines Natural Language Processing (NLP) with statistical and Machine Learning (ML) techniques. A dataset of over 1,700 annotated legal articles was created for the identification of six industry sectors. Text-based and legal-based features were extracted from the articles. Both traditional ML methods (e.g. gradient boosting machine algorithms and decision-tree-based algorithms) and deep neural networks (e.g. transformer models) were applied to compare the performance of the predictive models. The system achieved promising results, with area under the receiver operating characteristic curve (AUC) scores above 0.90 and F-scores above 0.81 across the six industry sectors. The experimental results show that the suggested automated industry analysis, which employs ML techniques, allows large collections of text data to be processed in an easy, efficient, and scalable way. Traditional ML methods performed better than deep neural networks when only a small, domain-specific training dataset was available for the study.
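To make the two reported metrics concrete, the following minimal sketch computes a per-sector area under the ROC curve and F-score in plain Python. It is an illustration only: the sector names, labels, and scores are invented toy values (not data or results from the paper), the 0.5 decision threshold is an assumption, and the helper names `roc_auc`, `f1_score`, and `evaluate` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def evaluate(
    data: Dict[str, Tuple[List[int], List[float]]], threshold: float = 0.5
) -> Dict[str, Dict[str, float]]:
    """Score each sector independently, as in a one-vs-rest tagging setup."""
    results = {}
    for sector, (y_true, scores) in data.items():
        # Assumed decision rule: tag the article if the score reaches the threshold.
        y_pred = [1 if s >= threshold else 0 for s in scores]
        results[sector] = {
            "auc": roc_auc(y_true, scores),
            "f1": f1_score(y_true, y_pred),
        }
    return results


# Hypothetical toy data: (true labels, classifier scores) for five articles.
toy = {
    "energy": ([1, 0, 1, 0, 1], [0.9, 0.2, 0.7, 0.4, 0.3]),
    "finance": ([0, 1, 0, 1, 0], [0.1, 0.8, 0.3, 0.6, 0.2]),
}
results = evaluate(toy)
for sector, metrics in results.items():
    print(sector, round(metrics["auc"], 3), round(metrics["f1"], 3))
```

The AUC is threshold-free and depends only on how the scores rank articles, whereas the F-score depends on the chosen threshold, which is one reason the two metrics are usually reported together.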