Machine learning research typically starts with a fixed data set created early in the process. The experiments then focus on finding a model and training procedure that yield the best possible performance on a selected evaluation metric. This paper explores how changes in a data set influence the measured performance of a model. Using three publicly available data sets from the legal domain, we investigate how changes to their size, the train/test splits, and the accuracy of the human labelling impact the performance of a trained deep learning classifier. We assess the overall (weighted average) performance as well as the per-class performance. The observed effects are surprisingly pronounced, especially when the per-class performance is considered. We investigate how the "semantic homogeneity" of a class, i.e., the proximity of its sentences in a semantic embedding space, influences the difficulty of its classification. The presented results have far-reaching implications for efforts related to data collection and curation in the field of AI & Law. The results also indicate that enhancements to a data set could be considered, alongside the advancement of ML models, as an additional path toward increasing classification performance on various tasks in AI & Law. Finally, we discuss the need for an established methodology to assess the potential effects of data set properties.
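The notion of "semantic homogeneity" can be made concrete. A minimal sketch, assuming homogeneity is operationalized as the mean pairwise cosine similarity among a class's sentence embeddings; the abstract does not fix a specific formula, and the choice of sentence encoder is left open:

```python
import numpy as np

def semantic_homogeneity(embeddings: np.ndarray) -> float:
    """Mean pairwise cosine similarity among the sentence embeddings of
    one class; values near 1.0 suggest a semantically tight class."""
    # Normalise rows to unit length so that dot products equal cosines.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T                # n x n cosine-similarity matrix
    n = embeddings.shape[0]
    pairs = sims[np.triu_indices(n, k=1)]   # unique off-diagonal pairs only
    return float(pairs.mean())

# Hypothetical usage: `class_embeddings` would come from any sentence
# encoder applied to the sentences labelled with a given class.
class_embeddings = np.random.default_rng(0).normal(size=(50, 384))
print(semantic_homogeneity(class_embeddings))
```

Under this reading, a lower score would indicate a more heterogeneous class, which the paper's results suggest is harder to classify.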