This paper describes the creation and analysis of a new dataset for data mining and text analytics research, contributing to a joint Leeds University research project for the Corpus of National Dialects. It investigates machine learning classifiers for classifying samples of French dialect text from various French-speaking countries. Following the steps of the CRISP-DM methodology, it covers the data collection process, data quality issues, and the conversion of the data for text analysis. Finally, after suitable data mining techniques are applied, it discusses the evaluation methods, the best-performing features and classifiers overall, and the conclusions.