In the mining industry, many reports are generated during the project management process. These past documents are a rich source of knowledge for future success. However, retrieving the necessary information is tedious and challenging when the documents are unorganized and unstructured. Document clustering is a powerful approach to this problem, and many methods have been introduced in past studies. Nonetheless, there is no silver bullet that performs best for every type of document, so exploratory studies are required when applying clustering methods to new datasets. In this study, we investigate multiple topic modelling (TM) methods. The objectives are to find an appropriate approach for mining project reports, using the dataset of the Geological Survey of Queensland, Department of Resources, Queensland Government, and to understand the contents of the reports in order to get an idea of how to organise them. Three TM methods, Latent Dirichlet Allocation (LDA), Nonnegative Matrix Factorization (NMF), and Nonnegative Tensor Factorization (NTF), are compared statistically and qualitatively. After the evaluation, we conclude that LDA performs best for this dataset; however, the possibility remains that the other methods could be adopted with some improvements.