Empirical observations on the effects of data transformation in machine learning classification of geological domains

Source: arXiv. Keywords: compositional data, supervised learning, geological domain, likelihood estimation, classification performance, effects of data transformation. 10-page article, 2 figures, 7 tables.

A large body of work in the literature advocates the use of log-ratio transformations for multivariate statistical analysis of compositional data. In contrast, few studies have looked at how data transformation changes the efficacy of machine learning classifiers within geoscience. This letter presents experimental results and empirical observations to further explore this issue. The objective is to study the effects of data transformation on geozone classification performance when machine learning (ML) classifiers/estimators are trained using geochemical data. The training input consists of exploration hole assay samples obtained from a Pilbara iron-ore deposit in Western Australia, with geozone labels assigned according to stratigraphic unit and the absence or presence and type of mineralization. The ML techniques considered are multinomial logistic regression, Gaussian naïve Bayes, k-nearest neighbors (kNN), linear support vector classifier, RBF-SVM, gradient boosting (GB) and extreme GB, random forest (RF) and multi-layer perceptron (MLP). The transformations examined include isometric log-ratio (ILR), centered log-ratio (CLR) coupled with principal component analysis (PCA) or independent component analysis (ICA), and a manifold learning approach based on locally linear embedding (LLE). The results reveal that different ML classifiers exhibit varying sensitivity to these transformations, with some transformations clearly more advantageous or deleterious than others. Overall, the best performing candidate is ILR, which is unsurprising considering the compositional nature of the data. The pairwise log-ratio (PWLR) transformation performs better than ILR for ensemble and tree-based learners such as boosting and RF, but worse for MLP, SVM and other classifiers.
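For readers unfamiliar with the log-ratio transformations named in the abstract, the following is a minimal NumPy sketch of CLR, ILR and PWLR. It is not the authors' implementation. The function names, the sample values and the particular orthonormal basis used for ILR (a sequential, Helmert-type basis) are assumptions made for illustration; the abstract does not state which ILR basis was used. The sketch also assumes strictly positive inputs, so any zero-replacement step for assay values is left out.

```python
import numpy as np

def clr(x):
    """Centered log-ratio: log of each part minus the row mean of the logs."""
    logx = np.log(x)
    return logx - logx.mean(axis=1, keepdims=True)

def ilr(x):
    """Isometric log-ratio under an assumed sequential orthonormal basis.

    Coordinate i contrasts the geometric mean of the first i parts
    with part i+1, giving D-1 coordinates for D parts.
    """
    n, d = x.shape
    logx = np.log(x)
    out = np.empty((n, d - 1))
    for i in range(1, d):
        log_gmean = logx[:, :i].mean(axis=1)  # log geometric mean of first i parts
        out[:, i - 1] = np.sqrt(i / (i + 1)) * (log_gmean - logx[:, i])
    return out

def pwlr(x):
    """Pairwise log-ratios log(x_i / x_j) for every pair i < j."""
    logx = np.log(x)
    i, j = np.triu_indices(x.shape[1], k=1)
    return logx[:, i] - logx[:, j]

# Hypothetical three-part compositions (rows sum to 1); not data from the paper.
sample = np.array([[0.60, 0.30, 0.10],
                   [0.20, 0.50, 0.30]])
clr_coords = clr(sample)    # shape (2, 3), each row sums to zero
ilr_coords = ilr(sample)    # shape (2, 2)
pwlr_coords = pwlr(sample)  # shape (2, 3): one column per pair of parts
```

Note the difference in dimensionality: for D parts, CLR yields D coordinates that sum to zero, ILR yields D-1 orthonormal coordinates, and PWLR yields D(D-1)/2 redundant coordinates. Each PWLR coordinate is a simple contrast between two raw parts, which may be relevant to the abstract's observation that tree-based learners, which split on one feature at a time, respond differently to PWLR than to ILR.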
