A data-based comparative review and AI-driven symbolic model for longitudinal dispersion coefficient in natural streams

From arXiv. Due to the limitation "The abstract field cannot be longer than 1,920 characters", the abstract here is shorter than that in the PDF file.

A better understanding of dispersion in natural streams requires knowledge of the longitudinal dispersion coefficient (LDC). Various methods have been proposed for predicting the LDC. These studies can be grouped into three types: analytical, statistical, and ML-driven (implicit and explicit). However, a comprehensive evaluation of them is still lacking. In this paper, we first present an in-depth analysis of these methods and identify their shortcomings. The analysis is carried out on an extensive database of 660 samples of hydraulic and channel properties collected worldwide. The reliability and representativeness of the data are enhanced by using Subset Selection of Maximum Dissimilarity (SSMD) for test set selection and the interquartile range (IQR) for outlier removal. The evaluation ranks the methods as follows: ML-driven methods > statistical methods > analytical methods. Whereas implicit ML-driven methods are black boxes by nature, explicit ML-driven methods have more potential for LDC prediction. In addition, overfitting is a universal problem in existing models, which also suffer from a fixed parameter combination. To establish an interpretable model for LDC prediction with higher performance, we then design a novel symbolic regression method called the evolutionary symbolic regression network (ESRN), which combines genetic algorithms and neural networks. Strategies are introduced to avoid overfitting and to explore more parameter combinations. Results show that the ESRN model outperforms other existing symbolic models. The proposed model is suitable for practical engineering problems because it requires few input parameters (only w and U* are required). It can provide convincing solutions for situations where field tests cannot be carried out or only limited field information can be obtained.
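The abstract names two data-preparation steps, IQR-based outlier removal and maximum-dissimilarity test set selection, without giving their details. The sketch below illustrates the general idea of each in plain Python. It is not the authors' implementation: the function names, the 1.5 multiplier (the conventional Tukey fence), the Euclidean distance, and the greedy max-min selection rule are all assumptions made for illustration.

```python
import math
import statistics


def iqr_filter(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR].

    k=1.5 is the conventional Tukey fence, assumed here; the paper's
    exact threshold is not stated in the abstract.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if low <= v <= high]


def max_dissimilarity_subset(points, n_select, start=0):
    """Greedily choose n_select indices that are mutually far apart.

    Hypothetical stand-in for SSMD: each step adds the unchosen point
    whose distance to its nearest already-chosen point is largest.
    """
    if not 1 <= n_select <= len(points):
        raise ValueError("n_select must be between 1 and len(points)")
    chosen = [start]
    remaining = set(range(len(points))) - {start}
    # Distance from every point to its nearest chosen point so far.
    nearest = [math.dist(p, points[start]) for p in points]
    while len(chosen) < n_select:
        # Ties are broken by the lowest index to keep the result deterministic.
        nxt = max(sorted(remaining), key=lambda i: nearest[i])
        chosen.append(nxt)
        remaining.remove(nxt)
        for i in remaining:
            nearest[i] = min(nearest[i], math.dist(points[i], points[nxt]))
    return chosen
```

For example, `iqr_filter([1, 2, 3, 4, 5, 100])` drops the value 100, and `max_dissimilarity_subset` applied to a table of scaled hydraulic features returns the row indices that could serve as a spread-out test set. In practice the features would need to be put on a common scale first, since Euclidean distance is otherwise dominated by the variable with the largest range.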
