A novel combination of two widely used clustering algorithms is proposed here for the detection and reduction of high data density regions. The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm is used to detect high data density regions, and the k-means algorithm to reduce them. The proposed algorithm iterates while successively decrementing the DBSCAN search radius, allowing for an adaptive reduction factor based on the effective data density. The algorithm is demonstrated for a physics simulation application, in which a neural-network surrogate model for fusion reactor plasma turbulence is generated. The training dataset for the surrogate model is created with a quasilinear gyrokinetics code for turbulent transport calculations in fusion plasmas. The training set consists of model inputs derived from a repository of experimental measurements, which risks over-representing specific regions of the input parameter space. By applying the proposed reduction algorithm to this dataset, this study demonstrates that the training dataset can be reduced by a factor of ~20 without a noticeable loss in surrogate model accuracy. This reduction provides a novel way of analyzing existing high-dimensional datasets for biases and consequently mitigating them, which lowers the cost of re-populating that parameter space with higher quality data.
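The following is a minimal sketch of the described iterative scheme, assuming scikit-learn's DBSCAN and KMeans implementations. Every parameter name and value here (eps0, eps_decay, keep_fraction, min_samples, n_iter) is illustrative rather than taken from the paper, and the choice to replace each detected dense cluster with k-means centroids is one plausible reading of "k-means for reduction", not the authors' confirmed implementation.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

def density_reduce(X, eps0=1.0, eps_decay=0.9, n_iter=10,
                   min_samples=10, keep_fraction=0.05, seed=0):
    """Iteratively detect high-density regions with DBSCAN and
    replace each detected cluster with k-means centroids.

    Hypothetical parameters, not the paper's actual settings:
    eps0/eps_decay control the shrinking DBSCAN search radius;
    keep_fraction sets the per-cluster reduction factor, which
    thereby adapts to the local (effective) data density.
    """
    for _ in range(n_iter):
        eps = eps0
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        kept = [X[labels == -1]]  # low-density "noise" points are kept as-is
        for lbl in set(labels) - {-1}:
            cluster = X[labels == lbl]
            # Denser (larger) clusters are reduced more aggressively:
            k = max(1, int(keep_fraction * len(cluster)))
            km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(cluster)
            kept.append(km.cluster_centers_)
        X = np.vstack(kept)
        eps0 *= eps_decay  # successively decrement the search radius
    return X
```

A multiplicative decay of the search radius is assumed here; the paper's "successively decrementing" could equally be implemented as a fixed subtractive step, and the loop could instead terminate once DBSCAN finds no clusters above the density threshold.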