Data-driven surrogate models are increasingly adopted to accelerate vehicle design. However, open-source multi-fidelity datasets and empirical guidelines linking dataset size to model performance remain limited. This study investigates the relationship between training data size and prediction accuracy for a graph neural network (GNN) based surrogate model for aerodynamic field prediction. We release an open-source, multi-fidelity aerodynamic dataset for double-delta wings comprising 2448 flow snapshots across 272 geometries, evaluated at angles of attack from 11° to 19° at Mach 0.3 using both Vortex Lattice Method (VLM) and Reynolds-averaged Navier-Stokes (RANS) solvers. The geometries are generated with a nested Saltelli sampling scheme to support future dataset expansion and variance-based sensitivity analysis. Using this dataset, we conduct a preliminary empirical scaling study of the MF-VortexNet surrogate, constructing six training sets ranging from 40 to 1280 snapshots and training models with 0.1 to 2.4 million parameters under a fixed training budget. We find that the test error decreases with data size following a power law with exponent -0.6122, indicating efficient data utilization. Based on this scaling law, we estimate an optimal sampling density of approximately eight samples per dimension in a d-dimensional design space. The results also suggest improved data utilization efficiency for larger surrogate models, implying a potential trade-off between dataset generation cost and model training budget.
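The reported scaling behavior (test error decreasing with dataset size as a power law) can be estimated from (size, error) pairs by a least-squares fit in log-log space. The sketch below is illustrative only: the error values are synthetic, generated from the paper's reported exponent of -0.6122 over the paper's 40 to 1280 snapshot range, not the actual study measurements.

```python
import numpy as np

def fit_power_law(sizes, errors):
    """Fit error ~ C * N**alpha via a degree-1 least-squares fit
    in log-log space; returns (alpha, C)."""
    log_n = np.log(np.asarray(sizes, dtype=float))
    log_e = np.log(np.asarray(errors, dtype=float))
    alpha, log_c = np.polyfit(log_n, log_e, 1)  # slope is the exponent
    return alpha, np.exp(log_c)

# Hypothetical sizes spanning the study's 40-1280 range; synthetic
# errors constructed to follow the reported exponent of -0.6122.
sizes = np.array([40, 80, 160, 320, 640, 1280])
errors = 2.0 * sizes ** -0.6122

alpha, c = fit_power_law(sizes, errors)
print(f"fitted exponent: {alpha:.4f}")  # recovers -0.6122 on this clean data
```

On noisy empirical measurements the fitted slope would carry uncertainty, so scaling studies typically report it alongside the goodness of fit of the log-log regression.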