CNN-based surrogates have become prevalent in scientific applications as replacements for conventional, time-consuming physical approaches. Although these surrogates can yield satisfactory results at significantly lower computational cost when trained on small datasets, our benchmarking results show that data-loading overhead becomes the major performance bottleneck when training surrogates on large datasets. In practice, surrogates are usually trained on high-resolution scientific data, which can easily reach the terabyte scale. Several state-of-the-art data loaders have been proposed to improve loading throughput in general CNN training; however, they are sub-optimal for surrogate training. In this work, we propose SOLAR, a surrogate data loader that can substantially increase loading throughput during training. It leverages three key observations from our benchmarking and contains three novel designs. Specifically, SOLAR first generates a pre-determined shuffled index list and accordingly optimizes the global access order and the buffer eviction scheme to maximize data reuse and the buffer hit rate. It then trades lightweight computational imbalance for reduced heavyweight data-loading imbalance to speed up overall training. Finally, it optimizes its data access pattern with HDF5 to achieve better parallel I/O throughput. Our evaluation with three scientific surrogates and 32 GPUs shows that SOLAR achieves up to a 24.4X speedup over the PyTorch Data Loader and a 3.52X speedup over state-of-the-art data loaders.
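The core idea behind the first design can be illustrated with a minimal sketch: because the shuffled index list is pre-determined, the loader knows every future access at eviction time, so it can evict the buffered sample whose next use is farthest away (a Belady-style policy). This is an illustrative simulation, not SOLAR's actual implementation; the function names and the in-memory buffer model are assumptions for the example.

```python
def next_use_map(order):
    """For each position in the access order, the next position
    at which the same sample index is requested again."""
    nxt = {}
    result = [None] * len(order)
    for pos in range(len(order) - 1, -1, -1):
        idx = order[pos]
        result[pos] = nxt.get(idx, float("inf"))  # inf = never used again
        nxt[idx] = pos
    return result


def simulate_buffer(order, capacity):
    """Replay a pre-determined shuffled access order against a buffer
    of `capacity` samples, evicting the entry with the farthest next
    use. Returns the number of buffer hits (accesses served without
    re-loading from storage)."""
    nu = next_use_map(order)
    buf = {}  # sample index -> position of its next use
    hits = 0
    for pos, idx in enumerate(order):
        if idx in buf:
            hits += 1
        elif len(buf) >= capacity:
            # Evict the sample needed farthest in the future.
            victim = max(buf, key=buf.get)
            del buf[victim]
        buf[idx] = nu[pos]  # refresh this sample's next-use position
    return hits
```

For example, `simulate_buffer([0, 1, 0, 2, 0, 3], capacity=2)` serves two of the six accesses from the buffer; a policy without knowledge of the future order cannot do better on this trace.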