We consider the problem of distilling optimal network topologies for collective communications. We provide an algorithmic framework for constructing direct-connect topologies optimized for the latency-bandwidth tradeoff given a collective communication workload. Our algorithmic framework allows us to start from small base topologies and associated communication schedules and use a set of techniques that can be iteratively applied to derive much larger topologies and associated schedules. Our approach allows us to synthesize many different topologies and schedules for a given cluster size and degree constraint, and then identify the optimal topology for a given workload. We provide an analytical-model-based evaluation of the derived topologies and results on a small-scale optical testbed that uses patch panels for configuring a topology for the duration of an application's execution. We show that the derived topologies and schedules provide significant performance benefits over existing collective communications implementations.
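To make the latency-bandwidth tradeoff concrete, the following is a minimal sketch that uses the textbook alpha-beta cost model for two well-known allreduce schedules, ring and recursive doubling. It is not the analytical model or the synthesis procedure of this work. The function names, the candidate table, and the parameter values are illustrative assumptions; `alpha` stands for per-step latency in seconds and `bandwidth` for per-link bandwidth in bytes per second.

```python
# Illustrative only: textbook alpha-beta costs, not the paper's own model.

def _check_power_of_two(n):
    # Recursive doubling in this simple form assumes n is a power of two.
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two and at least 2")


def ring_allreduce_time(n, msg_bytes, alpha, bandwidth):
    """Ring allreduce: 2(n-1) steps, each moving msg_bytes / n per node."""
    steps = 2 * (n - 1)
    return steps * alpha + steps * (msg_bytes / n) / bandwidth


def recursive_doubling_allreduce_time(n, msg_bytes, alpha, bandwidth):
    """Recursive doubling: log2(n) steps, each moving the full message."""
    _check_power_of_two(n)
    steps = n.bit_length() - 1  # exact log2 for a power of two
    return steps * alpha + steps * msg_bytes / bandwidth


# Hypothetical candidate set; a real search would cover many more
# topology and schedule pairs for the same cluster size and degree.
CANDIDATES = {
    "ring": ring_allreduce_time,
    "recursive_doubling": recursive_doubling_allreduce_time,
}


def best_candidate(n, msg_bytes, alpha, bandwidth):
    """Return the name of the cheapest candidate and all modelled costs."""
    costs = {
        name: cost_fn(n, msg_bytes, alpha, bandwidth)
        for name, cost_fn in CANDIDATES.items()
    }
    return min(costs, key=costs.get), costs


if __name__ == "__main__":
    # Assumed parameters: 16 nodes, 1 microsecond latency, 1 GB/s links.
    for size in (1024, 10**9):
        winner, costs = best_candidate(16, size, alpha=1e-6, bandwidth=1e9)
        print(size, winner, costs)
```

Under these assumed parameters, the small message is dominated by the per-step latency term and favors the schedule with fewer steps, while the large message is dominated by the bandwidth term and favors the ring. This is why the best topology and schedule depend on the workload.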