High-Performance Computing (HPC) clusters are made up of a variety of node types (usually compute, I/O, service, and GPGPU nodes), and applications do not use nodes of different types in the same way. The resulting communication patterns reflect the organization of groups of nodes, and current routing algorithms that are optimal for all-to-all patterns will not always maximize performance for group-specific communications. Since application communication patterns are rarely available beforehand, we choose to rely on node types as a good guess for node usage. We provide a description of node type heterogeneity and analyse the performance degradation caused by an unlucky distribution of nodes of the same type. We provide an extension to routing algorithms for Parallel Generalized Fat-Tree topologies (PGFTs) that balances load amongst groups of nodes of the same type. We show that it removes these performance issues by comparing its results in a variety of situations against the corresponding classical algorithms.
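To make the "unlucky distribution" problem concrete, here is a toy sketch. It is not the extension described above; it only illustrates the kind of imbalance involved. It assumes a single leaf switch with `k` up-ports and a D-mod-K-style rule (up-port = destination id mod `k`), which is a common classical fat-tree routing choice used here as an assumption. When nodes of one type happen to sit at ids that share a residue mod `k`, all traffic aimed at that group leaves through the same up-port. The hypothetical `type_aware_ports` helper instead ranks each node among nodes of its own type and applies the modulo to that rank. This is one simple way to spread each group evenly.

```python
from collections import Counter


def dmodk_port(dest: int, num_up_ports: int) -> int:
    """Classical destination-modulo selection (assumed model): up-port = dest mod k."""
    return dest % num_up_ports


def type_aware_ports(node_types: dict[int, str], num_up_ports: int) -> dict[int, int]:
    """Hypothetical type-aware variant: rank each node among nodes of the same
    type (ascending id) and pick up-port = rank mod k."""
    ports: dict[int, int] = {}
    rank_in_type: Counter = Counter()
    for node in sorted(node_types):
        node_type = node_types[node]
        ports[node] = rank_in_type[node_type] % num_up_ports
        rank_in_type[node_type] += 1
    return ports


def port_load(dests, port_of) -> Counter:
    """Count how many destinations are routed through each up-port."""
    return Counter(port_of(d) for d in dests)


if __name__ == "__main__":
    k = 4
    # 16 nodes; every 4th one is a GPGPU node, an "unlucky" placement for k = 4.
    node_types = {n: ("gpgpu" if n % 4 == 0 else "compute") for n in range(16)}
    gpu_nodes = [n for n, t in node_types.items() if t == "gpgpu"]

    classic = port_load(gpu_nodes, lambda d: dmodk_port(d, k))
    aware_map = type_aware_ports(node_types, k)
    aware = port_load(gpu_nodes, aware_map.__getitem__)
    print("D-mod-K:", dict(classic))     # all GPGPU traffic on one up-port
    print("type-aware:", dict(aware))    # one GPGPU destination per up-port
```

Under the modulo rule, the four GPGPU destinations (ids 0, 4, 8, 12) all map to up-port 0. The type-aware ranking places one on each of the four up-ports, and the twelve compute nodes still get three per port. Real PGFT routing applies a selection at every level of the tree, so this single-switch model only hints at the effect.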