Mapping communication-intensive workloads to distributed systems requires complex problem partitioning and dataset pre-processing. With the current AI-driven trend toward thousands of interconnected processors per chip, there is an opportunity to re-think these communication-bottlenecked workloads. The bottleneck often arises from data-structure traversals, which cause irregular memory access patterns and poor cache locality. Recent works have introduced task-based parallelization schemes to accelerate graph traversal and other sparse workloads. Among these, Dalorex demonstrated high scalability by keeping the entire dataset on-chip, scattered across processing units (PUs), and executing each task at the PU where its data is local. However, the communication demands of this approach do not scale to system sizes beyond 10k cores, and it remains unclear how to handle larger datasets or achieve a cost-efficient design for production. To address these challenges, we propose a throughput-aware scalable chiplet architecture for distributed execution (Tascade), a multi-node system design that we evaluate with up to 256 distributed chips, totaling one million PUs. We introduce a programming model that scales to this level through proxy regions and selective cascading, which reduce communication needs and improve load balancing. In addition, package-time reconfiguration of our large-scale chip design enables building chip products that optimize for different target metrics, such as time-to-solution, energy, or cost. We evaluate six applications and four datasets across several configurations and memory technologies to provide a detailed analysis of the performance, power, and cost of data-local execution at scale. Our parallelization of Breadth-First Search on the RMAT-26 dataset across a million PUs, the largest reported in the literature, reaches 3021 GTEPS.