Motivated by the end of Moore's Law and Dennard scaling, which make architectural efficiency the necessary means of improving capability for the next decade or two, this paper introduces a new data-rich paradigm of chip design for the semiconductor industry. The goal is to monitor chip hardware behavior in the field, at real-time speeds with no slowdowns and minimal power overheads, and to obtain insights into chip behavior and workloads. We posit that such extensive amounts of data would enable better and more capable architectures by addressing three problems: obfuscated hardware, obfuscated software, and the inability to A/B test hardware ideas. This paper implements the first version of the paradigm with a system architecture and the concept of an analYtics Processing Unit (YPU). We perform four case studies and implement an RTL-level prototype. Across the case studies, we show that a YPU with area overhead $<1\%$ at 7nm and overall power consumption of $<25$ mW is able to produce previously inconceivable analyses: per-instruction cycle stacks of arbitrary programs, evaluation of instruction prefetchers in the wild before deployment, fine-grained cycle-by-cycle utilization of hardware modules, and histograms of tensor-value distributions of DL models.