LLM inference is increasingly memory-bound, and HBM cost per GB dominates system cost. Current HBM stacks include short on-die ECC, which tightens binning, raises price, and fixes the reliability policy inside the device. This paper asks whether a system can tolerate a much higher raw HBM bit error rate (BER) and still preserve end-to-end correctness and throughput, without changing the HBM PHY or the fixed 32 B transaction size. We propose REACH, a controller-managed ECC design that keeps the HBM link and 32 B transfers unchanged. REACH uses a two-level Reed-Solomon scheme: each 32 B chunk uses an inner code to check and correct most faults locally, while chunks that cannot be fixed are marked as erasures. An outer code spans kilobytes and runs in erasure-only mode, repairing only flagged chunks and avoiding the expensive error-locator step. For small random writes, REACH updates the outer parity differentially to avoid recomputing parity over the whole span, and an optional importance-adaptive bit-plane policy can protect only critical fields, such as BF16 exponents, to reduce ECC work and traffic. On three LLMs at 8K context, REACH retains about 79 percent of on-die ECC throughput at zero BER and remains qualified up to a raw BER of 1e-3, extending tolerable device error rates by about three orders of magnitude while keeping tokens per second nearly flat. In ASAP7, a full REACH controller occupies 15.2 mm² and consumes 17.5 W at 3.56 TB/s, and it reduces ECC area by 11.6× and power by about 60 percent compared to a naive long Reed-Solomon baseline. By moving strong ECC into the controller, REACH turns long-code reliability into a system choice that can enable lower-cost HBM under the same standard interface.
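
The differential parity update mentioned in the abstract relies on the linearity of the outer code: when one chunk changes, each parity chunk changes by the encoded difference between the old and new data, so the other chunks in the span need not be read. The following Python sketch illustrates that idea only. It assumes a Vandermonde-style linear parity over GF(2^8) as a stand-in for the outer Reed-Solomon code; the field polynomial 0x11d, the coefficient layout, and the function names `full_parity` and `differential_update` are illustrative assumptions, not details taken from REACH.

```python
# Sketch of differential parity for a linear code over GF(2^8).
# Assumption: parity chunk j is P_j = sum_i alpha^(i*j) * D_i, computed
# byte-wise across 32 B chunks. This is not REACH's actual code construction.

# Log/antilog tables for GF(2^8) with primitive polynomial 0x11d (assumed).
EXP = [0] * 512
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D
for _i in range(255, 512):
    EXP[_i] = EXP[_i - 255]


def gf_mul(a, b):
    """Multiply two GF(2^8) elements using the log tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def coeff(chunk_idx, parity_idx):
    """Coefficient alpha^(chunk_idx * parity_idx) applied to one data chunk."""
    return EXP[(chunk_idx * parity_idx) % 255]


def full_parity(chunks, n_parity):
    """Recompute every parity chunk from the whole span (the costly path)."""
    size = len(chunks[0])
    parity = [bytearray(size) for _ in range(n_parity)]
    for i, chunk in enumerate(chunks):
        for j in range(n_parity):
            c = coeff(i, j)
            for k, byte in enumerate(chunk):
                parity[j][k] ^= gf_mul(c, byte)
    return parity


def differential_update(parity, idx, old_chunk, new_chunk):
    """Update parity in place using only the old and new value of one chunk."""
    delta = bytes(a ^ b for a, b in zip(old_chunk, new_chunk))
    for j, p in enumerate(parity):
        c = coeff(idx, j)
        for k, d in enumerate(delta):
            p[k] ^= gf_mul(c, d)
```

Under these assumptions, a write to a single 32 B chunk touches that chunk and the parity chunks only, and the result matches a full recomputation over the span because addition in GF(2^8) is XOR and multiplication distributes over it.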


