Optical Transformers

The rapidly increasing size of deep-learning models has caused renewed and growing interest in alternatives to digital computers to dramatically reduce the energy cost of running state-of-the-art neural networks. Optical matrix-vector multipliers are best suited to performing computations with very large operands, which suggests that large Transformer models could be a good target for optical computing. To test this idea, we performed small-scale optical experiments with a prototype accelerator to demonstrate that Transformer operations can run on optical hardware despite noise and errors. Using simulations, validated by our experiments, we then explored the energy efficiency of optical implementations of Transformers and identified scaling laws for model performance with respect to optical energy usage. We found that the optical energy per multiply-accumulate (MAC) scales as $\frac{1}{d}$ where $d$ is the Transformer width, an asymptotic advantage over digital systems. We conclude that with well-engineered, large-scale optical hardware, it may be possible to achieve a $100 \times$ energy-efficiency advantage for running some of the largest current Transformer models, and that if both the models and the optical hardware are scaled to the quadrillion-parameter regime, optical computers could have a $>8,000\times$ energy-efficiency advantage over state-of-the-art digital-electronic processors that achieve 300 fJ/MAC. We analyzed how these results motivate and inform the construction of future optical accelerators along with optics-amenable deep-learning approaches. With assumptions about future improvements to electronics and Transformer quantization techniques (5$\times$ cheaper memory access, double the digital--analog conversion efficiency, and 4-bit precision), we estimated that optical computers' advantage against current 300-fJ/MAC digital processors could grow to $>100,000\times$.
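The $\frac{1}{d}$ scaling can be illustrated with a toy calculation. The sketch below assumes, purely for illustration, that each length-$d$ dot product needs a fixed optical energy at the detector regardless of $d$, so the cost is shared among $d$ MACs. The value `fj_per_output` is a hypothetical placeholder, not a figure from the paper, and the model ignores the electronic costs (memory access, digital-analog conversion) that the full estimates include, so it is not expected to reproduce the reported advantage factors. Only the 300 fJ/MAC digital baseline is taken from the text.

```python
# Toy model of optical energy per MAC versus Transformer width d.
# Assumption (hypothetical): one dot-product output costs a fixed optical
# energy, independent of the vector length d.

DIGITAL_FJ_PER_MAC = 300.0  # digital baseline quoted in the abstract


def optical_fj_per_mac(d: int, fj_per_output: float = 1000.0) -> float:
    """Optical energy per MAC in fJ for a length-d dot product.

    fj_per_output is an assumed placeholder for the optical energy needed
    to read out one dot-product result; it is shared by d MACs.
    """
    if d <= 0:
        raise ValueError("d must be a positive integer")
    return fj_per_output / d


def optical_advantage(d: int, fj_per_output: float = 1000.0) -> float:
    """Ratio of the digital baseline to the toy optical energy per MAC."""
    return DIGITAL_FJ_PER_MAC / optical_fj_per_mac(d, fj_per_output)


if __name__ == "__main__":
    for d in (1024, 4096, 16384):
        print(f"d={d:6d}  optical fJ/MAC={optical_fj_per_mac(d):.4f}  "
              f"ratio vs digital={optical_advantage(d):.1f}")
```

Under this assumption, doubling the width halves the optical energy per MAC, whereas the digital cost per MAC stays roughly constant; this is the sense in which the advantage is asymptotic.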
