The scaling hypothesis motivates the expansion of models past trillions of parameters as a path towards better performance. Recent significant developments, such as GPT-3, have been driven by this conjecture. However, as models scale up, training them efficiently with backpropagation becomes difficult. Because model, pipeline, and data parallelism distribute parameters and gradients over compute nodes, communication is challenging to orchestrate: this is a bottleneck to further scaling. In this work, we argue that alternative training methods can mitigate these issues, and can inform the design of extreme-scale training hardware. Indeed, using a synaptically asymmetric method with a parallelizable backward pass, such as Direct Feedback Alignment, communication needs are drastically reduced. We present a photonic accelerator for Direct Feedback Alignment, able to compute random projections with trillions of parameters. We demonstrate our system on benchmark tasks, using both fully-connected and graph convolutional networks. Our hardware is the first architecture-agnostic photonic co-processor for training neural networks. This is a significant step towards building scalable hardware able to go beyond backpropagation, opening new avenues for deep learning.
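To make the claim about a parallelizable backward pass concrete, the sketch below is a minimal NumPy illustration of the Direct Feedback Alignment update rule for a small fully-connected network. It is not the paper's implementation or its photonic system; the network sizes, activation choice, and helper names (`forward`, `dfa_updates`, `B`) are assumptions made for illustration. The key point it demonstrates is that each hidden layer's error signal is a fixed random projection of the global output error, so all layer updates can be computed independently once the output error is known.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy fully-connected network: input -> hidden -> hidden -> output.
sizes = [32, 64, 64, 10]
W = [rng.normal(0, 0.1, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]

# DFA: each hidden layer gets its own FIXED random feedback matrix B_i
# that projects the output error directly to that layer. At extreme scale,
# such random projections are the operation a photonic co-processor can
# perform optically.
B = [rng.normal(0, 0.1, (m, sizes[-1])) for m in sizes[1:-1]]


def tanh(x):
    return np.tanh(x)


def dtanh(x):
    return 1.0 - np.tanh(x) ** 2


def forward(x):
    """Return pre-activations and activations of every layer."""
    a, zs, acts = x, [], [x]
    for i, Wi in enumerate(W):
        z = Wi @ a
        a = z if i == len(W) - 1 else tanh(z)  # linear output layer
        zs.append(z)
        acts.append(a)
    return zs, acts


def dfa_updates(x, y):
    """Per-layer weight gradients under DFA (MSE loss).

    Unlike backpropagation, the error signal for hidden layer i is
    B[i] @ e, which depends only on the global output error e, not on
    downstream weights -- so every layer update can be computed in
    parallel once e is available.
    """
    zs, acts = forward(x)
    e = acts[-1] - y  # output error
    grads = []
    for i in range(len(W)):
        if i == len(W) - 1:
            delta = e  # output layer uses the true error
        else:
            delta = (B[i] @ e) * dtanh(zs[i])
        grads.append(np.outer(delta, acts[i]))
    return grads


# One illustrative training step.
x = rng.normal(size=sizes[0])
y = np.zeros(sizes[-1])
y[3] = 1.0
lr = 0.05
for Wi, g in zip(W, dfa_updates(x, y)):
    Wi -= lr * g
```

In this toy setting the random projections `B[i] @ e` are trivial matrix-vector products; the point of the accelerator described in the abstract is to perform such projections at scales (trillions of parameters) where storing and multiplying the feedback matrices electronically would itself become a bottleneck.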