With an ever-growing number of parameters defining increasingly complex networks, Deep Learning has led to several breakthroughs surpassing human performance. As a result, data movement for these millions of model parameters causes a growing imbalance known as the memory wall. Neuromorphic computing is an emerging paradigm that confronts this imbalance by performing computations directly in analog memories. On the software side, the sequential Backpropagation algorithm hinders efficient parallelization and thus fast convergence. A novel method, Direct Feedback Alignment, resolves the inherent layer dependencies by passing the error directly from the output to each layer. At the intersection of hardware/software co-design, there is a demand for algorithms that are tolerant of hardware nonidealities. Therefore, this work explores the interplay of implementing bio-plausible learning in-situ on neuromorphic hardware with energy, area, and latency constraints. Using the benchmarking framework DNN+NeuroSim, we investigate the impact of hardware nonidealities and quantization on algorithm performance, as well as how network topologies and algorithm-level design choices scale the latency, energy, and area consumption of a chip. To the best of our knowledge, this work is the first to compare the impact of different learning algorithms on Compute-In-Memory-based hardware and vice versa. The best accuracy results remain Backpropagation-based, notably in the presence of hardware imperfections. Direct Feedback Alignment, on the other hand, allows for significant speedup through parallelization, reducing training time by a factor approaching N for N-layered networks.
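To make the contrast concrete, the sketch below illustrates how Direct Feedback Alignment replaces the layer-by-layer error chain of Backpropagation with fixed random feedback matrices that project the output error directly onto each hidden layer, so the per-layer weight updates can be computed in parallel. This is a minimal NumPy illustration under assumed layer sizes and naming; it is not the configuration or code used with DNN+NeuroSim in this work.

```python
import numpy as np

# Minimal sketch of Direct Feedback Alignment (DFA) for a 2-hidden-layer MLP.
# All shapes, initializations, and names are illustrative assumptions.

rng = np.random.default_rng(0)

def relu(x):      return np.maximum(x, 0.0)
def relu_grad(x): return (x > 0).astype(x.dtype)

n_in, n_h1, n_h2, n_out = 784, 256, 128, 10
W1 = rng.normal(0, 0.05, (n_in, n_h1))
W2 = rng.normal(0, 0.05, (n_h1, n_h2))
W3 = rng.normal(0, 0.05, (n_h2, n_out))

# Fixed random feedback matrices that carry the output error straight
# to each hidden layer -- the defining ingredient of DFA.
B1 = rng.normal(0, 0.05, (n_out, n_h1))
B2 = rng.normal(0, 0.05, (n_out, n_h2))

def forward(x):
    a1 = x @ W1;  h1 = relu(a1)
    a2 = h1 @ W2; h2 = relu(a2)
    y  = h2 @ W3               # linear readout (e.g. pre-softmax logits)
    return a1, h1, a2, h2, y

def dfa_updates(x, target, lr=1e-3):
    a1, h1, a2, h2, y = forward(x)
    e = y - target             # output error
    # Hidden-layer error signals come directly from e via the fixed B_i,
    # so dW1 and dW2 need no backward chain and can be computed in parallel.
    d2 = (e @ B2) * relu_grad(a2)
    d1 = (e @ B1) * relu_grad(a1)
    dW3 = h2.T @ e
    dW2 = h1.T @ d2
    dW1 = x.T  @ d1
    return [lr * dW1, lr * dW2, lr * dW3]
```

Because none of the hidden-layer error signals depends on downstream weight updates, the N layer updates can in principle be evaluated concurrently, which is the source of the near-N-fold training speedup mentioned above.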