Deep learning (DL) has transformed applications across a variety of domains, including computer vision, natural language processing, and tabular data analysis. The pursuit of higher DL model accuracy has led practitioners to explore increasingly large neural architectures, with some recent Transformer models spanning hundreds of billions of learnable parameters. Such designs introduce new scale-driven systems challenges for DL, including memory bottlenecks, poor runtime efficiency, and high model development costs. Efforts to address these issues have explored techniques such as parallelizing neural architectures, spilling data across the memory hierarchy, and memory-efficient data representations. This survey explores the large-model training systems landscape, highlighting the key challenges and the various techniques that have been used to address them.