Natural Language Processing research has recently been dominated by large-scale Transformer models. Although they achieve state-of-the-art results on many important language tasks, Transformers often require expensive compute resources and days to weeks of training time. This is feasible for researchers at big tech companies and leading research universities, but not for scrappy start-up founders, students, and independent researchers. Stephen Merity's SHA-RNN, a compact hybrid attention-RNN model, is designed for consumer-grade modeling: it requires significantly fewer parameters and less training time to reach near state-of-the-art results. We analyze Merity's model here through an exploratory analysis of several units of the architecture, considering both training time and overall quality in our assessment. Ultimately, we combine these findings into a new architecture, which we call SHAQ: the Single Headed Attention Quasi-recurrent Neural Network. With our new architecture we achieved accuracy similar to the SHA-RNN while obtaining a 4x speedup in training.