The latest Transformers tutorial from UIUC.

Contents:
- Transformer architecture
  - Attention models
  - Implementation details
- Transformer-based language models
  - BERT
  - GPT
  - Other models
- Applications of Transformers in vision

Related content

The Transformer, proposed in Google's paper "Attention Is All You Need", is a translation architecture based entirely on attention.

Attention is an increasingly popular mechanism used in a wide range of neural architectures. Because the field has developed so rapidly, a systematic overview of attention is still lacking. In this paper, we define a unified model of attention architectures for natural language processing, focusing on architectures designed for vector representations of textual data. We discuss different aspects of previous work and the possible uses of attention mechanisms, and describe the major research efforts and open challenges in this area.

https://web.eecs.umich.edu/~justincj/slides/eecs498/FA2020/598_FA2020_lecture13.pdf
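To make the unified formulation above concrete, here is a minimal sketch of attention as compatibility scoring, a softmax, and a weighted sum over value vectors; the tensor names and shapes are illustrative assumptions, not taken from the survey or the slides.

```python
# Minimal attention sketch: compatibility scores -> softmax -> weighted sum of values.
# Names and shapes are illustrative assumptions.
import torch

def attention(query, keys, values):
    """query: (d,); keys, values: (n, d). Returns one context vector of shape (d,)."""
    scores = keys @ query                    # compatibility scores, shape (n,)
    weights = torch.softmax(scores, dim=0)   # attention distribution over the n positions
    return weights @ values                  # weighted sum of the value vectors

# Toy usage: attend over 5 token representations of dimension 8.
keys = values = torch.randn(5, 8)
context = attention(torch.randn(8), keys, values)
```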

The goal of text ranking is to generate an ordered list of texts retrieved from a corpus in response to a query. Although the most common formulation of text ranking is search, instances of the task can also be found in many natural language processing applications. This survey provides an overview of text ranking with neural network architectures known as transformers, of which BERT is the best-known example. The combination of transformers and self-supervised pretraining has, without exaggeration, revolutionized the fields of natural language processing (NLP), information retrieval (IR), and beyond. In this survey, we provide a synthesis of existing work as a single point of entry for practitioners who wish to gain a better understanding of how to apply transformers to text ranking problems and researchers who wish to pursue work in this area. We cover a wide range of modern techniques, grouped into two high-level categories: transformer models that perform reranking in multi-stage ranking architectures and learned dense representations that attempt to perform ranking directly. There are two themes that pervade our survey: techniques for handling long documents, beyond the typical sentence-by-sentence processing approaches used in NLP, and techniques for addressing the tradeoff between effectiveness (result quality) and efficiency (query latency). Although transformer architectures and pretraining techniques are recent innovations, many aspects of how they are applied to text ranking are relatively well understood and represent mature techniques. However, there remain many open research questions, and thus in addition to laying out the foundations of pretrained transformers for text ranking, this survey also attempts to prognosticate where the field is heading.
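As a rough illustration of the survey's two high-level categories, the sketch below contrasts learned dense representations (query and documents encoded independently, ranked by inner product) with cross-encoder reranking of a small candidate list. The encoders are stand-in linear layers and all shapes are assumptions, not any specific model from the survey.

```python
import torch

# Stand-ins for pretrained transformer encoders; hypothetical, not a real library API.
encode = torch.nn.Linear(16, 8)                 # bi-encoder: text features -> dense embedding
cross_encoder = torch.nn.Linear(16 + 16, 1)     # cross-encoder: joint (query, doc) -> relevance score

query_feats = torch.randn(16)                   # pretend query representation
doc_feats = torch.randn(100, 16)                # pretend corpus of 100 documents

# (1) Learned dense representations: encode independently, rank by inner product.
scores = encode(doc_feats) @ encode(query_feats)        # (100,) one score per document
top_ids = scores.topk(10).indices                       # cheap first-stage candidate list

# (2) Multi-stage reranking: rescore only the candidates with the joint model,
#     trading extra query latency for effectiveness.
joint = torch.cat([query_feats.expand(10, -1), doc_feats[top_ids]], dim=-1)
rerank_scores = cross_encoder(joint).squeeze(-1)        # (10,)
```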

Pre-trained Models for Natural Language Processing

Paper: the latest 2020 survey of pre-trained language models from Fudan University, with a taxonomy of 50+ PTMs, a 25-page PDF, and 205 references.

Pre-trained models have achieved broad success in natural language processing. This report covers four main parts: 1) an introduction to the principles of pre-trained models, including model architectures, learning objectives, and their development history; 2) transfer methods for pre-trained models, including how task reformulation, multi-step transfer, and improved fine-tuning can further boost their performance on downstream tasks; 3) improved variants of pre-trained models, including knowledge-enriched, multimodal, multilingual, language-specific, domain-specific, and compressed models; and 4) an outlook on pre-trained models and their future development.
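As one concrete illustration of the fine-tuning transfer method mentioned above, here is a minimal sketch using the Hugging Face transformers library; the checkpoint name, toy data, and hyperparameters are placeholder assumptions, not choices made in the report.

```python
# Sketch of fine-tuning a pretrained model on a downstream classification task.
# Checkpoint name, toy data, and hyperparameters are placeholder assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["a great movie", "a terrible movie"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)   # small LR, typical for fine-tuning
model.train()
out = model(**batch, labels=labels)   # pretrained encoder + newly initialized classification head
out.loss.backward()
optimizer.step()
```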

The Transformer was proposed in the paper "Attention is All You Need" and is now the reference model recommended for Google Cloud TPUs. It has been described as "the first model to rely entirely on attention for feature extraction, completely discarding the recurrence of RNNs and the convolution of CNNs." This article gives a brief introduction to the Transformer model.
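For orientation, here is a minimal sketch of a single Transformer encoder block (self-attention followed by a position-wise feed-forward network, each with a residual connection and layer normalization); the dimensions and the use of PyTorch's built-in multi-head attention are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Minimal Transformer encoder block sketch: self-attention + feed-forward,
    each wrapped in a residual connection and layer normalization."""
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)       # attention is the only feature extractor
        x = self.norm1(x + attn_out)
        return self.norm2(x + self.ff(x))

x = torch.randn(2, 10, 64)                     # batch of 2 sequences of 10 tokens
y = EncoderBlock()(x)                          # (2, 10, 64)
```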

We explore deep autoregressive Transformer models in language modeling for speech recognition. We focus on two aspects. First, we revisit Transformer model configurations specifically for language modeling. We show that well-configured Transformer models outperform our baseline models based on a shallow stack of LSTM recurrent neural network layers. We carry out experiments on the open-source LibriSpeech 960hr task, for both 200K vocabulary word-level and 10K byte-pair encoding subword-level language modeling. We apply our word-level models to conventional hybrid speech recognition by lattice rescoring, and the subword-level models to attention-based encoder-decoder models by shallow fusion. Second, we show that deep Transformer language models do not require positional encoding. Positional encoding is an essential augmentation for the self-attention mechanism, which is invariant to sequence ordering. However, in the autoregressive setup, as is the case for language modeling, the amount of information increases along the position dimension, which is a positional signal in its own right. An analysis of attention weights shows that deep autoregressive self-attention models can automatically make use of such positional information. We find that removing the positional encoding even slightly improves the performance of these models.
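For reference, the positional encoding being discussed is, in the original Transformer, the sinusoidal scheme sketched below (a learned embedding table is another common choice); the paper's finding is that an autoregressive language model can simply omit this addition.

```python
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    """Original Transformer sinusoidal encoding: PE[pos, 2i] = sin(pos / 10000^(2i/d)),
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d)). Added to token embeddings before layer 1."""
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)   # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float)            # even embedding dimensions
    angles = pos / torch.pow(10000.0, i / d_model)                # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

# In the autoregressive setup above, simply omitting this addition is reported
# to work slightly better.
embeddings = torch.randn(10, 64) + sinusoidal_positional_encoding(10, 64)
```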

Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling. We propose a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the context fragmentation problem. As a result, Transformer-XL learns dependency that is 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformers during evaluation. Notably, we improve the state-of-the-art results of bpc/perplexity to 0.99 on enwiki8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on Penn Treebank (without finetuning). When trained only on WikiText-103, Transformer-XL manages to generate reasonably coherent, novel text articles with thousands of tokens. Our code, pretrained models, and hyperparameters are available in both Tensorflow and PyTorch.
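A schematic sketch of the segment-level recurrence idea, assuming single-head attention and omitting causal masking and the relative positional encoding scheme for brevity; it is meant to show how cached, gradient-detached states from the previous segment extend the context, not to reproduce the paper's implementation.

```python
import torch

def segment_attention(x, memory):
    """Schematic Transformer-XL-style segment-level recurrence for one attention layer.
    x: (seg_len, d) current segment; memory: (mem_len, d) hidden states cached from the
    previous segment. The cache extends the attendable context but is detached from the
    graph, so gradients never flow across segment boundaries."""
    context = torch.cat([memory.detach(), x], dim=0)        # (mem_len + seg_len, d)
    scores = x @ context.T / (x.size(-1) ** 0.5)            # queries only from the current segment
    h = torch.softmax(scores, dim=-1) @ context             # (seg_len, d)
    return h, h.detach()                                    # output, and the cache for the next segment

mem = torch.zeros(0, 32)                                    # empty cache before the first segment
for segment in torch.randn(4, 16, 32):                      # 4 consecutive segments of 16 tokens
    out, mem = segment_attention(segment, mem)
```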

Recurrent neural networks (RNNs) sequentially process data by updating their state with each new data point, and have long been the de facto choice for sequence modeling tasks. However, their inherently sequential computation makes them slow to train. Feed-forward and convolutional architectures have recently been shown to achieve superior results on some sequence modeling tasks such as machine translation, with the added advantage that they concurrently process all inputs in the sequence, leading to easy parallelization and faster training times. Despite these successes, however, popular feed-forward sequence models like the Transformer fail to generalize in many simple tasks that recurrent models handle with ease, e.g. copying strings or even simple logical inference when the string or formula lengths exceed those observed at training time. We propose the Universal Transformer (UT), a parallel-in-time self-attentive recurrent sequence model which can be cast as a generalization of the Transformer model and which addresses these issues. UTs combine the parallelizability and global receptive field of feed-forward sequence models like the Transformer with the recurrent inductive bias of RNNs. We also add a dynamic per-position halting mechanism and find that it improves accuracy on several tasks. In contrast to the standard Transformer, under certain assumptions, UTs can be shown to be Turing-complete. Our experiments show that UTs outperform standard Transformers on a wide range of algorithmic and language understanding tasks, including the challenging LAMBADA language modeling task where UTs achieve a new state of the art, and machine translation where UTs achieve a 0.9 BLEU improvement over Transformers on the WMT14 En-De dataset.
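A schematic sketch of the "recurrence in depth" idea behind the Universal Transformer: one weight-tied block applied repeatedly over a fixed number of steps, parallel over positions. The dynamic per-position halting mechanism (ACT) is deliberately left out, and the sizes and step count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class UTStyleEncoder(nn.Module):
    """Weight-tied block applied repeatedly over depth (recurrence in depth,
    parallel in time). Dynamic halting is replaced by a fixed step count."""
    def __init__(self, d_model=64, n_heads=4, d_ff=256, steps=6):
        super().__init__()
        self.steps = steps
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):                       # x: (batch, seq_len, d_model)
        for _ in range(self.steps):             # the same parameters at every depth step
            a, _ = self.attn(x, x, x)
            x = self.norm1(x + a)
            x = self.norm2(x + self.ff(x))
        return x

y = UTStyleEncoder()(torch.randn(2, 10, 64))    # (2, 10, 64)
```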

Music relies heavily on repetition to build structure and meaning. Self-reference occurs on multiple timescales, from motifs to phrases to reusing of entire sections of music, such as in pieces with ABA structure. The Transformer (Vaswani et al., 2017), a sequence model based on self-attention, has achieved compelling results in many generation tasks that require maintaining long-range coherence. This suggests that self-attention might also be well-suited to modeling music. In musical composition and performance, however, relative timing is critically important. Existing approaches for representing relative positional information in the Transformer modulate attention based on pairwise distance (Shaw et al., 2018). This is impractical for long sequences such as musical compositions since their memory complexity for intermediate relative information is quadratic in the sequence length. We propose an algorithm that reduces their intermediate memory requirement to linear in the sequence length. This enables us to demonstrate that a Transformer with our modified relative attention mechanism can generate minute-long compositions (thousands of steps, four times the length modeled in Oore et al., 2018) with compelling structure, generate continuations that coherently elaborate on a given motif, and in a seq2seq setup generate accompaniments conditioned on melodies. We evaluate the Transformer with our relative attention mechanism on two datasets, JSB Chorales and Piano-e-Competition, and obtain state-of-the-art results on the latter.
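The memory saving comes from computing the query-relative-embedding product directly as an L x L matrix and then "skewing" it into alignment with absolute positions, instead of materializing an L x L x d tensor of relative embeddings. The sketch below shows that step under assumed shapes and an assumed ordering of the relative embeddings.

```python
import torch
import torch.nn.functional as F

def relative_attention_bias(q, rel_emb):
    """Memory-efficient relative attention term (schematic). rel_emb[k] is assumed to be
    the embedding for relative distance k - (L-1), i.e. distances -(L-1)..0 for causal
    attention. Computes q @ rel_emb^T, shape (L, L), then realigns it so that entry
    (i, j) holds q_i . e_{j-i}, avoiding the O(L^2 d) intermediate tensor."""
    L = q.size(0)
    qe = q @ rel_emb.T                      # (L, L): qe[i, k] = q_i . e_{k-(L-1)}
    padded = F.pad(qe, (1, 0))              # prepend a dummy column -> (L, L+1)
    skewed = padded.reshape(L + 1, L)[1:]   # reshape and drop the first row -> (L, L)
    return skewed                           # entries with j > i are invalid but get masked

L, d = 6, 8
q = torch.randn(L, d)
rel_emb = torch.randn(L, d)
bias = relative_attention_bias(q, rel_emb)  # added to q @ k^T before the softmax
```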

Related papers

- Pretrained Transformers for Text Ranking: BERT and Beyond. Jimmy Lin, Rodrigo Nogueira, Andrew Yates. October 13, 2020.
- Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, Tie-Yan Liu. February 12, 2020.
- Seongjun Yun, Minbyul Jeong, Raehyun Kim, Jaewoo Kang, Hyunwoo J. Kim. February 5, 2020.
- Language Modeling with Deep Transformers. Kazuki Irie, Albert Zeyer, Ralf Schlüter, Hermann Ney. July 11, 2019.
- Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov. June 2, 2019.
- Universal Transformers. Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, Łukasz Kaiser. March 5, 2019.
- Star-Transformer. Qipeng Guo, Xipeng Qiu, Pengfei Liu, Yunfan Shao, Xiangyang Xue, Zheng Zhang. February 28, 2019.
- The Evolved Transformer. David R. So, Chen Liang, Quoc V. Le. January 30, 2019.
- Music Transformer. Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Ian Simon, Curtis Hawthorne, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, Douglas Eck. December 12, 2018.
- Jian Li, Zhaopeng Tu, Baosong Yang, Michael R. Lyu, Tong Zhang. October 24, 2018.