Sequence-to-sequence models are a powerful workhorse of NLP. Most variants employ a softmax transformation in both their attention mechanism and output layer, leading to dense alignments and strictly positive output probabilities. This density is wasteful, making models less interpretable and assigning probability mass to many implausible outputs. In this paper, we propose sparse sequence-to-sequence models, rooted in a new family of $\alpha$-entmax transformations, which includes softmax and sparsemax as particular cases, and is sparse for any $\alpha > 1$. We provide fast algorithms to evaluate these transformations and their gradients, which scale well for large vocabulary sizes. Our models are able to produce sparse alignments and to assign nonzero probability to a short list of plausible outputs, sometimes rendering beam search exact. Experiments on morphological inflection and machine translation reveal consistent gains over dense models.
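As a concrete illustration of the $\alpha$-entmax family, the sketch below computes $\alpha$-entmax numerically by bisecting on a threshold $\tau$, under the assumption that the output takes the clipped-power form $[(\alpha-1)z_j - \tau]_+^{1/(\alpha-1)}$, normalized to sum to 1 (a property of this family; $\alpha = 2$ yields sparsemax and $\alpha \to 1$ recovers softmax). The function name, default $\alpha = 1.5$, and fixed iteration count are illustrative only; this is a generic numerical sketch, not the fast exact algorithms referred to in the abstract.

```python
import numpy as np

def entmax_bisect(z, alpha=1.5, n_iter=50):
    """Sketch: alpha-entmax via bisection on the threshold tau, assuming the
    solution has the form [(alpha-1)*z - tau]_+^(1/(alpha-1)), summing to 1.
    Valid for alpha > 1; alpha = 2 corresponds to sparsemax."""
    z = (alpha - 1.0) * np.asarray(z, dtype=float)
    # The (decreasing) sum of clipped powers is >= 1 at tau_lo and 0 at tau_hi.
    tau_lo, tau_hi = z.max() - 1.0, z.max()
    for _ in range(n_iter):
        tau = 0.5 * (tau_lo + tau_hi)
        p = np.clip(z - tau, 0.0, None) ** (1.0 / (alpha - 1.0))
        if p.sum() >= 1.0:
            tau_lo = tau  # threshold too low: mass still exceeds 1
        else:
            tau_hi = tau
    p = np.clip(z - tau_lo, 0.0, None) ** (1.0 / (alpha - 1.0))
    return p / p.sum()  # absorb residual bisection error

# Example: low-scoring entries receive exactly zero probability.
print(entmax_bisect([2.0, 1.5, 0.1, -1.0], alpha=1.5))
```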