Conditional set generation learns a mapping from an input sequence of tokens to a set. Several NLP tasks, such as entity typing and dialogue emotion tagging, are instances of set generation. Sequence-to-sequence~(Seq2seq) models are a popular choice for modeling set generation, but they treat a set as a sequence and do not fully leverage its key properties, namely order-invariance and cardinality. We propose a novel algorithm for effectively sampling informative orders over the combinatorial space of label orders. Further, we jointly model the set cardinality and output by prepending the set size as the first element of the target sequence, taking advantage of the autoregressive factorization used by Seq2seq models. Our method is a model-independent data augmentation approach that endows any Seq2seq model with the signals of order-invariance and cardinality. Training a Seq2seq model on this augmented data~(without any additional annotations) yields an average relative improvement of 20% on four benchmark datasets across models spanning BART-base, T5-xxl, and GPT-3.
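To make the augmentation concrete, below is a minimal sketch of how a label set could be turned into Seq2seq targets that carry both signals: the cardinality is emitted as the first token, and the labels follow in several sampled orders. The function name `augment_set_targets` is hypothetical, and plain random permutations stand in for the paper's informative-order sampling, which the abstract does not specify in detail.

```python
import random

def augment_set_targets(labels, num_orders=3, seed=0):
    """Turn a label set into multiple Seq2seq target strings.

    Each target starts with the set cardinality, so an autoregressive
    Seq2seq model jointly predicts the set size and its elements; the
    multiple orderings expose the model to order-invariance.

    NOTE: random permutations are a stand-in for the paper's
    informative-order sampling algorithm (not detailed in the abstract).
    """
    rng = random.Random(seed)
    labels = sorted(labels)  # fix a base order for reproducibility
    targets = []
    for _ in range(num_orders):
        order = labels[:]
        rng.shuffle(order)  # hypothetical stand-in for order sampling
        # Prepend the cardinality as the first output token.
        targets.append(f"{len(order)} " + " ".join(order))
    return targets

# Example: an entity-typing instance whose gold label set has 3 elements.
print(augment_set_targets({"person", "artist", "musician"}))
# e.g. ['3 musician artist person', '3 person musician artist', ...]
```

Since the augmentation only rewrites target strings, it requires no extra annotations and can be applied in front of any Seq2seq model, consistent with the model-independence claim above.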