Conditional set generation learns a mapping from an input sequence of tokens to a set. Several NLP tasks, such as entity typing and dialogue emotion tagging, are instances of set generation. Seq2Seq models, a popular choice for set generation, treat a set as a sequence and do not fully leverage its key properties, namely order-invariance and cardinality. We propose a novel algorithm for effectively sampling informative orders over the combinatorial space of label orders. We jointly model the set cardinality and output by prepending the set size to the output sequence and taking advantage of the autoregressive factorization used by Seq2Seq models. Our method is a model-independent data augmentation approach that endows any Seq2Seq model with the signals of order-invariance and cardinality. Training a Seq2Seq model on this augmented data (without any additional annotations) yields an average relative improvement of 20% on four benchmark datasets across various models: BART, T5, and GPT-3. Code to use SETAUG is available at: https://setgen.structgen.com.
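To make the augmentation concrete, below is a minimal, hypothetical sketch in Python of the two signals described above: each training target is prefixed with a cardinality token, and the label set is emitted in several orders. Plain random permutations stand in here for the paper's informative-order sampling, and all names (`augment_set_targets`, the `<k>` token format) are illustrative assumptions, not the released SETAUG API.

```python
import random

def augment_set_targets(source, labels, num_orders=3, seed=0):
    """Sketch of set-generation data augmentation for Seq2Seq training.

    For each example, emit several target sequences that (a) prepend the
    set cardinality and (b) list the labels in different orders, giving
    the model explicit cardinality and order-invariance signals. Random
    permutations are a placeholder for informative-order sampling.
    """
    rng = random.Random(seed)
    examples = []
    for _ in range(num_orders):
        order = labels[:]  # copy, then shuffle in place
        rng.shuffle(order)
        # "<k>" encodes the set size; labels are space-separated.
        target = f"<{len(labels)}> " + " ".join(order)
        examples.append((source, target))
    return examples

# Usage: each (source, target) pair can be fed to any Seq2Seq model
# such as BART or T5 without additional annotations.
pairs = augment_set_targets("He won the award.", ["person", "winner"])
# e.g. [("He won the award.", "<2> winner person"), ...]
```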