Sequential labeling is a fundamental NLP task, forming the backbone of many applications. Supervised learning of Seq2Seq models (like T5) has shown great success on these problems. However, there remains a significant disconnect between the training objectives of these models and the metrics and desiderata we care about in practical applications. For example, a practical sequence tagging application may want to optimize for a certain precision-recall trade-off (of the top-$k$ predictions), which is quite different from the standard objective of maximizing the likelihood of the gold labeled sequence. To bridge this gap, we propose GROOT, a simple yet effective framework for Generative Reward Optimization Of Text sequences. GROOT works by training a generative sequential labeling model to match the decoder output distribution with that of the (black-box) reward function. Using an iterative training regime, we first generate prediction candidates, then correct errors in them, and finally contrast those candidates (based on their reward values). As demonstrated via extensive experiments on four public benchmarks, GROOT significantly improves all reward metrics. Furthermore, GROOT also improves the overall decoder distribution, as evidenced by the quality gains of the top-$k$ candidates.
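To make the generate-correct-contrast loop concrete, the following is a minimal Python sketch of one GROOT-style training iteration. All names here (`generate`, `correct`, `reward_fn`) and the pairwise margin loss are illustrative assumptions for exposition, not the paper's actual implementation or objective.

```python
import random
from dataclasses import dataclass

@dataclass
class Candidate:
    tokens: list          # predicted label sequence
    log_prob: float       # decoder log-probability of the sequence
    reward: float = 0.0   # black-box reward, filled in below

def groot_iteration(generate, correct, reward_fn, inputs, k=4):
    """One generate -> correct -> contrast step over a batch of inputs (sketch)."""
    losses = []
    for x in inputs:
        # 1) Generate k candidate label sequences from the current decoder.
        cands = generate(x, k)
        # 2) Correct obvious errors (e.g. malformed tag spans) in each candidate
        #    so the reward is computed on well-formed sequences.
        cands = [correct(x, c) for c in cands]
        # 3) Score candidates with the black-box reward and contrast them:
        #    push decoder probability mass toward higher-reward candidates.
        for c in cands:
            c.reward = reward_fn(x, c.tokens)
        best = max(cands, key=lambda c: c.reward)
        worst = min(cands, key=lambda c: c.reward)
        # Illustrative pairwise margin loss; the paper's exact objective may differ.
        margin = best.reward - worst.reward
        losses.append(max(0.0, margin - (best.log_prob - worst.log_prob)))
    return sum(losses) / max(len(losses), 1)

# Toy usage with stand-in components, just to show the data flow.
if __name__ == "__main__":
    labels = ["O", "B-ENT", "I-ENT"]
    def generate(x, k):
        return [Candidate([random.choice(labels) for _ in x],
                          log_prob=random.uniform(-5.0, 0.0)) for _ in range(k)]
    def correct(x, c):
        # An I-ENT tag that does not continue an entity is demoted to O.
        fixed, prev = [], "O"
        for t in c.tokens:
            fixed.append("O" if (t == "I-ENT" and prev == "O") else t)
            prev = fixed[-1]
        return Candidate(fixed, c.log_prob)
    def reward_fn(x, tokens):
        return tokens.count("B-ENT") / len(tokens)  # stand-in reward function
    batch = [["John", "lives", "in", "Paris"], ["Acme", "Corp", "hired", "Ana"]]
    print("contrastive loss:", groot_iteration(generate, correct, reward_fn, batch))
```

The key design point illustrated is that the reward is treated as a black box: only its values on the generated candidates are used, so any application metric (e.g. a precision-recall trade-off) can drive the contrastive signal without being differentiable.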