A crucial issue with current text generation models is that they often uncontrollably generate text that is factually inconsistent with their inputs. Limited by the lack of annotated data, existing works on evaluating factual consistency directly transfer the reasoning ability of models trained on other data-rich upstream tasks, such as question answering (QA) and natural language inference (NLI), without any further adaptation. As a result, they perform poorly on real generated text and are heavily biased by their single-source upstream tasks. To alleviate this problem, we propose a weakly supervised framework that aggregates multiple resources to train a precise and efficient factual metric, namely WeCheck. WeCheck first utilizes a generative model to accurately label a real generated sample by aggregating its weak labels, which are inferred from multiple resources. Then, we train the target metric model with this weak supervision while taking noise into consideration. Comprehensive experiments on a variety of tasks demonstrate the strong performance of WeCheck, which achieves a 3.4\% absolute improvement over previous state-of-the-art methods on the TRUE benchmark on average.
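The pipeline described above (combine weak signals from several upstream metrics into one soft label, then train against that label while accounting for its noise) can be illustrated with a minimal sketch. This is a hypothetical simplification, not WeCheck's actual label model: a weighted average stands in for the generative aggregation step, and a soft-label cross-entropy stands in for the noise-aware training objective.

```python
import numpy as np


def aggregate_weak_labels(weak_scores, weights=None):
    """Combine per-sample scores from several weak checkers (e.g. QA- and
    NLI-based metrics) into one soft consistency label in [0, 1].

    A weighted average is a hypothetical stand-in for the generative
    label model described in the paper.
    """
    weak_scores = np.asarray(weak_scores, dtype=float)  # (n_samples, n_metrics)
    if weights is None:
        # Default: treat every weak metric as equally reliable.
        weights = np.full(weak_scores.shape[1], 1.0 / weak_scores.shape[1])
    return weak_scores @ weights


def soft_label_bce(pred, soft_label, eps=1e-7):
    """Binary cross-entropy against soft (probabilistic) labels, so the
    uncertainty of the weak supervision is propagated into training."""
    pred = np.clip(np.asarray(pred, dtype=float), eps, 1 - eps)
    return float(np.mean(-(soft_label * np.log(pred)
                           + (1 - soft_label) * np.log(1 - pred))))


# Example: two generated samples scored by three weak metrics.
scores = [[0.9, 0.8, 0.7],   # most metrics agree: likely consistent
          [0.2, 0.1, 0.4]]   # most metrics agree: likely inconsistent
labels = aggregate_weak_labels(scores)
loss = soft_label_bce([0.85, 0.15], labels)
```

In the full method, the aggregation step would itself be learned (modeling each weak source's accuracy and correlations) rather than fixed weights, and the metric model would be a neural checker rather than a fixed prediction vector.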