Long-form answers, consisting of multiple sentences, can provide nuanced and comprehensive responses to a broader set of questions. To better understand this complex and understudied task, we study the functional structure of long-form answers collected from three datasets: ELI5, WebGPT, and Natural Questions. Our main goal is to understand how humans organize information to craft complex answers. We develop an ontology of six sentence-level functional roles for long-form answers and annotate 3.9k sentences in 640 answer paragraphs. Different answer collection methods manifest in different discourse structures. We further analyze model-generated answers, finding that annotators agree less with each other when annotating model-generated answers than when annotating human-written answers. Our annotated data enables training a strong classifier that can be used for automatic analysis. We hope our work can inspire future research on discourse-level modeling and evaluation of long-form QA systems.