Most work on modeling conversation history in Conversational Question Answering (CQA) reports a single main result on a common CQA benchmark. While existing models show impressive results on CQA leaderboards, it remains unclear whether they are robust to shifts in setting (sometimes to more realistic ones), training data size (e.g., from large to small training sets), and domain. In this work, we design and conduct the first large-scale robustness study of history modeling approaches for CQA. We find that high benchmark scores do not necessarily translate to strong robustness, and that different methods can behave very differently across settings. Equipped with the insights from our study, we design a novel prompt-based history modeling approach and demonstrate its strong robustness across various settings. Our approach is inspired by existing methods that highlight historic answers in the passage. However, instead of highlighting by modifying the passage token embeddings, we add textual prompts directly into the passage text. Our approach is simple, easy to plug into practically any model, and highly effective, so we recommend it as a starting point for future model developers. We also hope that our study and insights will raise awareness of the importance of robustness-focused evaluation, in addition to obtaining high leaderboard scores, leading to better CQA systems.
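To make the idea concrete, here is a minimal sketch of prompt-based answer highlighting: answers from previous turns are marked with textual prompts inserted directly into the passage string, rather than by modifying token embeddings. The function name, marker strings, and span format are illustrative assumptions; the abstract does not specify the exact prompt text used.

```python
def add_history_prompts(
    passage: str,
    history_answer_spans: list[tuple[int, int]],
    open_prompt: str = "<answer> ",
    close_prompt: str = " </answer>",
) -> str:
    """Insert textual prompts around historic answer spans in the passage.

    history_answer_spans holds (start, end) character offsets of answers
    from previous conversation turns. The marker strings are hypothetical
    placeholders, not the paper's actual prompt format.
    """
    marked = passage
    # Insert from the rightmost span leftward so earlier offsets stay valid.
    for start, end in sorted(history_answer_spans, reverse=True):
        marked = marked[:start] + open_prompt + marked[start:end] + close_prompt + marked[end:]
    return marked


# Example: the previous turn's answer was "Sandy Hook" at offsets 29-39.
passage = "The storm made landfall near Sandy Hook in 2012."
print(add_history_prompts(passage, [(29, 39)]))
# -> The storm made landfall near <answer> Sandy Hook </answer> in 2012.
```

Because the highlighting lives entirely in the input text, such a scheme requires no architectural changes, which is what makes the approach easy to plug into practically any model.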