Unsupervised commonsense question answering is appealing since it does not rely on any labeled task data. Among existing work, a popular solution is to use pre-trained language models to score candidate choices directly, conditioned on the question or context. However, such scores from language models can be easily affected by irrelevant factors, such as word frequencies and sentence structures. These distracting factors may not only mislead the model into choosing the wrong answer but also make it oversensitive to lexical perturbations in the candidate answers. In this paper, we present a novel SEmantic-based Question Answering method (SEQA) for unsupervised commonsense question answering. Instead of directly scoring each answer choice, our method first generates a set of plausible answers with generative models (e.g., GPT-2), and then uses these plausible answers to select the correct choice by considering the semantic similarity between each plausible answer and each choice. We devise a simple, yet sound formalism for this idea and verify its effectiveness and robustness with extensive experiments. We evaluate the proposed method on four benchmark datasets, and it achieves the best results in the unsupervised setting. Moreover, when attacked by TextFooler with synonym replacement, SEQA suffers a much smaller performance drop than the baselines, indicating stronger robustness.
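To make the two-stage idea concrete, here is a minimal sketch of a generate-then-match pipeline of the kind the abstract describes. It is not the authors' implementation: the choice of GPT-2 for sampling, the `all-MiniLM-L6-v2` sentence-embedding model, the uniform averaging of similarities, and the helper `answer` are all illustrative assumptions.

```python
# Hypothetical sketch of the SEQA-style idea (not the authors' code):
# 1) sample plausible answers from a generative LM conditioned on the question,
# 2) pick the candidate choice most semantically similar to those samples.
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from sentence_transformers import SentenceTransformer, util

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
generator = GPT2LMHeadModel.from_pretrained("gpt2")
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def answer(question: str, choices: list[str], num_samples: int = 20) -> str:
    # Sample a set of plausible free-form answers conditioned on the question.
    inputs = tokenizer(question, return_tensors="pt")
    outputs = generator.generate(
        **inputs,
        do_sample=True,
        top_p=0.9,
        max_new_tokens=20,
        num_return_sequences=num_samples,
        pad_token_id=tokenizer.eos_token_id,
    )
    prompt_len = inputs["input_ids"].shape[1]
    samples = [
        tokenizer.decode(seq[prompt_len:], skip_special_tokens=True).strip()
        for seq in outputs
    ]

    # Score each choice by its average semantic similarity to the sampled
    # answers (a uniform average here for simplicity; a real system could
    # weight samples, e.g., by generation probability).
    sample_emb = embedder.encode(samples, convert_to_tensor=True)
    choice_emb = embedder.encode(choices, convert_to_tensor=True)
    sims = util.cos_sim(choice_emb, sample_emb)  # (num_choices, num_samples)
    return choices[int(sims.mean(dim=1).argmax())]
```

Because each choice is judged by its aggregate semantic closeness to many sampled answers rather than by a single language-model score, surface-level changes such as synonym substitutions in a choice move its embedding only slightly, which is consistent with the robustness behavior the abstract reports.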