Information retrieval (IR) aims to find the documents (e.g., snippets, passages, and articles) relevant to a given query at large scale. IR plays an important role in many tasks that need external knowledge, such as open-domain question answering and dialogue systems. In the past, search algorithms based on term matching were widely used. Recently, neural algorithms (termed neural retrievers) have gained more attention because they can mitigate the limitations of traditional methods. Despite their success, neural retrievers still face many challenges, e.g., they suffer when training data is scarce and fail to answer simple entity-centric questions. Furthermore, most existing neural retrievers are developed for pure-text queries, which prevents them from handling multi-modal queries (i.e., queries composed of a textual description and images). This proposal has two goals. First, we introduce methods that address the above issues of neural retrievers from three angles: new model architectures, IR-oriented pretraining tasks, and the generation of large-scale training data. Second, we identify future research directions and propose potential corresponding solutions.