Open-domain question answering (QA) requires finding relevant information in large corpora in order to produce accurate answers. This paper presents a novel approach, Generator-Retriever-Generator (GRG), that combines document retrieval techniques with a large language model (LLM). First, an LLM generates context-specific documents in response to a given question. In parallel, a dual-encoder network retrieves question-relevant documents from an external corpus. The generated and retrieved documents are then passed to a second LLM, which produces the final answer. By combining document retrieval with LLM-based generation, our approach addresses the challenges of open-domain QA, in particular generating informative and contextually relevant answers. GRG outperforms state-of-the-art generate-then-read and retrieve-then-read pipelines (GENREAD and RFiD), improving their performance by at least +5.2, +4.2, and +1.6 on the TriviaQA, NQ, and WebQ datasets, respectively. Our code, datasets, and checkpoints are publicly available.\footnote{\url{https://github.com/abdoelsayed2016/GRG}}
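To make the pipeline concrete, the following is a minimal Python sketch of the generator-retriever-generator flow described above. It is illustrative only: the Flan-T5 generator and reader, the MiniLM dual encoder, and the three-sentence toy corpus are stand-ins chosen for brevity, not the models or Wikipedia-scale corpus used in our experiments.

\begin{verbatim}
# Minimal sketch of the GRG pipeline. Model choices and the toy in-memory
# corpus are illustrative stand-ins, not the paper's experimental setup.
import torch
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util

# Step 1: an LLM that generates a context document conditioned on the question.
generator = pipeline("text2text-generation", model="google/flan-t5-base")

# Step 2: a dual-encoder that retrieves question-relevant documents.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
corpus = [
    "The Eiffel Tower was completed in 1889 for the Paris World's Fair.",
    "Mount Everest is the highest mountain above sea level.",
    "Gustave Eiffel's company designed and built the Eiffel Tower.",
]
corpus_emb = encoder.encode(corpus, convert_to_tensor=True)

# Step 3: a second LLM reads generated + retrieved documents and answers.
reader = pipeline("text2text-generation", model="google/flan-t5-base")


def grg_answer(question: str, top_k: int = 2) -> str:
    # Generate a context-specific document with the first LLM.
    generated_doc = generator(
        f"Generate a background document to answer: {question}",
        max_new_tokens=64,
    )[0]["generated_text"]

    # Retrieve the top-k corpus documents by dual-encoder similarity.
    q_emb = encoder.encode(question, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, corpus_emb)[0]
    top_ids = torch.topk(scores, k=top_k).indices.tolist()
    retrieved = [corpus[i] for i in top_ids]

    # Concatenate both evidence sources and let the reader produce the answer.
    context = " ".join([generated_doc] + retrieved)
    prompt = (
        "Answer the question using the context.\n"
        f"Context: {context}\nQuestion: {question}"
    )
    return reader(prompt, max_new_tokens=32)[0]["generated_text"]


print(grg_answer("Who built the Eiffel Tower?"))
\end{verbatim}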