Retrieval-Augmented Generation (RAG) systems are widely used across industries for querying closed-domain and in-house knowledge bases. However, evaluating these systems presents significant challenges: closed-domain data are private, and queries with verifiable ground truths are scarce. Moreover, there is a lack of analytical methods for diagnosing problematic modules and identifying failure types, such as those caused by knowledge deficits or robustness issues. To address these challenges, we introduce GRAMMAR (GRounded And Modular Methodology for Assessment of RAG), an evaluation framework comprising a grounded data generation process and an evaluation protocol that effectively pinpoints defective modules. Our validation experiments reveal that traditional reference-free evaluation methods often assess false generations inaccurately, tending toward overly optimistic scores. In contrast, GRAMMAR provides a reliable approach for identifying vulnerable modules and supports hypothesis testing for textual form vulnerabilities. An open-source tool accompanying this framework is available in our GitHub repository \url{https://github.com/xinzhel/grammar}, allowing for easy reproduction of our results and enabling reliable, modular evaluation in closed-domain settings.