Large Language Models (LLMs) promise to overcome limitations of rule-based mental health chatbots through more natural conversations. However, evaluating LLM-based mental health chatbots presents a significant challenge: Their probabilistic nature requires comprehensive testing to ensure therapeutic quality, yet conducting such evaluations with people with depression would impose an additional burden on vulnerable people and risk exposing them to potentially harmful content. Our paper presents an evaluation approach for LLM-based mental health chatbots that combines dialogue generation with artificial users and dialogue evaluation by psychotherapists. We developed artificial users based on patient vignettes, systematically varying characteristics such as depression severity, personality traits, and attitudes toward chatbots, and let them interact with a LLM-based behavioral activation chatbot. Ten psychotherapists evaluated 48 randomly selected dialogues using standardized rating scales to assess the quality of behavioral activation and its therapeutic capabilities. We found that while artificial users showed moderate authenticity, they enabled comprehensive testing across different users. In addition, the chatbot demonstrated promising capabilities in delivering behavioral activation and maintaining safety. Furthermore, we identified deficits, such as ensuring the appropriateness of the activity plan, which reveals necessary improvements for the chatbot. Our framework provides an effective method for evaluating LLM-based mental health chatbots while protecting vulnerable people during the evaluation process. Future research should improve the authenticity of artificial users and develop LLM-augmented evaluation tools to make psychotherapist evaluation more efficient, and thus further advance the evaluation of LLM-based mental health chatbots.
翻译:暂无翻译