Beyond the Rubric: Cultural Misalignment in LLM Benchmarks for Sexual and Reproductive Health

Large Language Models (LLMs) have been positioned as a way to expand access to health information in the Global South, yet their evaluation remains heavily dependent on benchmarks designed around Western norms. We present insights from a preliminary benchmarking exercise with a chatbot for sexual and reproductive health (SRH) built for an underserved community in India. We evaluated the chatbot using HealthBench, OpenAI's benchmark for conversational health models. We extracted 637 SRH queries from the dataset and evaluated the chatbot on the 330 of these that are single-turn conversations. HealthBench's rubric-based automated grader rated the responses consistently low. However, qualitative analysis by trained annotators and public health experts revealed that many responses were actually culturally appropriate and medically accurate. We highlight recurring issues, particularly a Western bias in legal framing and norms (e.g., breastfeeding in public), dietary assumptions (e.g., fish that are safe to eat during pregnancy), and costs (e.g., insurance models). Our findings demonstrate the limitations of current benchmarks in capturing the effectiveness of systems built for different cultural and healthcare contexts. We argue for the development of culturally adaptive evaluation frameworks that meet quality standards while recognizing the needs of diverse populations.

