The quality and capabilities of large language models cannot currently be fully assessed with automated benchmark evaluations. Instead, human evaluations that expand on traditional qualitative techniques from the natural language generation literature are required. One recent best practice is the use of A/B-testing frameworks, which capture the preferences of human evaluators for specific models. In this paper we describe a human evaluation experiment focused on the biomedical domain (health, biology, chemistry/pharmacology) carried out at Elsevier. In it, a large but not massive (8.8B-parameter) decoder-only foundational transformer, trained on a relatively small (135B tokens) but highly curated collection of Elsevier datasets, is compared against OpenAI's GPT-3.5-turbo and Meta's foundational 7B-parameter Llama 2 model on multiple criteria. Results indicate, even though inter-rater reliability (IRR) scores were generally low, a preference towards GPT-3.5-turbo, and hence towards models that possess conversational abilities, are very large, and were trained on very large datasets. At the same time, they indicate that, for less massive models, training on smaller but well-curated datasets can potentially give rise to viable alternatives in the biomedical domain.
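To make the A/B-preference setup and the IRR notion concrete, the following is a minimal illustrative sketch, not the paper's actual evaluation pipeline: the judgment data, the category labels ("A", "B", "tie"), and the choice of Fleiss' kappa as the agreement statistic are all assumptions made here for illustration only.

```python
# Illustrative sketch (hypothetical data, not the study's pipeline): aggregating
# A/B preference judgments from multiple annotators and estimating inter-rater
# reliability (IRR) with Fleiss' kappa, computed by hand to stay dependency-free.
from collections import Counter

# Hypothetical data: each row is one prompt, each entry one annotator's verdict
# ("A" = model A preferred, "B" = model B preferred, "tie" = no preference).
judgments = [
    ["A", "A", "B"],
    ["B", "B", "B"],
    ["A", "tie", "A"],
    ["B", "A", "B"],
]

categories = ["A", "B", "tie"]
n_items = len(judgments)
n_raters = len(judgments[0])

# Per-item agreement P_i and overall per-category proportions p_j
# (standard Fleiss' kappa formulation).
category_totals = Counter()
P_i = []
for row in judgments:
    counts = Counter(row)
    category_totals.update(counts)
    P_i.append(
        (sum(counts[c] ** 2 for c in categories) - n_raters)
        / (n_raters * (n_raters - 1))
    )

P_bar = sum(P_i) / n_items
P_e = sum((category_totals[c] / (n_items * n_raters)) ** 2 for c in categories)
kappa = (P_bar - P_e) / (1 - P_e)

# Simple preference tally across all judgments.
tally = Counter(v for row in judgments for v in row)
print(f"Preference counts: {dict(tally)}")
print(f"Fleiss' kappa (IRR): {kappa:.3f}")
```

On the hypothetical data above the kappa comes out low (about 0.12) despite a visible lean towards model B, which mirrors the general pattern reported in the abstract: a detectable preference can coexist with low inter-rater agreement.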