Scientific understanding is a fundamental goal of science, allowing us to explain the world. There is currently no good way to measure the scientific understanding of agents, whether these be humans or Artificial Intelligence systems. Without a clear benchmark, it is challenging to evaluate and compare different levels of, and approaches to, scientific understanding. In this Roadmap, we propose a framework for creating a benchmark for scientific understanding, drawing on tools from the philosophy of science. We adopt a behavioral notion according to which genuine understanding should be recognized as the ability to perform certain tasks. We extend this notion by considering a set of questions that can gauge different levels of scientific understanding, covering information retrieval, the ability to arrange information to produce an explanation, and the ability to infer how things would be different under different circumstances. The Scientific Understanding Benchmark (SUB), formed by a set of these tests, allows for the evaluation and comparison of different approaches. Benchmarking plays a crucial role in establishing trust, ensuring quality control, and providing a basis for performance evaluation. By aligning machine and human scientific understanding, we can improve their utility, ultimately advancing scientific understanding and helping to uncover new insights within machines.