As AI algorithms increasingly participate in daily activities that used to be the sole province of humans, we are inevitably called upon to consider how similar machines really are to us. To address this question, we turn to the Turing test and systematically benchmark current AIs on their ability to imitate humans. We establish a methodology for evaluating humans versus machines in Turing-like tests and systematically examine a representative set of selected domains, parameters, and variables. The experiments involved testing 769 human agents, 24 state-of-the-art AI agents, 896 human judges, and 8 AI judges in 21,570 Turing tests across 6 tasks encompassing vision and language modalities. Surprisingly, the results reveal that current AIs are not far from being able to impersonate humans, fooling human judges across different ages, genders, and educational levels in complex visual and language challenges. In contrast, simple AI judges outperform human judges in distinguishing human answers from machine answers. The curated large-scale Turing test datasets introduced here, together with their evaluation metrics, provide valuable insights for assessing whether an agent is human or not. The proposed formulation for benchmarking the human-imitation ability of current AIs paves the way for the research community to extend Turing tests to other research areas and conditions. All source code and data are publicly available at https://tinyurl.com/8x8nha7p
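To make the judge-versus-agent evaluation concrete, the following is a minimal illustrative sketch (not the paper's released code) of how per-trial verdicts could be scored. The `Trial` structure and the metric names `judge_accuracy` and `machine_pass_rate` are assumptions for illustration: the first measures how often a judge correctly separates human from machine answers, the second measures how often a machine answer is mistaken for a human one.

```python
# Hypothetical sketch of scoring Turing-style trials; names and structure are illustrative only.
from dataclasses import dataclass

@dataclass
class Trial:
    agent_is_human: bool   # ground truth: was the response produced by a human agent?
    judged_human: bool     # the judge's verdict on that response

def judge_accuracy(trials):
    """Fraction of trials in which the judge correctly identified human vs. machine."""
    return sum(t.agent_is_human == t.judged_human for t in trials) / len(trials)

def machine_pass_rate(trials):
    """Fraction of machine-generated responses that the judge labeled as human,
    i.e., how often an AI agent successfully passed as a human."""
    machine_trials = [t for t in trials if not t.agent_is_human]
    if not machine_trials:
        return 0.0
    return sum(t.judged_human for t in machine_trials) / len(machine_trials)

# Toy usage: three trials judged by one judge.
trials = [Trial(True, True), Trial(False, True), Trial(False, False)]
print(judge_accuracy(trials))     # 2 of 3 verdicts correct
print(machine_pass_rate(trials))  # 1 of 2 machine responses passed as human
```

Under this framing, the abstract's headline findings correspond to a high machine pass rate against human judges and a higher judge accuracy for simple AI judges than for human judges.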