This research revisits the classic Turing test and compares recent large language models such as ChatGPT on their ability to reproduce human-level comprehension and compelling text generation. Two task challenges, summarization and question answering, prompt ChatGPT to produce original content (98-99%) from a single text entry, and also to answer the sequential questions originally posed by Turing in 1950. We score the original and generated content against the 2019 OpenAI GPT-2 Output Detector and establish multiple cases in which the generated content proves original and undetectable (98%). In this work, the question of whether a machine can fool a human judge recedes behind the question of how one would prove it. The original contribution of this work presents a metric and a simple grammatical feature set for understanding the writing mechanics of chatbots, evaluating their readability and statistical clarity, engagement, delivery, and overall quality. While Turing's original prose scores at least 14% below the machine-generated output, the question of whether an algorithm displays hints of Turing's truly original thought (the "Lovelace 2.0" test) remains unanswered and potentially unanswerable for now.