This research revisits the classic Turing test and compares recent large language models such as ChatGPT on their ability to reproduce human-level comprehension and compelling text generation. Two task challenges -- summarization and question answering -- prompt ChatGPT to produce original content (98-99%) from a single text entry, as well as answers to the sequential questions originally posed by Turing in 1950. The question of a machine fooling a human judge recedes in this work relative to the question of "how would one prove it?" The original contribution of the work is a metric and a simple grammatical rule set for understanding the writing mechanics of chatbots, evaluating their readability, statistical clarity, engagement, delivery, and overall quality. While Turing's original prose scores at least 14% below the machine-generated output, the question of whether an algorithm can display hints of Turing's truly original thought (the "Lovelace 2.0" test) remains unanswered and potentially unanswerable for now.
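The abstract does not spell out the scoring rubric itself; as a rough illustration of the kind of readability metric assumed in such comparisons, the sketch below computes the standard Flesch Reading Ease score for a passage using a naive syllable heuristic. The function names and the syllable-counting rule are illustrative assumptions, not the paper's implementation.

import re

def count_syllables(word: str) -> int:
    # Naive heuristic (assumption, not the paper's method): count vowel groups,
    # dropping one for a silent trailing 'e'.
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1:
        count -= 1
    return max(count, 1)

def flesch_reading_ease(text: str) -> float:
    # Standard Flesch Reading Ease:
    # 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / len(sentences)) - 84.6 * (syllables / len(words))

if __name__ == "__main__":
    # Opening lines of Turing (1950) as a sample passage.
    sample = ("I propose to consider the question, 'Can machines think?' "
              "This should begin with definitions of the meaning of the terms 'machine' and 'think'.")
    print(f"Flesch Reading Ease: {flesch_reading_ease(sample):.1f}")

Scores of this kind could then be compared side by side for Turing's prose and the machine-generated output, which is the sort of comparison the 14% gap above refers to.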