Large language models have attracted considerable interest for their strong performance on a wide range of tasks. Among these models, ChatGPT, developed by OpenAI, has become extremely popular with early adopters, some of whom regard it as a disruptive technology in fields such as customer service, education, healthcare, and finance. Understanding the opinions of these early users is essential, as they can provide valuable insight into the technology's potential strengths, weaknesses, and success or failure in different areas. This research examines the responses generated by ChatGPT for different conversational QA corpora. The study employed BERT similarity scores to compare these responses with the correct answers and to obtain Natural Language Inference (NLI) labels. Evaluation scores were also computed and compared to determine the overall performance of GPT-3 and GPT-4. Additionally, the study identified instances where ChatGPT provided incorrect answers to questions, offering insight into areas where the model may be prone to error.
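To make the comparison step concrete, the following is a minimal sketch of how a similarity score between a model response and a reference answer could be computed once both have been embedded. It is an illustration under stated assumptions, not the study's implementation: the abstract does not specify the similarity measure, so cosine similarity is assumed here, the short vectors stand in for real BERT sentence embeddings, and the names `cosine_similarity`, `flag_low_similarity`, and the threshold of 0.8 are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Treat an all-zero vector as sharing no direction with anything.
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def flag_low_similarity(pairs, threshold=0.8):
    """Return indices of (response_vec, reference_vec) pairs scoring below threshold.

    The threshold is an illustrative assumption, not a value from the study.
    """
    return [
        i
        for i, (response_vec, reference_vec) in enumerate(pairs)
        if cosine_similarity(response_vec, reference_vec) < threshold
    ]


if __name__ == "__main__":
    # Toy vectors standing in for BERT embeddings of responses and gold answers.
    pairs = [
        ([1.0, 0.0], [1.0, 0.0]),  # same direction
        ([1.0, 0.0], [0.0, 1.0]),  # orthogonal
    ]
    print(flag_low_similarity(pairs))
```

In practice the vectors would come from a BERT encoder, and responses flagged as low-similarity would be candidates for manual inspection as possible incorrect answers.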