The GPT-3.5 models have demonstrated impressive performance on a variety of Natural Language Processing (NLP) tasks, showcasing strong understanding and reasoning capabilities. However, their robustness and their ability to handle the complexities of the open world have yet to be explored; this exploration is crucial for assessing model stability and is a key aspect of trustworthy AI. In this study, we perform a comprehensive experimental analysis of GPT-3.5's robustness using 21 datasets (about 116K test samples) and 66 text transformations from TextFlint, covering 9 popular Natural Language Understanding (NLU) tasks. Our findings indicate that while GPT-3.5 outperforms existing fine-tuned models on some tasks, it still suffers significant robustness degradation: its average performance drops by up to 35.74\% and 43.59\% on natural language inference and sentiment analysis tasks, respectively. We also show that GPT-3.5 faces specific robustness challenges, including robustness instability, prompt sensitivity, and number sensitivity. These insights are valuable for understanding its limitations and for guiding future research toward addressing these challenges and enhancing GPT-3.5's overall performance and generalization abilities.
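One plausible reading of these degradation figures (a sketch, not the paper's stated definition) is the relative accuracy drop between the original test set and its TextFlint-transformed counterpart:
\[
\Delta_{\text{rel}} = \frac{\mathrm{Acc}_{\text{orig}} - \mathrm{Acc}_{\text{trans}}}{\mathrm{Acc}_{\text{orig}}} \times 100\%.
\]
Under this reading, a model whose accuracy falls from 0.88 on original inputs to 0.50 on transformed inputs would register a drop of about 43\%; the hypothetical accuracy values here are for illustration only.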