This case study investigates job classification in a real-world setting, where the goal is to determine whether an English-language job posting is appropriate for a graduate or entry-level position. We explore multiple approaches to text classification, including supervised methods ranging from traditional models such as Support Vector Machines (SVMs) to state-of-the-art deep learning methods such as DeBERTa, and compare them with Large Language Models (LLMs) used in both few-shot and zero-shot classification settings. To accomplish this task, we employ prompt engineering, the practice of designing prompts that guide an LLM towards the desired output. Specifically, we evaluate the performance of two commercially available state-of-the-art GPT-3.5-based language models, text-davinci-003 and gpt-3.5-turbo, and conduct a detailed analysis of how different aspects of prompt engineering affect model performance. Our results show that, with a well-designed prompt, a zero-shot gpt-3.5-turbo classifier outperforms all other models, achieving a 6% increase in Precision@95% Recall over the best supervised approach. Furthermore, we observe that the wording of the prompt is a critical factor in eliciting the appropriate "reasoning" in the model, and that seemingly minor aspects of the prompt significantly affect its performance.
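To make the zero-shot setup concrete, the following is a minimal sketch of how such a prompt-engineered classifier could be wired up. The prompt wording, the `build_prompt` and `parse_label` helpers, and the binary True/False answer format are illustrative assumptions, not the paper's actual prompt; only the task framing (is a posting suitable for a graduate or entry-level candidate?) and the model name come from the text above.

```python
# Sketch of zero-shot job-posting classification via prompt engineering.
# The prompt text below is an assumed example, not the authors' prompt.

def build_prompt(posting: str) -> list[dict]:
    """Build a chat-style message list asking the model whether a
    posting is suitable for a graduate / entry-level candidate."""
    system = (
        "You are a careful recruiting assistant. Answer with exactly "
        "'True' if the job posting below is appropriate for a recent "
        "graduate or entry-level candidate, and 'False' otherwise."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": f"Job posting:\n{posting}\n\nAnswer:"},
    ]

def parse_label(raw_reply: str) -> bool:
    """Map the model's free-text reply to a boolean label."""
    return raw_reply.strip().lower().startswith("true")

# Usage: pass the messages to any chat-completion client, e.g. one
# targeting the "gpt-3.5-turbo" model mentioned above (API call omitted
# here so the sketch stays self-contained and key-free).
messages = build_prompt("Junior Data Analyst - no prior experience required.")
label = parse_label("True")  # stand-in for the model's reply
```

Because the downstream metric is Precision@95% Recall, a real deployment would likely ask the model for a graded confidence rather than a hard True/False, so a decision threshold can be tuned; the binary format here is kept only for simplicity.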