Supervised text models are a valuable tool for political scientists but present several obstacles to their use, including the expense of hand-labeling documents, the difficulty of retrieving rare relevant documents for annotation, and copyright and privacy concerns involved in sharing annotated documents. This article proposes a partial solution to these three issues, in the form of controlled generation of synthetic text with large language models. I provide a conceptual overview of text generation, guidance on when researchers should prefer different techniques for generating synthetic text, a discussion of ethics, and a simple technique for improving the quality of synthetic text. I demonstrate the usefulness of synthetic text with three applications: generating synthetic tweets describing the fighting in Ukraine, synthetic news articles describing specified political events for training an event detection system, and a multilingual corpus of populist manifesto statements for training a sentence-level populism classifier.
翻译:监督式文本模型是政治学家的有价值工具,但存在几个障碍,包括手动标注文件的费用、检索罕见相关文件以进行注释的困难以及共享注释文档涉及的版权和隐私问题。本文提出了一个部分解决这三个问题的方案,即通过大型语言模型控制生成合成文本。本文提供了文本生成的概念概述,指导研究人员何时应该优先考虑不同的生成合成文本技术,讨论伦理问题,并提供了一种简单的技术来提高合成文本的质量。作者演示了合成文本的三个应用程序:生成描述乌克兰战斗的合成推文,生成描述指定政治事件的合成新闻文章以训练事件检测系统,以及训练句级民粹主义分类器的多语种流派宣言陈述语料库。