This paper details the methodology behind CoPE, a policy-steerable small language model capable of fast and accurate content labeling. We present a novel training curriculum, Contradictory Example Training, that enables the model to learn policy interpretation rather than mere policy memorization. We also present a novel method for generating content policies, called Binocular Labeling, which enables rapid construction of unambiguous training datasets. When evaluated across seven different harm areas, CoPE matches or exceeds the accuracy of frontier models at only 1% of their size. We openly release a 9-billion-parameter version of the model that runs on a single consumer-grade GPU. Models like CoPE represent a paradigm shift for classifier systems: by turning a machine-learning task into a policy-writing task, CoPE opens up new design possibilities for the governance of online platforms.