This paper introduces RiskCards, a framework for the structured assessment and documentation of risks associated with an application of language models. As with all language, text generated by language models can be harmful, or used to bring about harm. Automating language generation adds both an element of scale and more subtle or emergent undesirable tendencies to the generated text. Prior work establishes a wide variety of language model harms to many different actors: existing taxonomies identify categories of harms posed by language models; benchmarks establish automated tests of these harms; and documentation standards for models, tasks and datasets encourage transparent reporting. However, there is no risk-centric framework for documenting the complexity of a landscape in which some risks are shared across models and contexts, while others are specific, and where certain conditions may be required for risks to manifest as harms. RiskCards address this methodological gap by providing a generic framework for assessing the use of a given language model in a given scenario. Each RiskCard makes clear the routes for the risk to manifest harm, its placement in harm taxonomies, and example prompt-output pairs. While RiskCards are designed to be open-source, dynamic and participatory, we present a "starter set" of RiskCards taken from a broad literature survey, each of which details a concrete risk presentation. Language model RiskCards initiate a community knowledge base which permits the mapping of risks and harms to a specific model or its application scenario, ultimately contributing to a better, safer and shared understanding of the risk landscape.