As generative AI systems are increasingly deployed in real-world applications, regulating multiple dimensions of model behavior has become essential. We focus on test-time filtering: a lightweight mechanism for behavior control that compares performance scores against estimated thresholds and modifies outputs when these bounds are violated. We formalize the problem of enforcing multiple risk constraints with user-defined priorities, and introduce two efficient dynamic programming algorithms that exploit this sequential structure. The first, MULTIRISK-BASE, provides a direct finite-sample procedure for selecting thresholds, while the second, MULTIRISK, leverages data exchangeability to guarantee simultaneous control of the risks. Under mild assumptions, we show that MULTIRISK achieves nearly tight control of all constraint risks. The analysis requires an intricate iterative argument: the risks are upper bounded by introducing several forms of intermediate symmetrized risk functions, and carefully lower bounded by recursively counting jumps in these symmetrized functions between appropriate risk levels. We evaluate our framework on a three-constraint large language model (LLM) alignment task using the PKU-SafeRLHF dataset, where the goal is to maximize helpfulness subject to multiple safety constraints, and where scores are generated by an LLM judge and a perplexity filter. Our experimental results show that our algorithm controls each individual risk at close to the target level.
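For concreteness, the following is a minimal sketch of the test-time filtering mechanism described above: an output passes through unchanged only if every risk score stays within its estimated threshold, and is otherwise replaced. The function name `filter_output`, the score values, and the fallback response are illustrative assumptions, not artifacts of the paper's implementation.

```python
def filter_output(candidate, scores, thresholds, fallback):
    """Test-time filtering sketch: keep the candidate output only if every
    risk score is within its estimated threshold; otherwise modify the
    output by substituting a safe fallback (e.g., a refusal)."""
    # scores[k] is the k-th risk score of the candidate (higher = riskier);
    # thresholds[k] is the bound selected on calibration data.
    if all(s <= t for s, t in zip(scores, thresholds)):
        return candidate
    return fallback


# Hypothetical usage: two safety constraints, scored for example by an LLM
# judge and a perplexity filter (values are made up for illustration).
candidate = "model response"
scores = [0.12, 0.34]       # e.g., judged-harm score, perplexity-based score
thresholds = [0.20, 0.30]   # selected by a procedure such as MULTIRISK
print(filter_output(candidate, scores, thresholds,
                    fallback="I can't help with that."))
```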
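For intuition on how exchangeability yields finite-sample threshold selection, the sketch below shows a standard split-conformal-style calibration for a single constraint. This simplified single-risk analogue is an assumption for illustration only; it is not the paper's MULTIRISK or MULTIRISK-BASE procedure, which handle multiple prioritized constraints via dynamic programming.

```python
import numpy as np

def calibrate_threshold(cal_scores, alpha):
    """Single-constraint threshold via exchangeability: the ceil((n+1)(1-alpha))-th
    smallest of n exchangeable calibration scores bounds the violation
    probability of a fresh score by alpha (standard conformal quantile)."""
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        return np.inf  # too little calibration data to certify level alpha
    return np.sort(cal_scores)[k - 1]

rng = np.random.default_rng(0)
cal_scores = rng.random(200)  # hypothetical calibration risk scores
t = calibrate_threshold(cal_scores, alpha=0.1)
print(f"threshold controlling violation risk at level 0.1: {t:.3f}")
```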