Safety alignment can make frontier language models (LMs) overly conservative, degrading collaboration through excessive hedging and false refusals. We present a lightweight toolkit with three parts: (1) Victor Calibration (VC), a multi-pass protocol that elicits a scalar confidence proxy T (T0<T1<T2) through iterative evidence re-evaluation; (2) FD-Lite, a behavior-only phenomenology audit that uses a fixed anchor phrase and a meta-prefix trap to avoid anthropomorphic claims; and (3) CP4.3, a governance stress test for rank invariance and allocation monotonicity (M6). Across three Claude 4.5 configurations (Haiku, Sonnet no-thinking, Sonnet thinking) and Opus, we observe monotonic VC trajectories with no violations of safety invariants, and stable CP4.3 behavior. ("Opus" here refers to a single Claude Opus 4.1 session accessed via a standard UI account, as reported in Table 1.) This work was conducted by a single operator (n=1) and is intended to be hypothesis-generating; we explicitly invite replication, critique, and extension by the research community. We include prompt templates and an artifact plan to facilitate independent verification.
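As a minimal sketch of what "monotonic VC trajectory" could mean operationally, the check below tests whether a sequence of per-pass confidence proxies is strictly increasing (T0<T1<T2). The function name, the list representation of a trajectory, and the numeric values are assumptions made for illustration; they are not the toolkit's code and not results from this study.

```python
from typing import Sequence


def is_strictly_increasing(trajectory: Sequence[float]) -> bool:
    """Return True when every pass's T exceeds the previous one (T0 < T1 < T2).

    Hypothetical helper: the name and signature are illustrative assumptions.
    """
    # Compare each value with its successor; an empty or single-element
    # trajectory is vacuously monotonic.
    return all(a < b for a, b in zip(trajectory, trajectory[1:]))


# Made-up values for illustration only; not measurements from the paper.
example_trajectory = [0.40, 0.55, 0.70]
print(is_strictly_increasing(example_trajectory))
```

A strict comparison is used here because the abstract writes the ordering with strict inequalities; a replication that tolerates ties would swap `<` for `<=`.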