Our intention is to provide a definitive reference on what it would take to safely make use of generative/predictive models in the absence of a solution to the Eliciting Latent Knowledge problem. Furthermore, we believe that large language models can be understood as such predictive models of the world, and that this conceptualization raises significant opportunities for their safe yet powerful use via carefully conditioning them to predict desirable outputs. Unfortunately, such approaches also raise a variety of potentially fatal safety problems, particularly surrounding situations where predictive models predict the output of other AI systems, potentially unbeknownst to us. There are, however, numerous potential solutions to such problems, primarily via carefully conditioning models to predict the things we want (e.g. humans) rather than the things we don't (e.g. malign AIs). Furthermore, due to the simplicity of the prediction objective, we believe that predictive models present the easiest inner alignment problem that we are aware of. As a result, we think that conditioning approaches for predictive models represent the safest known way of eliciting human-level and slightly superhuman capabilities from large language models and other similar future models.