In many real-world deployments of machine learning, we use a prediction algorithm to choose what data to test next. For example, in the protein design problem, we have a regression model that predicts some real-valued property of a protein sequence, which we use to propose new sequences believed to exhibit higher property values than observed in the training data. Since validating designed sequences in the wet lab is typically costly, it is important to know how much we can trust the model's predictions. In such settings, however, there is a distinct type of distribution shift between the training and test data: one where the training and test data are statistically dependent, as the latter is chosen based on the former. Consequently, the model's error on the test data -- that is, the designed sequences -- has some non-trivial relationship with its error on the training data. Herein, we introduce a method to quantify predictive uncertainty in such settings. We do so by constructing confidence sets for predictions that account for the dependence between the training and test data. The confidence sets we construct have finite-sample guarantees that hold for any prediction algorithm, even when a trained model chooses the test-time input distribution. As a motivating use case, we demonstrate how our method quantifies uncertainty for the predicted fitness of designed proteins using real data sets.
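To make the idea of a confidence set with a finite-sample guarantee concrete, the sketch below shows a generic weighted split-conformal construction for a regression prediction. This is only an illustration under stated assumptions, not the paper's exact algorithm: it assumes a scikit-learn-style regressor with a `predict` method and a hypothetical weight function `likelihood_ratio` standing in for whatever reweighting the method derives from the dependence between the training data and the designed test input.

```python
import numpy as np

def likelihood_ratio(x):
    """Hypothetical placeholder weight. A real method would derive this from
    the training/test dependence induced by the design procedure; returning 1
    everywhere reduces the construction to ordinary split conformal prediction."""
    return 1.0

def weighted_conformal_interval(model, X_cal, y_cal, x_test, alpha=0.1):
    """Confidence interval for model's prediction at x_test, built from a
    held-out calibration set (X_cal, y_cal) disjoint from the training data."""
    # Nonconformity scores: absolute residuals on the calibration set.
    scores = np.abs(y_cal - model.predict(X_cal))

    # Weights for calibration points and the test point, normalized to sum to 1.
    w = np.array([likelihood_ratio(x) for x in X_cal], dtype=float)
    p = np.append(w, likelihood_ratio(x_test))
    p = p / p.sum()

    # Weighted (1 - alpha) quantile of the scores, with the test point's mass
    # placed at +infinity (the usual conformal convention).
    augmented = np.append(scores, np.inf)
    order = np.argsort(augmented)
    cum = np.cumsum(p[order])
    idx = min(np.searchsorted(cum, 1 - alpha), len(cum) - 1)
    q = augmented[order][idx]

    y_hat = model.predict(np.atleast_2d(x_test))[0]
    return y_hat - q, y_hat + q
```

In use, one would fit any regressor on the training sequences, hold out a calibration split, and call `weighted_conformal_interval` on each designed sequence; the width of the returned interval then reflects how much the prediction at that input can be trusted.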