Many applications of machine learning methods involve an iterative protocol in which data are collected, a model is trained, and then outputs of that model are used to choose what data to consider next. For example, one data-driven approach for designing proteins is to train a regression model to predict the fitness of protein sequences, then use it to propose new sequences believed to exhibit greater fitness than observed in the training data. Since validating designed sequences in the wet lab is typically costly, it is important to quantify the uncertainty in the model's predictions. This is challenging because of a characteristic type of distribution shift between the training and test data in the design setting -- one in which the training and test data are statistically dependent, as the latter is chosen based on the former. Consequently, the model's error on the test data -- that is, the designed sequences -- has an unknown and possibly complex relationship with its error on the training data. We introduce a method to quantify predictive uncertainty in such settings. We do so by constructing confidence sets for predictions that account for the dependence between the training and test data. The confidence sets we construct have finite-sample guarantees that hold for any prediction algorithm, even when a trained model chooses the test-time input distribution. As a motivating use case, we demonstrate with several real data sets how our method quantifies uncertainty for the predicted fitness of designed proteins, and can therefore be used to select design algorithms that achieve acceptable trade-offs between high predicted fitness and low predictive uncertainty.
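To make the notion of a finite-sample confidence set concrete, the sketch below shows standard split conformal prediction for a regression model. This is a baseline illustration, not the method introduced here: its coverage guarantee assumes the calibration and test points are exchangeable, which is exactly the assumption the design setting violates when a trained model chooses the test inputs. The names `split_conformal_interval`, `predict`, `X_cal`, `y_cal`, and `x_new` are hypothetical and chosen only for this example.

```python
import numpy as np


def split_conformal_interval(predict, X_cal, y_cal, x_new, alpha=0.1):
    """Return a (lower, upper) interval for the label at x_new.

    Assumption (NOT satisfied in the design setting described above):
    the calibration points and the test point are exchangeable.

    predict : callable mapping a 2D array of inputs to a 1D array of predictions
              (any fitted model; it must not have been trained on X_cal, y_cal).
    X_cal, y_cal : held-out calibration inputs and labels.
    x_new : 1D array, a single test input.
    alpha : target miscoverage level, e.g. 0.1 for 90% coverage.
    """
    # Nonconformity scores: absolute residuals on the calibration set.
    scores = np.abs(np.asarray(y_cal) - predict(np.asarray(X_cal)))
    n = len(scores)

    # Rank with the finite-sample correction: ceil((n + 1) * (1 - alpha)).
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        # Too few calibration points for the requested level:
        # the only valid set is the whole real line.
        return -np.inf, np.inf

    # k-th smallest calibration score.
    q = np.sort(scores)[k - 1]
    center = predict(np.asarray(x_new)[None, :])[0]
    return center - q, center + q
```

Under exchangeability, the returned interval contains the true label with probability at least `1 - alpha`, whatever model sits behind `predict`. When the test input is instead selected using the trained model, as in the design protocol above, this guarantee no longer applies as stated, which is the gap the confidence sets in this work are meant to address.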