We present a mathematical framework and computational methods to optimally design a finite number of sequential experiments. We formulate this sequential optimal experimental design (sOED) problem as a finite-horizon partially observable Markov decision process (POMDP) in a Bayesian setting and with information-theoretic utilities. The framework is built to accommodate continuous random variables, general non-Gaussian posteriors, and expensive nonlinear forward models. sOED then seeks an optimal design policy that incorporates elements of both feedback and lookahead, generalizing the suboptimal batch and greedy designs. We solve for the sOED policy numerically via policy gradient (PG) methods from reinforcement learning, and derive and prove the PG expression for sOED. Adopting an actor-critic approach, we parameterize the policy and value functions using deep neural networks and improve them using gradient estimates produced from simulated episodes of designs and observations. The overall PG-sOED method is validated on a linear-Gaussian benchmark, and its advantages over batch and greedy designs are demonstrated through a contaminant source inversion problem in a convection-diffusion field.
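For a concrete picture of the objective being optimized, the following is a minimal sketch in the spirit of the abstract; the notation (belief state $x_k$, design $d_k$, observation $y_k$, stage reward $g_k$, horizon $N$, policy parameters $w$) is assumed here for illustration and is not taken from the paper itself. A deterministic policy $\pi_{k,w}$ maps the belief state to a design, the expected cumulative utility is maximized over $w$, and a deterministic-policy-gradient-style expression drives the parameter updates:

```latex
% Minimal sOED sketch (assumed notation, not the paper's exact derivation).
% Stage rewards g_k may encode, e.g., incremental information gain;
% g_N is a terminal reward on the final belief state x_N.
\begin{align}
  U(w) &= \mathbb{E}_{y_0,\dots,y_{N-1}}\!\left[
            \sum_{k=0}^{N-1} g_k(x_k, d_k, y_k) \;+\; g_N(x_N)
          \right],
          \qquad d_k = \pi_{k,w}(x_k), \\
  \nabla_w U(w) &= \mathbb{E}\!\left[
            \sum_{k=0}^{N-1} \nabla_w \pi_{k,w}(x_k)\,
            \nabla_{d}\, Q_k(x_k, d)\Big\vert_{d=\pi_{k,w}(x_k)}
          \right],
\end{align}
```

where $Q_k$ denotes a stage action-value function. In an actor-critic realization of this idea, $Q_k$ would be approximated by the critic network from simulated episodes while $\pi_{k,w}$ is the actor network; the exact PG expression derived and proved in the paper should be consulted for the authoritative form.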