The growing availability of observational databases such as electronic health records (EHRs) provides unprecedented opportunities for secondary use of such data in biomedical research. However, these data can be error-prone and need to be validated before use. It is usually unrealistic to validate the whole database due to resource constraints. A cost-effective alternative is to implement a two-phase design that validates a subset of patient records that are enriched for information about the research question of interest. Herein, we consider odds ratio estimation under differential outcome and exposure misclassification. We propose optimal designs that minimize the variance of the maximum likelihood odds ratio estimator. We develop a novel adaptive grid search algorithm that can locate the optimal design in a computationally feasible and numerically accurate manner. Because the optimal design requires specification of unknown parameters at the outset and thus is unattainable without prior information, we introduce a multi-wave sampling strategy to approximate it in practice. We demonstrate the efficiency gains of the proposed designs over existing ones through extensive simulations and two large observational studies. We provide an R package and Shiny app to facilitate the use of the optimal designs.
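The abstract names an adaptive grid search but gives no details, so the following is only a rough sketch of the general coarse-to-fine idea, not the authors' algorithm. It assumes the design is an integer allocation of a fixed validation budget across strata, and it uses a hypothetical stand-in objective (`toy_variance`, a sum of weight over stratum sample size) in place of the variance of the maximum likelihood odds ratio estimator. All function names, parameters, and numbers here are illustrative assumptions.

```python
from itertools import product


def toy_variance(alloc, weights):
    # Hypothetical stand-in objective, NOT the paper's variance formula:
    # each stratum contributes weight / (number of validated records).
    return sum(w / n for w, n in zip(weights, alloc))


def candidates(center, step, radius, n_total, sizes, min_per):
    """Yield allocations that sum to n_total, lie within `radius` of
    `center` in every free coordinate, and sit on a grid of width `step`."""
    axes = []
    for c, cap in zip(center[:-1], sizes[:-1]):
        lo = max(min_per, c - radius)
        hi = min(cap, c + radius)
        axes.append(range(lo, hi + 1, step))
    for head in product(*axes):
        # The last stratum absorbs whatever budget remains.
        last = n_total - sum(head)
        if min_per <= last <= sizes[-1]:
            yield head + (last,)


def adaptive_grid_search(objective, n_total, sizes, start_step=16, min_per=1):
    """Coarse-to-fine search: scan a coarse grid, then repeatedly halve the
    grid width and rescan only the neighbourhood of the current best."""
    k = len(sizes)
    base = n_total // k
    # Start from a balanced allocation (assumed feasible for this sketch).
    center = (base,) * (k - 1) + (n_total - base * (k - 1),)
    step, radius = start_step, n_total
    while True:
        pool = list(candidates(center, step, radius, n_total, sizes, min_per))
        pool.append(center)  # never lose the incumbent design
        center = min(pool, key=objective)
        if step == 1:
            return center
        radius, step = step, max(1, step // 2)


# Illustrative inputs: four strata, a budget of 100 validated records.
weights = (1, 4, 9, 16)
sizes = (1000, 1000, 1000, 1000)
best = adaptive_grid_search(lambda a: toy_variance(a, weights), 100, sizes)
print(best, toy_variance(best, weights))
```

Each pass evaluates only a small neighbourhood instead of every feasible allocation, which is what makes a fine final grid affordable. The refinement can miss the global optimum if the objective is not well behaved, and the sketch also presumes that the unknown parameters entering the objective are available, which is the limitation that motivates the multi-wave strategy mentioned in the abstract.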