Data-adaptive (machine learning-based) effect estimators are increasingly used to reduce bias in high-dimensional bioinformatic and clinical studies (e.g. real-world data, target trials, -omic discovery). Their relative statistical efficiency (high power) is particularly valuable in these contexts, since sample sizes are often limited by practical and cost concerns. However, these methods are subject to technical limitations that are dataset-specific and involve computational trade-offs. It is therefore challenging for analysts to identify when such methods offer benefits, or to choose among them. We present extensive simulation studies of several cutting-edge estimators, evaluating both performance and computation time. Critically, rather than use arbitrary simulated data, we generate synthetic datasets that mimic the observed data structure of a real molecular epidemiologic cohort (plasmode simulation). We find that machine learning approaches may not always be indicated in such data settings, and that performance is highly context-dependent. We present a user-friendly Shiny app, REFINE2 (Realistic Evaluations of Finite sample INference using Efficient Estimators), that enables analysts to simulate synthetic data from their own datasets and directly evaluate the performance of several cutting-edge algorithms in those settings. This tool may greatly facilitate the proper selection and implementation of machine-learning-based effect estimators in bioinformatic and clinical study contexts.
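To make the plasmode idea concrete, the following is a minimal sketch, not the REFINE2 implementation. It assumes a linear outcome model, a binary exposure, and a user-chosen true effect; the function name `plasmode_datasets`, its parameters, and the toy cohort are all hypothetical. Covariates and exposure are resampled from the observed rows so their joint structure is preserved, and outcomes are then simulated with a known exposure effect, against which any estimator can be scored.

```python
import numpy as np

def plasmode_datasets(X, a, y, true_effect, n_sims, n_draw, seed=0):
    """Yield (X_b, a_b, y_b) synthetic datasets with a known exposure effect."""
    rng = np.random.default_rng(seed)
    # Fit a simple linear outcome model on the observed data (assumed form).
    design = np.column_stack([np.ones(len(y)), a, X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid_sd = np.std(y - design @ coef)
    for _ in range(n_sims):
        # Resample rows with replacement: covariate/exposure structure is kept.
        idx = rng.integers(0, len(y), size=n_draw)
        X_b, a_b = X[idx], a[idx]
        # Outcome = fitted covariate part + injected known effect + noise.
        mu = coef[0] + true_effect * a_b + X_b @ coef[2:]
        y_b = mu + rng.normal(0.0, resid_sd, size=n_draw)
        yield X_b, a_b, y_b

# Toy stand-in for an observed cohort (hypothetical data, not from the study).
rng = np.random.default_rng(1)
n = 300
X = rng.normal(size=(n, 3))
a = (rng.random(n) < 1 / (1 + np.exp(-X[:, 0]))).astype(float)
y = 0.5 * a + X @ np.array([1.0, -0.5, 0.2]) + rng.normal(size=n)

# Score a candidate estimator (here plain OLS) against the known truth.
true_effect = 2.0
estimates = []
for X_b, a_b, y_b in plasmode_datasets(X, a, y, true_effect, n_sims=200, n_draw=300):
    design_b = np.column_stack([np.ones(len(y_b)), a_b, X_b])
    beta, *_ = np.linalg.lstsq(design_b, y_b, rcond=None)
    estimates.append(beta[1])
bias = np.mean(estimates) - true_effect
```

In this sketch the ordinary least squares fit could be swapped for any data-adaptive estimator, with bias, variance, coverage, and run time compared across the same synthetic datasets.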