Missing values occur in nearly all clinical studies because data for a variable or question are either not collected or not available. Inadequate handling of missing values can bias results and reduce statistical power in the analysis. Existing imputation models usually either ignore privacy concerns or fail to exploit the inherent correlations across multiple features. Healthcare applications often involve high-dimensional datasets, sometimes with small sample sizes, that call for more effective augmentation or imputation techniques. Moreover, imputation and augmentation are traditionally carried out as separate processes, although both imputing missing values and augmenting data can significantly improve generalisation and avoid bias in machine learning models. This work proposes a Bayesian approach for imputing missing values and creating augmented samples in high-dimensional healthcare data. We propose folded Hamiltonian Monte Carlo (F-HMC) with Bayesian inference as a more practical way to handle cross-dimensional relations: it applies a random walk and Hamiltonian dynamics to adapt the posterior distribution and generate large-scale samples. Applied to a cancer symptom assessment dataset, the proposed method is confirmed to improve data quality as measured by precision, accuracy, recall, F1 score, and the propensity metric.
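To make the sampling idea concrete, the following is a minimal sketch of standard Hamiltonian Monte Carlo used for imputation. It is not the folded variant (F-HMC) proposed in this work, whose details are not given here. It assumes, purely for illustration, that the features follow a multivariate Gaussian with known mean and covariance, and it draws the missing entry of one record from its conditional distribution given the observed entries. The function names `hmc_sample` and `make_conditional_target`, the toy covariance matrix, and the step-size settings are all hypothetical choices.

```python
import numpy as np


def hmc_sample(log_prob, grad_log_prob, x0, n_samples=2000,
               step_size=0.15, n_leapfrog=8, rng=None):
    """Plain HMC with a leapfrog integrator and unit mass matrix."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    samples = np.empty((n_samples, x.size))
    accepted = 0
    for i in range(n_samples):
        p0 = rng.standard_normal(x.size)      # resample momentum
        x_new, p = x.copy(), p0.copy()
        # Leapfrog: half momentum step, alternating full steps, half step.
        p += 0.5 * step_size * grad_log_prob(x_new)
        for _ in range(n_leapfrog - 1):
            x_new += step_size * p
            p += step_size * grad_log_prob(x_new)
        x_new += step_size * p
        p += 0.5 * step_size * grad_log_prob(x_new)
        # Metropolis correction based on the change in total energy.
        h_old = -log_prob(x) + 0.5 * p0 @ p0
        h_new = -log_prob(x_new) + 0.5 * p @ p
        if np.log(rng.uniform()) < h_old - h_new:
            x = x_new
            accepted += 1
        samples[i] = x
    return samples, accepted / n_samples


def make_conditional_target(mu, cov, row):
    """Unnormalised log density of the missing entries given the observed ones.

    Assumption: the full record is Gaussian with known mean `mu` and
    covariance `cov`; missing entries in `row` are marked with NaN.
    """
    missing = np.isnan(row)
    prec = np.linalg.inv(cov)

    def fill(z):
        full = row.copy()
        full[missing] = z
        return full

    def log_prob(z):
        d = fill(z) - mu
        return -0.5 * d @ prec @ d

    def grad_log_prob(z):
        return -(prec @ (fill(z) - mu))[missing]

    return log_prob, grad_log_prob, missing


# Toy record with three correlated features; the second one is missing.
mu = np.zeros(3)
cov = np.array([[1.0, 0.6, 0.3],
                [0.6, 1.0, 0.5],
                [0.3, 0.5, 1.0]])
row = np.array([1.0, np.nan, -0.5])

log_prob, grad_log_prob, missing = make_conditional_target(mu, cov, row)
samples, accept_rate = hmc_sample(log_prob, grad_log_prob,
                                  x0=np.zeros(missing.sum()),
                                  rng=np.random.default_rng(0))

# Closed-form conditional mean, used only to sanity-check the sampler.
obs = ~missing
cond_mean = mu[missing] + cov[np.ix_(missing, obs)] @ np.linalg.solve(
    cov[np.ix_(obs, obs)], row[obs] - mu[obs])
print("HMC mean:", samples.mean(axis=0), "analytic mean:", cond_mean)
```

Each row of `samples` is one plausible imputation of the missing entry, so the spread of the draws reflects imputation uncertainty rather than a single point estimate. In a realistic setting the Gaussian parameters would themselves be unknown and would need to be inferred, which this sketch does not attempt.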