Quantitative analysis of large-scale data is often complicated by the presence of diverse subgroups, which reduces the accuracy of inferences that models make on held-out data. To address the challenge of analyzing heterogeneous data, we introduce DoGR, a method that discovers latent confounders by simultaneously partitioning the data into overlapping clusters (disaggregation) and modeling the behavior within them (regression). When applied to real-world data, our method discovers meaningful clusters and their characteristic behaviors, giving insight into group differences and their impact on the outcome of interest. By accounting for latent confounders, our framework facilitates exploratory analysis of noisy, heterogeneous data and can be used to learn predictive models that generalize better to new data. We provide the code so that others can use DoGR within their data analysis workflows.
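The idea of combining soft (overlapping) cluster membership with a per-cluster regression can be illustrated with a minimal sketch. The code below is not the DoGR implementation: it is a plain mixture of linear regressions fitted by expectation-maximization, used here as a simplified stand-in, and it assigns memberships from outcome residuals only. The function name `fit_soft_cluster_regression`, its parameters, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def fit_soft_cluster_regression(X, y, n_components=2, n_iter=50, seed=0):
    """EM for a mixture of linear regressions (illustrative stand-in, not DoGR).

    Returns per-component coefficients (last column is the intercept) and
    soft memberships, so every point belongs partly to every component.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])  # append an intercept column

    # Random soft initial memberships; each row sums to 1.
    resp = rng.dirichlet(np.ones(n_components), size=n)
    coefs = np.zeros((n_components, d + 1))
    sigma2 = np.ones(n_components)
    weights = np.full(n_components, 1.0 / n_components)

    for _ in range(n_iter):
        # M-step: weighted least squares within each component.
        for k in range(n_components):
            w = resp[:, k]
            A = Xb.T @ (w[:, None] * Xb) + 1e-8 * np.eye(d + 1)
            coefs[k] = np.linalg.solve(A, Xb.T @ (w * y))
            resid_k = y - Xb @ coefs[k]
            sigma2[k] = max((w @ resid_k**2) / max(w.sum(), 1e-12), 1e-6)
            weights[k] = w.mean()

        # E-step: update memberships from the Gaussian residual likelihood.
        resid = y[:, None] - Xb @ coefs.T  # shape (n, n_components)
        log_p = (
            np.log(weights + 1e-12)
            - 0.5 * np.log(2 * np.pi * sigma2)
            - 0.5 * resid**2 / sigma2
        )
        log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)

    return coefs, resp


# Synthetic data (an assumption for the demo): two hidden subgroups
# whose outcomes depend on x with opposite slopes.
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(400, 1))
group = rng.integers(0, 2, size=400)
y = np.where(group == 0, 2.0, -2.0) * X[:, 0] + 0.1 * rng.normal(size=400)

coefs, resp = fit_soft_cluster_regression(X, y, n_components=2)
```

A single pooled regression on this data would report a slope near zero, which is how an unobserved subgroup can hide or distort a trend. Fitting the regressions and the memberships jointly is what lets per-group trends surface.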