Multiple imputation (MI) is a popular and well-established method for handling missing data in multivariate data sets, but its practicality for massive and complex data sets has been questioned. One such data set is the Panel Study of Income Dynamics (PSID), a longstanding and extensive survey of household income and wealth in the United States. Missing data in this survey are currently handled with traditional hot deck methods. We use a sequential regression (chained-equation) approach, implemented in the software IVEware, to multiply impute cross-sectional wealth data in the 2013 PSID, and we compare analyses of the resulting imputed data with results from the current hot deck approach. We describe practical difficulties, such as non-normally distributed variables, skip patterns, categorical variables with many levels, and multicollinearity, together with our approaches to overcoming them. We evaluate imputation quality and validity using internal diagnostics and external benchmarking data. MI improves on the existing hot deck approach by better preserving correlation structures and by yielding efficiency gains. We recommend the practical implementation of MI and expect greater gains when the fraction of missing information is large.
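To make the notions of efficiency and fraction of missing information concrete, the following is a minimal sketch of how estimates from several imputed data sets are typically combined with Rubin's rules. It is not taken from the paper or from IVEware; the function name `pool_rubin` and the input numbers are hypothetical, and the fraction of missing information uses the simple large-sample approximation rather than a degrees-of-freedom-adjusted version.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine M completed-data estimates with Rubin's rules.

    estimates: point estimates, one per imputed data set
    variances: squared standard errors, one per imputed data set
    (pool_rubin is a hypothetical helper, not an IVEware function.)
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")

    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (M - 1 denominator)
    total = within + (1 + 1 / m) * between

    # Large-sample approximation to the fraction of missing information:
    # the share of total variance attributable to the missing data.
    fmi = (1 + 1 / m) * between / total
    return {"estimate": q_bar, "variance": total, "fmi": fmi}


# Made-up illustrative inputs (not PSID results): three imputations.
pooled = pool_rubin([10.0, 12.0, 11.0], [4.0, 4.0, 4.0])
print(pooled)
```

With these made-up inputs the pooled estimate is 11.0 and roughly a quarter of the total variance is attributable to missingness. The larger that share, the more an analysis stands to gain from an imputation method that uses the observed data well.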