This paper addresses the situation in which treatment effects are reported using educational or psychological outcome measures composed of multiple questions or "items." A distinction is made between a treatment effect on the construct being measured, which is referred to as impact, and item-specific treatment effects that are not due to impact, which are referred to as differential item functioning (DIF). By definition, impact generalizes to other measures of the same construct (i.e., measures that use different items), whereas DIF depends on the specific items that make up the outcome measure. To distinguish these two cases, two estimators of impact are compared: an estimator that naively aggregates over items, and a less efficient one that is highly robust to DIF. The null hypothesis that both are consistent estimators of the true treatment impact leads to a Hausman-like specification test of whether the naive estimate is affected by item-level variation that would not be expected to generalize beyond the specific outcome measure used. The performance of the test is illustrated with simulation studies and a re-analysis of 34 item-level datasets from 22 randomized evaluations of educational interventions. In the empirical example, the dependence of reported effect sizes on the type of outcome measure (researcher-developed or independently developed) was substantially reduced after accounting for DIF. Implications for the ongoing debate about the role of researcher-developed assessments in the education sciences are discussed.