Differential expression (DE) plays a fundamental role in illuminating the molecular mechanisms driving a difference between groups (e.g., due to treatment or disease). While any analysis is run on particular cells/samples, the intent is to generalize to future occurrences of the treatment or disease. Implicitly, this step is justified by assuming that present and future samples are independent and identically distributed draws from the same population. Though this assumption is always false, we hope that any deviation from it is small enough that (A) the conclusions of the analysis still hold and (B) standard tools like standard error, significance, and power still reflect generalizability. Conversely, we might worry about these deviations, and about reliance on standard tools, if conclusions could be substantively changed by dropping a very small fraction of the data. While checking every small subset is computationally intractable, recent work develops an approximation to identify when such an influential subset exists. Building on this work, we develop a metric for the dropping-data robustness of DE; namely, we cast the analysis in a form suitable to the approximation, extend the approximation to models with data-dependent hyperparameters, and extend the notion of a data point from a single cell to a pseudobulk observation. We then overcome the inherent non-differentiability of gene set enrichment analysis to develop an additional approximation for the robustness of top gene sets. We assess the robustness of DE for published single-cell RNA-seq data and discover that thousands of genes can have their results flipped by dropping <1% of the data, including hundreds that are sensitive to dropping a single cell (0.07%). Surprisingly, this non-robustness extends to high-level takeaways: half of the top 10 gene sets can be changed by dropping 1-2% of cells, and 2 of the 10 can be changed by dropping a single cell.