While there is a rich literature on robust methodologies for contamination in continuously distributed data, contamination in categorical data is largely overlooked. This is regrettable because many datasets are categorical and oftentimes suffer from contamination. Examples include inattentive responding and bot responses in questionnaires or zero-inflated count data. We propose a novel class of contamination-robust estimators of models for categorical data, coined $C$-estimators (``$C$'' for categorical). We show that the countable and possibly finite sample space of categorical data results in non-standard theoretical properties. Notably, in contrast to classic robustness theory, $C$-estimators can be simultaneously robust \textit{and} fully efficient at the postulated model. In addition, a certain particularly robust specification fails to be asymptotically Gaussian at the postulated model, but is asymptotically Gaussian in the presence of contamination. We furthermore propose a diagnostic test to identify categorical outliers and demonstrate the enhanced robustness of $C$-estimators in a simulation study.
翻译:暂无翻译