High levels of missing data and strong class imbalance are ubiquitous challenges that often occur together in real-world time-series data. Existing methods address these problems separately and frequently make significant assumptions about the underlying data generation process in order to lessen the impact of missing information. In this work, we instead demonstrate how a general self-supervised training method, namely Autoregressive Predictive Coding (APC), can be leveraged to overcome both missing data and class imbalance simultaneously without strong assumptions. Specifically, on a synthetic dataset, we show that APC substantially improves upon standard baselines, with the greatest gains in the combined setting of high missingness and severe class imbalance. We further apply APC to two real-world medical time-series datasets and show that it improves classification performance in all settings, ultimately achieving state-of-the-art AUPRC results on the Physionet benchmark.