Unsupervised learning seeks to uncover patterns in data. However, different kinds of noise may impede the discovery of useful substructure in real-world time-series data. In this work, we focus on mitigating the interference of left-censorship in the task of clustering. We provide conditions under which clusters and left-censorship may be identified; motivated by this result, we develop a deep generative, continuous-time model of time-series data that clusters while correcting for censorship time. On synthetic data, we demonstrate accurate, stable, and interpretable results that outperform several benchmarks. To showcase the utility of our framework on real-world problems, we study how left-censorship can adversely affect the task of disease phenotyping, where ignoring it amounts to the often incorrect assumption that longitudinal patient data are aligned by disease stage. In reality, patients at the time of diagnosis are at different stages of the disease, both late and early, because of differences in when patients seek medical care, and this discrepancy can confound unsupervised learning algorithms. On two clinical datasets, our model corrects for this form of censorship and recovers known clinical subtypes.
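To make the effect of left-censorship on clustering concrete, the following is a minimal toy sketch and not the deep generative model described above. It assumes two hypothetical subtype trajectories that are known in advance (the model in this work would have to learn them), noise-free observations, and a small grid of candidate delays; all function names, curve shapes, and parameter values are illustrative assumptions. A patient's series is indexed by time since first observation, so its true disease time is the observed time plus an unknown delay.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical subtype trajectories as functions of true disease time
# (assumed known here purely for illustration).
SUBTYPE_CURVES = {
    0: lambda t: sigmoid(t - 6.0),          # slow, late-rising biomarker
    1: lambda t: sigmoid(2.0 * (t - 2.0)),  # fast, early-rising biomarker
}


def squared_error(values, obs_times, curve, delay):
    """Sum of squared residuals when the series is shifted by `delay`."""
    return sum((y - curve(t + delay)) ** 2 for y, t in zip(values, obs_times))


def assign_naive(values, obs_times):
    """Pick the subtype assuming the first observation is disease onset."""
    return min(
        SUBTYPE_CURVES,
        key=lambda k: squared_error(values, obs_times, SUBTYPE_CURVES[k], 0.0),
    )


def assign_with_delay(values, obs_times, delay_grid):
    """Jointly pick (subtype, delay) by exhaustive search over the grid."""
    candidates = [(k, d) for k in SUBTYPE_CURVES for d in delay_grid]
    return min(
        candidates,
        key=lambda kd: squared_error(
            values, obs_times, SUBTYPE_CURVES[kd[0]], kd[1]
        ),
    )


# A subtype-0 patient first observed 4 time units after disease onset.
obs_times = [0.0, 1.0, 2.0, 3.0, 4.0]
true_subtype, true_delay = 0, 4.0
values = [SUBTYPE_CURVES[true_subtype](t + true_delay) for t in obs_times]

delay_grid = [0.5 * i for i in range(17)]  # candidate delays 0.0, 0.5, ..., 8.0

naive_subtype = assign_naive(values, obs_times)
joint_subtype, joint_delay = assign_with_delay(values, obs_times, delay_grid)

print("naive assignment:", naive_subtype)
print("joint assignment:", joint_subtype, "with delay", joint_delay)
```

In this toy setting, treating the first observation as disease onset places the late-presenting subtype-0 patient in subtype 1, whereas searching jointly over subtype and delay recovers both the true subtype and the true delay. The recovery depends on the two curves not being time-shifted copies of one another; if they were, subtype and delay could not be told apart, which is why identifiability conditions matter.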