We consider the following problem: a large dataset of normal data is available, and we are given a new, possibly quite small, set of data and must decide whether it is also normal or whether it indicates a new phenomenon. This is a novelty detection, or out-of-distribution detection, problem. One example is in medicine, where the normal data come from people with no known disease and the new dataset comes from people with symptoms; other examples arise in security. We solve this problem by training a bidirectional generative adversarial network (BiGAN) on the normal data and modeling its output with a Gaussian graphical model. We then apply universal source coding, or minimum description length (MDL), to that output to decide whether it comes from a new distribution, in an implementation of Kolmogorov and Martin-L\"{o}f randomness. We apply the methodology to both MNIST data and a real-world electrocardiogram (ECG) dataset of healthy subjects and patients with Kawasaki disease, and show better performance in terms of the ROC curve than similar methods.
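The description-length comparison behind this kind of test can be illustrated with a minimal sketch. It is not the implementation evaluated here: it assumes the BiGAN encoder has already mapped each sample to a latent vector, it uses a full-covariance Gaussian as a stand-in for the Gaussian graphical model, and it uses a simple two-part code with a (k/2) log n parameter cost as the universal code. The names `gaussian_codelength`, `atypicality_score`, and `ridge` are hypothetical.

```python
import numpy as np


def gaussian_codelength(z, mean, cov):
    """Ideal codelength in nats (negative log-likelihood) of the rows of z
    under a multivariate Gaussian N(mean, cov)."""
    d = z.shape[1]
    diff = z - mean
    _, logdet = np.linalg.slogdet(cov)
    # Squared Mahalanobis distance of every row.
    maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
    return 0.5 * np.sum(maha + logdet + d * np.log(2.0 * np.pi))


def atypicality_score(z_new, mean0, cov0, ridge=1e-3):
    """Codelength saved by re-describing the batch with its own model.

    Assumption: z_new holds latent vectors (one per row) produced by an
    encoder trained on normal data; mean0 and cov0 were fitted to latent
    vectors of the normal data. A positive score suggests a new distribution.
    """
    n, d = z_new.shape
    # Typical code: the fixed model of normal data, no parameters to send.
    l_typical = gaussian_codelength(z_new, mean0, cov0)
    # Universal (two-part) code: fit a Gaussian to the batch itself and pay
    # (k / 2) * log(n) nats for its k free parameters.
    mean1 = z_new.mean(axis=0)
    cov1 = np.cov(z_new, rowvar=False, bias=True) + ridge * np.eye(d)
    k = d + d * (d + 1) / 2
    l_universal = gaussian_codelength(z_new, mean1, cov1) + 0.5 * k * np.log(n)
    return l_typical - l_universal
```

Under these assumptions, a batch drawn from the normal distribution should give a negative score, because the parameter cost outweighs the small gain from refitting, while a batch from a shifted distribution should give a positive one. Sweeping a threshold on the score is one way to trace out an ROC curve.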