Autoencoders have been proposed as a powerful tool for model-independent anomaly detection in high-energy physics. The operating principle is that events which do not belong to the space of training data will be reconstructed poorly, thus flagging them as anomalies. We point out that in a variety of examples of interest, the connection between large reconstruction error and anomalies is not so clear. In particular, for datasets with nontrivial topology, there will always be points that erroneously seem anomalous due to global issues. Conversely, neural networks typically have an inductive bias, or prior, toward local interpolation, such that undersampled or rare events may be reconstructed with small error despite actually being the desired anomalies. Taken together, these facts are in tension with the simple picture of the autoencoder as an anomaly detector. Using a series of illustrative low-dimensional examples, we show explicitly how the intrinsic and extrinsic topology of the dataset affects the behavior of an autoencoder, and how this topology is manifested in the latent space representation during training. We ground this analysis in a discussion of a mock "bump hunt" in which the autoencoder fails to identify an anomalous "signal" for reasons tied to the intrinsic topology of $n$-particle phase space.
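The operating principle described above can be sketched with a minimal, hypothetical example (not from the paper): the optimal *linear* autoencoder is just PCA, so we can fit a rank-1 linear encoder/decoder to 2-D "background" data lying near a 1-D subspace and use the reconstruction error as the anomaly score. A point off the learned subspace reconstructs poorly and is flagged; all names and data here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Background" training events: 2-D points near the line y = 2x (toy data).
x = rng.normal(size=(500, 1))
train = np.hstack([x, 2 * x]) + 0.05 * rng.normal(size=(500, 2))

# Fit the 1-D linear autoencoder via SVD: the top principal direction
# is the optimal 1-D latent subspace for a linear encoder/decoder.
mean = train.mean(axis=0)
_, _, vt = np.linalg.svd(train - mean, full_matrices=False)
v = vt[0]  # learned latent direction

def reconstruction_error(p):
    """Anomaly score: squared distance between p and its reconstruction."""
    centered = p - mean
    recon = mean + (centered @ v) * v  # encode to 1-D, then decode
    return float(np.sum((p - recon) ** 2))

background_event = np.array([1.0, 2.0])  # lies on the training manifold
anomalous_event = np.array([2.0, -1.0])  # lies far off the manifold
print(reconstruction_error(background_event))  # small
print(reconstruction_error(anomalous_event))   # large
```

The paper's point is that this clean separation can break down: for data with nontrivial topology (e.g. a circle, which no 1-D latent line can cover globally), some in-distribution points will always have large error, while the network's interpolation prior can give genuinely rare events a small one.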