Anomaly estimation, or the problem of finding a subset of a dataset that differs from the rest of the dataset, is a classic problem in machine learning and data mining. In both theoretical work and in applications, the anomaly is assumed to have a specific structure defined by membership in an $\textit{anomaly family}$. For example, in temporal data the anomaly family may be time intervals, while in network data the anomaly family may be connected subgraphs. The most prominent approach for anomaly estimation is to compute the Maximum Likelihood Estimator (MLE) of the anomaly; however, it was recently observed that for normally distributed data, the MLE is a $\textit{biased}$ estimator for some anomaly families. In this work, we demonstrate that in the normal means setting, the bias of the MLE depends on the size of the anomaly family. We prove that if the number of sets in the anomaly family that contain the anomaly is sub-exponential, then the MLE is asymptotically unbiased. We also provide empirical evidence that the converse is true: if the number of such sets is exponential, then the MLE is asymptotically biased. Our analysis unifies a number of earlier results on the bias of the MLE for specific anomaly families. Next, we derive a new anomaly estimator using a mixture model, and we prove that our anomaly estimator is asymptotically unbiased regardless of the size of the anomaly family. We illustrate the advantages of our estimator versus the MLE on disease outbreak and highway traffic data.
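To make the normal means setting concrete, the following is a minimal sketch, not the authors' implementation, of the MLE when the anomaly family is the set of intervals. It assumes observations $X_i \sim N(\mu, 1)$ for $i$ in the anomaly and $X_i \sim N(0, 1)$ otherwise, with $\mu > 0$ unknown. Under that assumption, profiling out $\mu$ reduces the MLE to the set $S$ in the family that maximizes $\sum_{i \in S} X_i / \sqrt{|S|}$. The function name `interval_mle` and the brute-force enumeration are illustrative choices.

```python
import numpy as np

def interval_mle(x):
    """Return (start, stop) of the half-open interval x[start:stop] that
    maximizes sum(x[start:stop]) / sqrt(stop - start).

    Assumption (not from the source): unit-variance normal data with an
    unknown positive mean shift inside the anomaly, so this score is the
    profile likelihood criterion for the anomaly set.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    # prefix[k] holds sum(x[:k]), so any interval sum is a difference.
    prefix = np.concatenate(([0.0], np.cumsum(x)))
    best_score, best = -np.inf, (0, 0)
    for start in range(n):
        for stop in range(start + 1, n + 1):
            score = (prefix[stop] - prefix[start]) / np.sqrt(stop - start)
            if score > best_score:
                best_score, best = score, (start, stop)
    return best

if __name__ == "__main__":
    # Hypothetical simulation: anomaly on indices 40..59 with mean shift 1.0.
    rng = np.random.default_rng(0)
    x = rng.normal(size=100)
    x[40:60] += 1.0
    start, stop = interval_mle(x)
    print("estimated interval:", (start, stop), "size:", stop - start)
```

Repeating the simulation over many draws and comparing the average estimated size `stop - start` with the true size is one way to probe the bias discussed above. The family of intervals has only quadratically many members, so exhaustive search is feasible here; richer families such as connected subgraphs would need a different search strategy.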