A common assumption of most clustering methods is that the training data and future data are drawn from the same distribution. However, this assumption may not hold in some real-world scenarios. In this paper, we propose an importance sampling based deterministic annealing approach (ISDA) for clustering problems, which minimizes the worst case of the expected distortion under a constraint on distribution deviation. The distribution deviation constraint can be converted to a constraint over a set of weight distributions, derived from importance sampling and centered on the uniform distribution. The objective of the proposed approach is to minimize the loss under maximum degradation; the resulting problem is therefore a constrained minimax optimization problem, which can be reformulated as an unconstrained problem using the Lagrange method and solved by a quasi-Newton algorithm. Experimental results on synthetic datasets and a real-world load forecasting problem validate the effectiveness of the proposed ISDA. Furthermore, we show that fuzzy c-means is a special case of ISDA with the logarithmic distortion. This observation sheds new light on the relationship between fuzzy c-means and deterministic annealing clustering algorithms and provides an interesting physical and information-theoretical interpretation of the fuzzy exponent $m$.
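The inner "worst case" step of such a minimax problem can be illustrated with a minimal sketch. It assumes, purely for illustration, that the distribution deviation is measured by the KL divergence from the uniform weights and enters as a Lagrangian penalty with multiplier `temperature`; the abstract does not specify the deviation measure. Under that assumption the maximizing weights have the Gibbs form $w_i \propto \exp(d_i / T)$. The function names, the toy data, and the nearest-center distortion are hypothetical choices, and the outer minimization over cluster centers is omitted.

```python
import numpy as np

def worst_case_weights(distortions, temperature):
    """Weights maximizing sum_i w_i * d_i - temperature * KL(w || uniform).

    Assumption (not from the paper): KL divergence as the deviation measure.
    The maximizer then has the Gibbs form w_i proportional to exp(d_i / temperature).
    """
    d = np.asarray(distortions, dtype=float)
    z = (d - d.max()) / temperature  # shift by the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

def squared_distortions(points, centers):
    """Squared Euclidean distance from each point to its nearest center."""
    diffs = points[:, None, :] - centers[None, :, :]
    return (diffs ** 2).sum(axis=2).min(axis=1)

# Toy data: three points near the centers and one outlier.
points = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [5.0, 5.0]])
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
d = squared_distortions(points, centers)

# A very high temperature keeps the weights close to uniform.
w_high = worst_case_weights(d, temperature=1e6)
# A low temperature concentrates the weight on the worst-fitted point.
w_low = worst_case_weights(d, temperature=1.0)

print("distortions:", d)
print("high-temperature weights:", w_high)
print("low-temperature weights:", w_low)
```

In this sketch, lowering `temperature` lets the adversarial weights drift further from uniform, so the weighted distortion is dominated by the poorly fitted points. A full method would alternate this step with an update of the centers.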