In this study, we examine a clustering problem in which the covariates of each element in a dataset carry an uncertainty specific to that element. More specifically, we consider a clustering approach in which a pre-processing step applies a non-linear transformation to the covariates to capture the hidden data structure. To this end, we empirically approximate the sets that represent the uncertainty propagated to the pre-processed features. To exploit these empirical uncertainty sets, we propose a greedy and optimistic clustering (GOC) algorithm that searches such sets for better feature candidates, yielding more condensed clusters. As an important application, we apply the GOC algorithm to synthetic datasets of the orbital properties of stars, generated through our numerical simulation mimicking the formation process of the Milky Way. The GOC algorithm demonstrates improved performance in finding sibling stars originating from the same dwarf galaxy. These realistic datasets have also been made publicly available.
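The idea of clustering over empirical uncertainty sets can be illustrated with a minimal sketch. It is not the GOC algorithm itself, whose details are not given in this abstract; it is a k-means-style stand-in built on several assumptions: Gaussian per-element noise, Monte Carlo sampling to approximate the propagated uncertainty sets, and a greedy step that lets each element pick the candidate feature closest to any cluster centre. The function names `empirical_uncertainty_sets` and `optimistic_kmeans`, and all parameters, are hypothetical.

```python
import numpy as np


def empirical_uncertainty_sets(x, sigma, transform, n_samples=50, rng=None):
    """Approximate the propagated uncertainty set of each element by sampling.

    x, sigma  : arrays of shape (n, d); observed covariates and their
                per-element standard deviations (Gaussian noise is assumed).
    transform : callable mapping an (N, d) array to an (N, p) feature array.
    Returns an array of shape (n, n_samples, p) of candidate features.
    """
    rng = np.random.default_rng(rng)
    n, d = x.shape
    samples = x[:, None, :] + rng.normal(size=(n, n_samples, d)) * sigma[:, None, :]
    samples[:, 0, :] = x  # keep the nominal observation as candidate 0
    feats = transform(samples.reshape(-1, d))
    return feats.reshape(n, n_samples, -1)


def _assign(sets, centers):
    """Greedy, optimistic step: best (candidate, cluster) pair per element."""
    n, m, _ = sets.shape
    k = centers.shape[0]
    # Squared distances from every candidate to every centre: shape (n, m, k).
    d2 = ((sets[:, :, None, :] - centers[None, None, :, :]) ** 2).sum(axis=-1)
    flat = d2.reshape(n, m * k).argmin(axis=1)
    cand_idx, labels = np.unravel_index(flat, (m, k))
    return labels, sets[np.arange(n), cand_idx]


def optimistic_kmeans(sets, k, n_iter=20, rng=None):
    """Alternate optimistic assignment and centre updates (k-means style)."""
    rng = np.random.default_rng(rng)
    n = sets.shape[0]
    # Initialise centres from the nominal features of k random elements.
    centers = sets[rng.choice(n, size=k, replace=False), 0, :].copy()
    for _ in range(n_iter):
        labels, chosen = _assign(sets, centers)
        for j in range(k):
            mask = labels == j
            if mask.any():  # leave an empty cluster's centre unchanged
                centers[j] = chosen[mask].mean(axis=0)
    labels, chosen = _assign(sets, centers)  # final assignment to final centres
    return labels, chosen, centers
```

A call might look like `sets = empirical_uncertainty_sets(x, sigma, transform)` followed by `labels, chosen, centers = optimistic_kmeans(sets, k=3)`, where `transform` is whatever non-linear pre-processing is applied to the covariates. Because the nominal feature is always among the candidates, the selected feature of each element is never farther from its centre than the nominal one is from its nearest centre, which is the sense in which this sketch produces more condensed clusters.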