This research proposes a data segmentation algorithm that combines t-SNE, DBSCAN, and a Random Forest classifier into an end-to-end pipeline. The pipeline separates data into natural clusters and produces a characteristic profile of each cluster based on the most important features. Cluster labels can be inferred for out-of-sample points, and the technique generalizes well on real data sets. We describe the algorithm and provide case studies using the Iris and MNIST data sets, as well as real social media data from Instagram. This work is a proof of concept and sets the stage for further in-depth theoretical analysis.
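The pipeline described in the abstract can be sketched with scikit-learn on the Iris data. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `segment`, the percentile heuristic for choosing the DBSCAN `eps`, the use of per-cluster feature means as the "profile", and all parameter values are hypothetical choices made for this example. The abstract does not say how out-of-sample labels are inferred; the sketch assumes this is done by the trained Random Forest.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.manifold import TSNE
from sklearn.neighbors import NearestNeighbors


def segment(X, feature_names, min_samples=5, top_k=2, seed=0):
    """Hypothetical sketch: t-SNE embedding -> DBSCAN clusters -> Random Forest."""
    # Step 1: embed the data in two dimensions with t-SNE.
    embedding = TSNE(n_components=2, random_state=seed).fit_transform(X)

    # Step 2: cluster the embedding with DBSCAN.
    # Assumption: eps is set to the 90th percentile of the distance to the
    # min_samples-th nearest neighbor (self included), so most points are
    # core points. The source does not specify how eps is chosen.
    nn = NearestNeighbors(n_neighbors=min_samples).fit(embedding)
    kth_dist = nn.kneighbors(embedding)[0][:, -1]
    eps = float(np.percentile(kth_dist, 90))
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(embedding)

    # Step 3: train a Random Forest on the ORIGINAL features to predict the
    # cluster label, skipping points DBSCAN marked as noise (label -1).
    keep = labels != -1
    forest = RandomForestClassifier(n_estimators=200, random_state=seed)
    forest.fit(X[keep], labels[keep])

    # Step 4: profile each cluster by the mean of the top_k most important
    # features (an assumed, simple form of "characteristic profile").
    top = np.argsort(forest.feature_importances_)[::-1][:top_k]
    profiles = {}
    for c in np.unique(labels[keep]):
        members = X[labels == c]
        profiles[int(c)] = {
            feature_names[i]: float(members[:, i].mean()) for i in top
        }
    return labels, forest, profiles


iris = load_iris()
labels, forest, profiles = segment(iris.data, list(iris.feature_names))
print(profiles)

# Out-of-sample inference: the forest maps raw feature rows to cluster labels,
# so new points do not need to be re-embedded with t-SNE.
print(forest.predict(iris.data[:3]))
```

Training the classifier on the original features rather than on the embedding is the design choice that makes out-of-sample inference possible in this sketch, since t-SNE has no transform for unseen points.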