When approaching a clustering problem, choosing the right algorithm and parameters is essential, as each clustering algorithm is proficient at finding clusters of a particular nature. Because clustering is unsupervised, no ground-truth values are available for empirical evaluation, which makes it difficult to automate parameter selection through hyperparameter tuning. Previous approaches to hyperparameter tuning for clustering algorithms have relied either on internal metrics, which are often biased towards certain algorithms, or on the availability of some ground-truth labels, which moves the problem into the semi-supervised space. This preliminary study proposes a framework for semi-automated hyperparameter tuning for clustering problems, using a grid search to produce a series of graphs and easy-to-interpret metrics that can then be used for more efficient domain-specific evaluation. Preliminary results show that internal metrics are unable to capture the semantic quality of the resulting clusters, and that approaches driven by internal metrics would reach different conclusions from those driven by manual evaluation.
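As a rough illustration of the kind of grid search described above, the following sketch runs a clustering algorithm over a small parameter grid and records, for each run, one internal metric and one easy-to-read summary. The abstract does not name the algorithms, metrics, or grid used in the study, so everything concrete here is an assumption made for illustration: plain k-means, the mean silhouette coefficient as the internal metric, cluster sizes as the summary, the synthetic three-blob data, and the helper names `kmeans`, `silhouette`, and `grid_search`.

```python
import math
import random
from itertools import product


def kmeans(points, k, seed, iters=50):
    """Plain Lloyd's algorithm (assumed stand-in); returns one label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialise from k distinct data points
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient, an internal metric in [-1, 1]."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        return float("nan")  # undefined for a single cluster
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for m, other in clusters.items()
            if m != l
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


def grid_search(points, grid):
    """Run every parameter combination and collect one summary row per run."""
    rows = []
    for k, seed in product(grid["k"], grid["seed"]):
        labels = kmeans(points, k, seed)
        sizes = sorted((labels.count(c) for c in set(labels)), reverse=True)
        rows.append(
            {"k": k, "seed": seed, "silhouette": silhouette(points, labels), "sizes": sizes}
        )
    return rows


# Synthetic data (an assumption): three well-separated 2-D blobs of 30 points each.
data_rng = random.Random(0)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [
    (data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
    for cx, cy in blob_centers
    for _ in range(30)
]

results = grid_search(points, {"k": [2, 3, 4, 5], "seed": [0, 1]})
for row in results:
    print(f"k={row['k']} seed={row['seed']} "
          f"silhouette={row['silhouette']:.3f} sizes={row['sizes']}")
```

In a semi-automated workflow, the rows printed here would be the starting point for manual, domain-specific review rather than the final answer: the run with the highest silhouette score is not necessarily the one a domain expert would judge most meaningful. Plotting is omitted to keep the sketch dependent on the standard library only.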