Distance-based clustering and classification are widely used in various fields to group mixed numeric and categorical data. A predefined distance measurement is used to cluster data points based on their dissimilarity. While there exist numerous distance-based measures for data with pure numerical attributes and several ordered and unordered categorical metrics, an optimal distance for mixed-type data is an open problem. Many metrics convert numerical attributes to categorical ones or vice versa. They handle the data points as a single attribute type or calculate a distance between each attribute separately and add them up. We propose a metric that uses mixed kernels to measure dissimilarity, with cross-validated optimal kernel bandwidths. Our approach improves clustering accuracy when utilized for existing distance-based clustering algorithms on simulated and real-world datasets containing pure continuous, categorical, and mixed-type data.
翻译:暂无翻译