In text analysis, Spherical K-means (SKM) is a specialized k-means clustering algorithm widely utilized for grouping documents represented in high-dimensional, sparse term-document matrices, often normalized using techniques like TF-IDF. Researchers frequently seek to cluster not only documents but also the terms associated with them into coherent groups. To address this dual clustering requirement, we introduce Spherical Double K-Means (SDKM), a novel methodology that simultaneously clusters documents and terms. This approach offers several advantages: first, by integrating the clustering of documents and terms, SDKM provides deeper insights into the relationships between content and vocabulary, enabling more effective topic identification and keyword extraction. Additionally, the two-level clustering assists in understanding both overarching themes and specific terminologies within document clusters, enhancing interpretability. SDKM effectively handles the high dimensionality and sparsity inherent in text data by utilizing cosine similarity, leading to improved computational efficiency. Moreover, the method captures dynamic changes in thematic content over time, making it well-suited for applications in rapidly evolving fields. Ultimately, SDKM presents a comprehensive framework for advancing text mining efforts, facilitating the uncovering of nuanced patterns and structures that are critical for robust data analysis. We apply SDKM to the corpus of US presidential inaugural addresses, spanning from George Washington in 1789 to Joe Biden in 2021. Our analysis reveals distinct clusters of words and documents that correspond to significant historical themes and periods, showcasing the method's ability to facilitate a deeper understanding of the data. Our findings demonstrate the efficacy of SDKM in uncovering underlying patterns in textual data.
翻译:暂无翻译