The Stack Overflow (SO) platform hosts a huge dataset of questions and answers driven by interactions between users, but the number of unanswered questions is rising continuously. This issue is common across community Question & Answering (Q&A) platforms such as Yahoo, Quora and so on. Clustering is one of the approaches these communities use to address the challenge. Specifically, intent-based clustering can be leveraged to answer unanswered questions using answered questions in the same cluster, and can also improve the response time for new questions. To this end, we propose SOCluster, an approach and a tool that clusters SO questions by intent using a graph-based clustering approach. We selected four datasets of 10k, 20k, 30k and 40k SO questions that contain no code snippets or images, and performed intent-based clustering on them. We carried out a preliminary evaluation of the tool by analyzing the resulting clusters with three commonly used metrics: the Silhouette coefficient, the Calinski-Harabasz Index, and the Davies-Bouldin Index. We performed clustering at 8 different similarity thresholds and analyzed the trends that the output clusters show across the three evaluation metrics. A similarity threshold of 90% gives the best values of all three metrics on all four datasets. The source code and tool are available for download on GitHub at https://github.com/Liveitabhi/SOCluster, and the demo can be found at https://youtu.be/uyn8ie4h3NY.
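The idea of threshold-driven, graph-based clustering can be illustrated with a minimal sketch. It is not the SOCluster implementation: the abstract does not specify the similarity measure or the graph algorithm, so this sketch assumes bag-of-words cosine similarity and treats the connected components of the thresholded similarity graph as clusters. The names `vectorize`, `cosine` and `cluster_by_threshold` are hypothetical.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Lowercased bag-of-words counts (an assumed stand-in for an intent representation)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_by_threshold(questions, threshold):
    """Link questions whose similarity is >= threshold; return connected components."""
    vectors = [vectorize(q) for q in questions]
    parent = list(range(len(questions)))  # union-find forest over question indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # An edge exists between i and j when their similarity reaches the threshold.
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(questions)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


if __name__ == "__main__":
    questions = [
        "How do I reverse a list in Python",
        "How do I reverse a list in Python quickly",
        "What is a segmentation fault in C",
    ]
    for threshold in (0.2, 0.9, 0.95):
        print(threshold, cluster_by_threshold(questions, threshold))
```

Raising the threshold removes edges from the graph, so clusters split into smaller and tighter groups; lowering it merges them. This is the trade-off that the three evaluation metrics are used to assess across the 8 threshold values. The pairwise loop is quadratic in the number of questions, which is acceptable for a sketch but would likely need an index or blocking step at the 10k to 40k scale.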