The traditional Apriori algorithm can be used to cluster web documents through the association technique of data mining. However, the algorithm has several limitations, owing to its repeated database scans and its weak association rule analysis. With today's large databases, the efficiency of the traditional Apriori algorithm drops many times over. In this paper, we propose a modified Apriori approach that cuts down the repeated database scans and improves the association analysis of the traditional algorithm in order to cluster web documents. We then refine the resulting clusters by applying Fuzzy C-Means (FCM), K-Means, and Vector Space Model (VSM) techniques separately. For the experiments, we use the Classic3 and Classic4 datasets of Cornell University, which contain more than 10,000 documents, and run both the traditional Apriori algorithm and our modified approach on them. The experimental results show that our approach outperforms the traditional Apriori algorithm in terms of database scans and association analysis. We also find that FCM performs better than K-Means and VSM in terms of the F-measure of clusters of different sizes.
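To make the repeated-scan limitation concrete, the following is a minimal sketch of the traditional level-wise Apriori algorithm, not the modified approach proposed here. It assumes, purely for illustration, that each document is represented as a set of terms and treated as one transaction; the function name `apriori`, the toy `docs` collection, and the `min_support` threshold are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Classic level-wise Apriori.

    Returns ({frequent itemset: support count}, number of counting scans).
    """
    transactions = [frozenset(t) for t in transactions]
    # Level-1 candidates: every distinct item. In practice this collection
    # step is folded into the first counting scan, so it is not counted here.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    scans = 0
    k = 1
    while candidates:
        # One full pass over the database per level: the repeated-scan cost.
        scans += 1
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every k-item subset of a candidate must be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent, scans


# Hypothetical toy collection: each document is a set of terms.
docs = [
    {"web", "mining", "cluster"},
    {"web", "mining"},
    {"web", "cluster"},
    {"mining", "cluster"},
    {"web", "mining", "cluster"},
]
frequent_itemsets, scans = apriori(docs, min_support=3)
print(scans)  # 3 counting passes over the database
```

On this toy input the algorithm needs one pass per itemset size, including a final pass that only discovers the single 3-term candidate is infrequent. That per-level cost is what grows expensive on a large document collection.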