Enhanced Distributed Text Document Clustering Based on Semantics

Distributed text document clustering is an emerging area that is used to improve the quality of information retrieval and document organization in digital libraries. Enormous amounts of data are available in large-scale networks, so it is difficult to cluster data from a centralized location. A wide variety of distributed text document clustering algorithms are available for analyzing data from distributed sources. An enhanced distributed text document clustering algorithm (DEKLSI) is proposed that combines an enhanced K-Means algorithm with Latent Semantic Indexing (LSI) to increase the quality and accuracy of the clustering. LSI clusters the documents based on semantics, which addresses the problems of synonymy and polysemy. The performance of the enhanced algorithm is compared and analyzed against the clustering algorithm without semantics. The experiments are evaluated on two document datasets, 20NG and Reuters. The results show improvements in clustering quality and execution time, and thereby in accuracy.
Enhanced K-Means; Semantics; LSI; Distributed Document Clustering; Synonymy; Polysemy
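
To illustrate the general idea of clustering in an LSI space, the following is a minimal single-node sketch and not the DEKLSI algorithm itself. It assumes NumPy is available, uses a toy term-document matrix invented for illustration, and substitutes plain Lloyd's K-Means with a simple farthest-point seeding for the enhanced K-Means. The helper names `lsi_embed` and `kmeans` are hypothetical.

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents); illustrative data only.
# Documents 0 and 1 use the synonyms "car" and "automobile" and share only "engine".
TERMS = ["car", "automobile", "engine", "apple", "banana", "fruit"]
term_doc = np.array([
    [1, 0, 0, 0],  # car
    [0, 1, 0, 0],  # automobile
    [1, 1, 0, 0],  # engine
    [0, 0, 1, 0],  # apple
    [0, 0, 0, 1],  # banana
    [0, 0, 1, 1],  # fruit
], dtype=float)

def lsi_embed(matrix, rank):
    """Project documents into a rank-dimensional latent space via truncated SVD."""
    _, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Each row of the result is one document in the latent space.
    return (np.diag(s[:rank]) @ vt[:rank]).T

def kmeans(points, k, iters=20):
    """Plain Lloyd's K-Means with deterministic farthest-point seeding."""
    centers = [points[0]]
    while len(centers) < k:
        # Distance from every point to its nearest already-chosen center.
        nearest = np.min(
            [np.linalg.norm(points - c, axis=1) for c in centers], axis=0
        )
        centers.append(points[int(np.argmax(nearest))])
    centers = np.array(centers)
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

doc_vectors = lsi_embed(term_doc, rank=2)
labels = kmeans(doc_vectors, k=2)
print(labels)
```

In this toy case the two vehicle documents fall into one cluster and the two fruit documents into the other, even though "car" and "automobile" never co-occur in a document. In a distributed setting each node would run such a step on its local documents, and how the local results are combined is outside the scope of this sketch.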

