Enhanced Distributed Text Document Clustering Based on Semantics



Abstract


Distributed text document clustering is an emerging area used to improve the quality of information retrieval and document organization in digital libraries. Enormous amounts of data are available in large-scale networks, so it is difficult to cluster the data from a centralized location. A wide variety of distributed text document clustering algorithms are available for analyzing data from distributed sources. This paper proposes an enhanced distributed text document clustering algorithm (DEKLSI) that combines an enhanced K-Means algorithm with Latent Semantic Indexing (LSI) to increase clustering quality and accuracy. LSI clusters documents based on their semantics, which addresses the problems of synonymy and polysemy. The results show improved clustering quality and execution time, and thereby improved accuracy. The performance of the enhanced algorithm is compared with that of the same clustering algorithm without semantics. The experiments use two document datasets: 20NG and Reuters.
Copyright © 2013 Praise Worthy Prize - All rights reserved.
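
The combination described in the abstract, LSI followed by K-Means, can be illustrated with a small single-machine sketch. This is not the DEKLSI algorithm itself: the abstract does not specify the "enhanced" K-Means step or the distributed protocol, so plain Lloyd's K-Means with a deterministic farthest-first initialization is swapped in, raw term counts stand in for whatever weighting the paper uses, and the toy corpus and all function names (`term_document_matrix`, `lsi_embed`, `kmeans`) are hypothetical. The latent space is obtained from the top eigenpairs of the document Gram matrix, which is equivalent to a truncated SVD of the term-document matrix.

```python
import math
from collections import Counter

def term_document_matrix(docs):
    """Rows are terms, columns are documents; entries are raw term counts."""
    vocab = sorted({w for d in docs for w in d.split()})
    counts = [Counter(d.split()) for d in docs]
    return [[c[t] for c in counts] for t in vocab]

def top_eigenpairs(G, k, iters=300):
    """Top-k eigenpairs of a symmetric PSD matrix (power iteration + deflation)."""
    n = len(G)
    G = [row[:] for row in G]  # work on a copy, since deflation mutates it
    pairs = []
    for _ in range(k):
        v = [1.0 / math.sqrt(n)] * n
        for _step in range(iters):
            w = [sum(G[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the converged vector.
        lam = sum(v[i] * sum(G[i][j] * v[j] for j in range(n)) for i in range(n))
        pairs.append((lam, v))
        for i in range(n):
            for j in range(n):
                G[i][j] -= lam * v[i] * v[j]
    return pairs

def lsi_embed(docs, k):
    """Document coordinates in a k-dimensional latent space.

    For A = U S V^T, the eigenpairs of A^T A are (S^2, V), so document i
    has coordinates S * V[i] in the latent space.
    """
    A = term_document_matrix(docs)
    n = len(docs)
    G = [[sum(row[i] * row[j] for row in A) for j in range(n)] for i in range(n)]
    pairs = top_eigenpairs(G, k)
    return [[math.sqrt(max(lam, 0.0)) * v[i] for lam, v in pairs] for i in range(n)]

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=50):
    """Plain Lloyd's K-Means with farthest-first initialization (an assumption)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sqdist(p, c) for c in centroids)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: sqdist(p, centroids[c])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical toy corpus: "car" and "automobile" act as synonyms.
docs = [
    "car engine wheel",
    "automobile engine wheel",
    "car automobile road",
    "apple banana fruit",
    "fruit juice apple",
    "banana juice fruit",
]
coords = lsi_embed(docs, k=2)
labels = kmeans(coords, k=2)
print(labels)
```

In this sketch the first two documents share no vehicle term ("car" versus "automobile") yet land at nearly the same latent coordinates because they co-occur with the same context words, which is the synonymy effect the abstract attributes to LSI. A distributed variant would additionally need a scheme for exchanging centroids or summaries between sites, which is not shown here.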

Keywords


Enhanced K-Means; Semantics; LSI; Distributed Document Clustering; Synonymy; Polysemy






