Effective Clustering of Text Documents in Low Dimension Space Using Semantic Association Among Terms

Nalluri Sivaram Prasad; K. Rajasekhara Rao

doi:10.15866/irecos.v10i5.5847

Nalluri Sivaram Prasad^(1*), K. Rajasekhara Rao⁽²⁾

^(*) Corresponding author

Abstract

The popular Vector Space Model represents documents as sparse, high-dimensional vectors, which results in poor clustering performance. Dimension reduction techniques produce dense, low-dimensional document representations that enhance clustering performance. This paper proposes a novel unsupervised filter method for feature selection. Filter methods assign weights to the terms used to represent the documents in a collection according to a criterion that is independent of the clustering task. Unsupervised feature selection methods do not use class labels to guide the selection of features. The proposed method assigns each term a score proportional to the term's overall semantic association with the rest of the terms in the document collection. The overall semantic association of a term is estimated from the co-occurrence frequencies of the term with the other terms in the collection. Clustering results on three benchmark text data sets, TDT2, Reuters-21578 and 20 Newsgroups, show that the proposed method selects features that are more discriminative in separating the intrinsic classes of documents than those selected by existing unsupervised filter-based feature selection methods.
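The scoring idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact estimator, so the code below rests on assumptions: co-occurrence is counted once per document for each pair of distinct terms, a term's score is the plain sum of its co-occurrence counts with all other terms, and the top-k terms are kept. The function names `cooccurrence_scores` and `select_terms` and the toy corpus are hypothetical and do not come from the paper.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_scores(docs):
    """Score each term by its summed co-occurrence with all other terms.

    ASSUMPTION: a pair of distinct terms co-occurs once for every document
    that contains both; the paper's actual estimator may differ.
    """
    pair_counts = Counter()
    for doc in docs:
        # Use the set of terms so repeated words in one document count once.
        for a, b in combinations(sorted(set(doc)), 2):
            pair_counts[(a, b)] += 1

    scores = Counter()
    for (a, b), n in pair_counts.items():
        # Each co-occurrence contributes to the score of both terms.
        scores[a] += n
        scores[b] += n
    return scores


def select_terms(docs, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    scores = cooccurrence_scores(docs)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


# Toy corpus of already tokenized documents (hypothetical data).
docs = [
    ["stock", "market", "trade"],
    ["stock", "market", "price"],
    ["market", "price", "oil"],
    ["game", "team"],
]

top_terms = select_terms(docs, 3)
print(top_terms)
```

In this toy corpus, "market" co-occurs with the most terms and so ranks first. The selected terms would then define the reduced vocabulary for the document vectors passed to a clustering algorithm; no class labels are used at any point, which is what makes the filter unsupervised.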

Keywords

Filter Method; Co-occurrence Frequency; Semantic Association; Term; Text Clustering; Unsupervised Feature Selection
