Subspace Clustering of Text Documents Using Collection and Document Frequencies of Terms

N. Sivaram Prasad; K. Rajasekhara Rao

doi:10.15866/irecos.v9i10.3894

Corresponding author: N. Sivaram Prasad

Abstract

The most widely used document representation model is the Vector Space Model. The high dimensionality and sparseness of this representation lead to poor clustering performance and demand more computational effort for clustering. Hence, dimension reduction techniques are used to find a feature subspace for document representation that could enhance clustering performance. This paper proposes a novel unsupervised filter method for feature selection. Feature selection methods represent documents using a subset of the original feature set that maximizes the separation among classes of documents in the collection. Filter methods analyze the intrinsic properties of the documents and select highly ranked features according to some criterion that is independent of the clustering task. Unsupervised feature selection methods do not use class labels to guide the selection of features. The proposed method assigns a score to a term using its collection and document frequencies. The collection frequency of a term is the number of times it appears in the document collection, and its document frequency is the number of documents in which it appears. Empirical evaluations showed that the proposed method is not only effective in selecting features that give the best clustering performance, but also less computationally complex than other unsupervised feature selection methods.
Copyright © 2014 Praise Worthy Prize - All rights reserved.
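
The two quantities the abstract defines can be illustrated with a minimal sketch. It assumes documents are already tokenized into lists of terms; the function name `term_frequencies` and the toy collection are hypothetical and not taken from the paper. The abstract does not state how the two frequencies are combined into a score, so the sketch stops at computing them.

```python
from collections import Counter


def term_frequencies(docs):
    """Return (collection_frequency, document_frequency) for tokenized documents.

    Hypothetical helper for illustration only; not the authors' code.
    """
    cf = Counter()  # total number of occurrences of each term in the collection
    df = Counter()  # number of documents in which each term appears
    for tokens in docs:
        cf.update(tokens)       # count every occurrence
        df.update(set(tokens))  # count each term at most once per document
    return cf, df


# Toy collection of three already-tokenized documents (assumed data).
docs = [
    ["cluster", "text", "cluster"],
    ["text", "feature"],
    ["cluster", "feature", "feature"],
]
cf, df = term_frequencies(docs)
print(cf["cluster"], df["cluster"])  # 3 2
```

Here "cluster" occurs three times in total but in only two documents, so its collection frequency is 3 and its document frequency is 2. Any term score built on these counts would need the formula from the full paper.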

Keywords

Clustering; Collection Frequency; Document Frequency; Term; Text Document; Unsupervised Feature Selection
