
Feature Selection Based on Multi Resolution Analysis of Text Documents for Effective Clustering




DOI: https://doi.org/10.15866/irea.v9i6.20782

Abstract


Users in the digital era are overwhelmed by large volumes of text collections, most of which lack class labels. Clustering is therefore the only feasible way to extract valuable insights from such data. Clustering text collections in a high-dimensional space is inefficient, and several unsupervised dimensionality reduction methods have been proposed in the literature to address this. Among them, feature selection methods are easy to interpret, and filter feature selection methods have proven scalable and efficient for high-dimensional datasets. The aim of this work is to propose an unsupervised univariate filter feature selection method for efficient clustering of very high-dimensional text datasets in a low-dimensional feature space. Wavelets are mathematical functions that can transform a signal into the space-frequency or time-frequency domain, so that the signal can be analyzed at different resolutions of the transformed domain. Wavelet transforms are efficient at identifying the transients of a signal. Here, the stationary wavelet transform with the Symlet of order 2 is used to identify the most discriminant features of text documents for efficient clustering in a low-dimensional feature space. The proposed feature selection method is compared with nine other relevant methods in terms of the quality of the clustering solution on seven real text document collections. On four of the seven datasets, the proposed method identified the most discriminative features, resulting in the best peak clustering performance, which exceeded the performance obtained with all features, using at most 1.5% of the top-rated features.
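The abstract does not state how the wavelet coefficients are turned into a feature score, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes each term's per-document counts form a signal, computes level-1 undecimated (stationary) detail coefficients with the order-2 Symlet filter (whose coefficients coincide with Daubechies-2 up to ordering conventions), uses periodic boundary extension, and ranks terms by detail energy. The names `swt_detail_level1`, `detail_energy`, and `rank_terms`, as well as the energy-based score itself, are hypothetical.

```python
import math

SQRT3 = math.sqrt(3.0)
NORM = 4.0 * math.sqrt(2.0)

# Order-2 Symlet / Daubechies-2 scaling (low-pass) filter coefficients.
LOW_PASS = [
    (1 + SQRT3) / NORM,
    (3 + SQRT3) / NORM,
    (3 - SQRT3) / NORM,
    (1 - SQRT3) / NORM,
]
# Quadrature-mirror high-pass filter: g[k] = (-1)^k * h[L-1-k].
HIGH_PASS = [
    ((-1) ** k) * LOW_PASS[len(LOW_PASS) - 1 - k] for k in range(len(LOW_PASS))
]


def swt_detail_level1(signal):
    """Level-1 detail coefficients without decimation (periodic extension)."""
    n = len(signal)
    return [
        sum(HIGH_PASS[k] * signal[(i + k) % n] for k in range(len(HIGH_PASS)))
        for i in range(n)
    ]


def detail_energy(signal):
    """Assumed univariate score: energy of the level-1 detail coefficients."""
    return sum(c * c for c in swt_detail_level1(signal))


def rank_terms(term_doc_counts, top_k):
    """Return the top_k terms by detail energy (hypothetical ranking rule)."""
    scores = {term: detail_energy(counts) for term, counts in term_doc_counts.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy data: a term spread evenly over documents vs. one concentrated in a single document.
toy_counts = {
    "the": [5, 5, 5, 5, 5, 5],
    "wavelet": [0, 0, 9, 0, 0, 0],
}
top_terms = rank_terms(toy_counts, top_k=1)
```

Under this assumed score, a term that occurs uniformly across documents receives a near-zero value because the high-pass filter sums to zero, while a term concentrated in a few documents produces large detail coefficients. Deeper decomposition levels, a different boundary treatment, or another scoring statistic would each change the ranking.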
Copyright © 2021 Praise Worthy Prize - All rights reserved.

Keywords


Feature Selection; Unsupervised; Filter; Text Clustering; Multi Resolution Analysis



