A Document Level Measure for Text Categorization



Abstract


Term weighting is a vital step in automatic text categorization. Previous studies have shown that the term weighting technique contributes more to classification accuracy than the choice of classifier does. This work therefore concentrates on term weighting and proposes a new supervised term weighting scheme for text categorization. The frequency of each term in a document is expressed as the probability of that term in the document, which gives the proportion of the document that the term accounts for. This proportion carries useful information about the category of the document. Summing the probability of a term over all the documents of a class yields a variable that can be used for term weighting in classification. It is a document-level variable because it is derived from the probability of a term within a document. The new measure is named td (terms in a document). When evaluated on the Reuters-21578 and 20 Newsgroups datasets, it showed an increase in performance compared with tf, idf and rf. Compared with rf, the measure works well with both SVM (a binary classifier) and centroid-based classifiers (multiclass classifiers).
Copyright © 2013 Praise Worthy Prize - All rights reserved.
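
The two quantities described in the abstract can be illustrated with a minimal Python sketch. It assumes documents are already tokenized, and the function names and toy data are hypothetical. The abstract does not state how the class-level sum is turned into the final td weight, so the sketch stops at the sum and should not be read as the authors' formula.

```python
from collections import Counter, defaultdict


def term_probabilities(tokens):
    """Relative frequency of each term within one tokenized document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: n / total for term, n in counts.items()}


def class_probability_sums(docs, labels):
    """Sum each term's within-document probability over the documents of each class.

    Assumption: this is only the intermediate document-level variable the
    abstract describes, not the complete td weighting scheme.
    """
    sums = defaultdict(lambda: defaultdict(float))
    for tokens, label in zip(docs, labels):
        for term, p in term_probabilities(tokens).items():
            sums[label][term] += p
    return {label: dict(terms) for label, terms in sums.items()}


# Toy corpus (hypothetical): two documents in one class, one in another.
docs = [
    ["oil", "price", "oil", "market"],
    ["oil", "export"],
    ["match", "goal", "match", "team"],
]
labels = ["crude", "crude", "sport"]

sums = class_probability_sums(docs, labels)
print(sums["crude"]["oil"])
```

In this toy corpus, "oil" accounts for half of each "crude" document, so its class-level sum is larger than that of a term such as "price", which appears once in a single document.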

Keywords


Classifier; Document; Feature Selection; Term Weighting; Text Categorization




