XML Document Classification by Frequent Itemset Mining on Menonym Tree


Abstract


XML is used as a universal format for documents on the web because a mark-up language defined in XML places no restriction on the number of tags an application may define. The flexibility to create user-defined tags enables smart searches over large data collections. The structure of XML also supports more sophisticated proximity measures: the distance between the last word of one element and the first word of the next element is treated as greater than the distance between adjacent words within the same element, even though their physical proximity in the document is similar. As large numbers of XML documents are used on the web, classification is needed for efficient data retrieval. This paper addresses document classification for semi-structured XML data. It proposes to exploit the structural features of XML data and to construct a weighted term frequency feature vector using frequent itemset mining on a metonymy tree. Two classifiers, the Naive Bayesian classifier and the K-Nearest Neighbour classifier, are used to classify the extracted features. The Reuters dataset is used for evaluation, and the performance of the two classifiers is compared.
Copyright © 2014 Praise Worthy Prize - All rights reserved.
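
The pipeline outlined in the abstract (structure-weighted term frequencies, frequent itemset mining, then classification) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not describe how the metonymy tree is built, so that step is omitted here. The tag weights, the toy documents, and all function names are hypothetical, and a simple cosine-similarity K-Nearest Neighbour vote stands in for the classification step.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter
from itertools import combinations

# Hypothetical per-tag weights; the abstract does not state the actual values.
TAG_WEIGHTS = {"title": 3.0, "body": 1.0}

def weighted_tf(xml_text):
    """Return {term: weighted frequency}, weighting each term by its enclosing tag."""
    tf = Counter()
    for elem in ET.fromstring(xml_text).iter():
        weight = TAG_WEIGHTS.get(elem.tag, 1.0)
        for term in (elem.text or "").lower().split():
            tf[term] += weight
    return tf

def frequent_itemsets(transactions, min_support, max_size=2):
    """Level-wise (Apriori-style) search for term sets found in >= min_support documents."""
    items = sorted({t for tr in transactions for t in tr})
    frequent = {}
    for size in range(1, max_size + 1):
        found = False
        for cand in combinations(items, size):
            support = sum(1 for tr in transactions if tr.issuperset(cand))
            if support >= min_support:
                frequent[cand] = support
                found = True
        if not found:
            break
        # Only terms that appear in a frequent itemset of this size can extend it.
        items = sorted({t for c in frequent if len(c) == size for t in c})
    return frequent

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

def knn_predict(train, query, k=3):
    """Majority label among the k training vectors most similar to the query."""
    nearest = sorted(train, key=lambda pair: cosine(query, pair[0]), reverse=True)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy documents, invented for illustration only (not taken from the Reuters dataset).
DOCS = [
    ("<doc><title>oil prices</title><body>crude oil prices rise</body></doc>", "energy"),
    ("<doc><title>oil output</title><body>opec oil output falls</body></doc>", "energy"),
    ("<doc><title>wheat crop</title><body>wheat harvest grain exports</body></doc>", "grain"),
    ("<doc><title>grain exports</title><body>wheat grain shipments grow</body></doc>", "grain"),
]

vectors = [weighted_tf(xml_text) for xml_text, _ in DOCS]
itemsets = frequent_itemsets([set(v) for v in vectors], min_support=2)
vocab = {term for itemset in itemsets for term in itemset}

# Keep only terms that occur in some frequent itemset as features.
train = [({t: w for t, w in v.items() if t in vocab}, label)
         for v, (_, label) in zip(vectors, DOCS)]

query_tf = weighted_tf("<doc><title>wheat prices</title><body>grain market</body></doc>")
query = {t: w for t, w in query_tf.items() if t in vocab}
predicted = knn_predict(train, query, k=3)
```

Restricting the feature space to terms that occur in frequent itemsets is one plausible reading of "frequent itemset mining" as a feature selection step; the paper itself may combine the itemsets with the tree structure differently.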

Keywords


Document Classification; Metonymy Tree; Frequent Itemset Mining; Naive Bayesian Classifier; K-Nearest Neighbor Classifier

