An Optimized Feature Set Based on Genetic Algorithm for Business Web Pages Named Entity Recognition


DOI: https://doi.org/10.15866/ireaco.v9i5.9698

Abstract


Named Entity Recognition (NER) is the task of identifying proper nouns such as the names of people, corporations, and places, as well as dates. Recently, extracting information from web pages has attracted researchers' attention because of the valuable information such pages contain, most commonly named entities (NEs). Web pages, however, contain further entity types such as addresses, URLs, and telephone and fax numbers. Given the variety of features that have been used for NER, identifying the best feature set for extracting NEs from web pages is a challenging task. This paper proposes an optimized feature set, selected by a Genetic Algorithm (GA), for extracting NEs from web pages. The dataset was collected from business web pages. The feature set consists of text features such as n-grams and web features such as block position and font type. Finally, a Support Vector Machine (SVM) classifier was used to classify the NEs. The results show that the GA is able to identify the most accurate features.
Copyright © 2016 Praise Worthy Prize - All rights reserved.
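
The abstract describes the approach only at a high level, so the following is a minimal sketch of GA-based feature selection, not the authors' implementation. It evolves binary masks over a feature list with tournament selection, one-point crossover, bit-flip mutation, and elitism. The feature names beyond those mentioned in the abstract, the GA parameters, and `toy_fitness` are all assumptions made for illustration; in a real pipeline the fitness would presumably be the validation score of an SVM trained on the masked features.

```python
import random
from typing import Callable, List, Tuple

Mask = Tuple[int, ...]  # 1 = feature kept, 0 = feature dropped


def select_features(
    n_features: int,
    fitness: Callable[[Mask], float],
    pop_size: int = 20,
    generations: int = 30,
    mutation_rate: float = 0.05,
    seed: int = 0,
) -> Mask:
    """Return the best binary feature mask found by a simple generational GA."""
    rng = random.Random(seed)
    population: List[Mask] = [
        tuple(rng.randint(0, 1) for _ in range(n_features))
        for _ in range(pop_size)
    ]
    best = max(population, key=fitness)

    def tournament() -> Mask:
        # Pick the fitter of two randomly chosen individuals.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children: List[Mask] = [best]  # elitism: carry the best mask forward
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            cut = rng.randint(1, n_features - 1)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation: each bit flips with probability mutation_rate.
            child = tuple(
                (bit ^ 1) if rng.random() < mutation_rate else bit
                for bit in child
            )
            children.append(child)
        population = children
        best = max(population, key=fitness)
    return best


# Hypothetical feature list: n-grams, block position, and font type come from
# the abstract; the "noise_*" entries are invented placeholders.
FEATURE_NAMES = [
    "unigram", "bigram", "trigram", "block_position",
    "font_type", "noise_a", "noise_b", "noise_c",
]
INFORMATIVE = {0, 1, 3, 4}  # assumption: which features "help" in this toy


def toy_fitness(mask: Mask) -> float:
    """Stand-in for classifier accuracy: reward useful features, penalize size."""
    hits = sum(1 for i, bit in enumerate(mask) if bit and i in INFORMATIVE)
    return hits - 0.1 * sum(mask)


if __name__ == "__main__":
    mask = select_features(len(FEATURE_NAMES), toy_fitness)
    print([name for name, bit in zip(FEATURE_NAMES, mask) if bit])
```

Keeping the fitness function as a parameter means the same loop could be reused with any classifier score; elitism ensures the best mask found so far is never lost between generations.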

Keywords


Named Entity Recognition; Feature Selection; Genetic Algorithm; Support Vector Machine; Web Pages



