
An Optimized Feature Set Based on Genetic Algorithm for Business Web Pages Named Entity Recognition




Named Entity Recognition (NER) is the task of identifying proper nouns such as the names of people, corporations and places, as well as dates. Recently, extracting information from web pages has attracted researchers' attention because of the valuable information such pages contain, most commonly named entities (NEs). However, web pages also contain further entity types such as addresses, URLs, telephone numbers and fax numbers. Given the variety of features that have been used for NER, identifying the best feature set for extracting NEs from web pages is a challenging task. This paper proposes an optimized feature set for extracting NEs from web pages based on a Genetic Algorithm (GA). The dataset was collected from business web pages. The feature set consists of text features such as n-grams and web features such as block position and font type. Finally, a Support Vector Machine (SVM) classifier was used to classify the NEs. Results show that the Genetic Algorithm is able to identify the most accurate features.
Copyright © 2016 Praise Worthy Prize - All rights reserved.


Named Entity Recognition; Feature Selection; Genetic Algorithm; Support Vector Machine; Web Pages
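
The abstract describes selecting a feature subset with a Genetic Algorithm and scoring it with a classifier. The following is a minimal sketch of that general idea, not the authors' implementation: each candidate is a bit mask over the features, and a simple generational GA (tournament selection, one-point crossover, bit-flip mutation, elitism) searches for a high-scoring mask. The function names, the GA parameters, the "noise" features and the `toy_fitness` function are all assumptions made for illustration. In a setting like the one described, the fitness would presumably be the SVM's validation score on the selected features; here a toy stand-in keeps the example self-contained.

```python
import random
from typing import Callable, Tuple

Mask = Tuple[int, ...]  # 1 = feature kept, 0 = feature dropped


def select_features(
    n_features: int,
    fitness: Callable[[Mask], float],
    pop_size: int = 20,
    generations: int = 30,
    mutation_rate: float = 0.05,
    seed: int = 0,
) -> Mask:
    """Return the best feature mask found by a simple generational GA.

    Assumes n_features >= 2 and pop_size >= 2.
    """
    rng = random.Random(seed)
    population = [
        tuple(rng.randint(0, 1) for _ in range(n_features))
        for _ in range(pop_size)
    ]
    best = max(population, key=fitness)

    def tournament() -> Mask:
        # Pick two distinct individuals and keep the fitter one.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = [best]  # elitism: carry the best mask forward unchanged
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            cut = rng.randint(1, n_features - 1)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation: each bit flips with probability mutation_rate.
            child = tuple(
                bit ^ 1 if rng.random() < mutation_rate else bit
                for bit in child
            )
            children.append(child)
        population = children
        best = max(population, key=fitness)
    return best


# The first three names come from the abstract; the "noise" features are
# hypothetical placeholders added only to give the GA something to discard.
FEATURES = ["n-gram", "block position", "font type",
            "noise_a", "noise_b", "noise_c"]
USEFUL = {0, 1, 2}


def toy_fitness(mask: Mask) -> float:
    """Hypothetical stand-in for a classifier score (NOT an SVM):
    rewards the useful features and penalizes the noise features."""
    gain = sum(mask[i] for i in USEFUL)
    penalty = 0.5 * sum(bit for i, bit in enumerate(mask) if i not in USEFUL)
    return gain - penalty


if __name__ == "__main__":
    best_mask = select_features(len(FEATURES), toy_fitness)
    print([name for name, bit in zip(FEATURES, best_mask) if bit])
```

Because the best mask of each generation is copied into the next one, the best fitness never decreases from one generation to the next. Replacing `toy_fitness` with a function that trains and validates a classifier on the masked feature columns would turn this into a wrapper-style feature selection loop.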





