An Optimized Feature Set Based on Genetic Algorithm for Business Web Pages Named Entity Recognition

Named Entity Recognition (NER) is the task of identifying proper nouns such as the names of people, corporations, and places, as well as dates. Recently, extracting information from web pages has attracted researchers' attention because of the valuable information such pages contain. The most commonly sought information is named entities (NEs). Web pages, however, contain additional entity types such as addresses, URLs, and telephone and fax numbers. Given the variety of features that have been used for NER, identifying the best feature set for extracting NEs from web pages is a challenging task. This paper proposes an optimized feature set, selected by a genetic algorithm (GA), for extracting NEs from web pages. The dataset was collected from business web pages. The feature set consists of text features such as n-grams and web features such as block position and font type. Finally, a Support Vector Machine (SVM) classifier was used to classify the NEs. The results show that the genetic algorithm is able to identify the features that yield the most accurate classification.
Named Entity Recognition; Feature Selection; Genetic Algorithm; Support Vector Machine; Web Pages
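
The idea of GA-based feature selection summarized above can be illustrated with a minimal sketch. It is not the configuration used in this paper: the function `ga_select`, its parameters (population size, crossover and mutation rates, tournament selection, elitism), and the scoring function `toy_fitness` are all hypothetical choices made for illustration. Each chromosome is a bit mask over the candidate features, where 1 means the feature is kept. In the setting described here, the fitness of a mask would plausibly be the accuracy of an SVM trained on the selected features; the toy function below merely stands in for that score so that the sketch stays self-contained.

```python
import random

# Hypothetical stand-in for classifier accuracy: pretend features 0, 2 and 5
# are the informative ones and every extra feature costs a small penalty.
USEFUL = {0, 2, 5}

def toy_fitness(mask):
    hits = sum(1 for i in USEFUL if mask[i])
    extras = sum(mask) - hits
    return hits - 0.1 * extras

def ga_select(n_features, fitness, pop_size=20, generations=30,
              crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Search for a feature bit mask that maximizes `fitness`.

    Returns the best mask found and the best fitness after each generation.
    """
    rng = random.Random(seed)
    # Initial population: random bit masks (1 = feature selected).
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    best = max(pop, key=fitness)
    history = [fitness(best)]

    for _ in range(generations):
        scored = [(fitness(c), c) for c in pop]

        def tournament():
            # Binary tournament: pick two individuals, keep the fitter one.
            a, b = rng.sample(scored, 2)
            return a[1] if a[0] >= b[0] else b[1]

        next_pop = [best[:]]  # elitism: carry the best mask forward unchanged
        while len(next_pop) < pop_size:
            p1, p2 = tournament(), tournament()
            if n_features > 1 and rng.random() < crossover_rate:
                cut = rng.randint(1, n_features - 1)  # single-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Bit-flip mutation applied independently to each gene.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            next_pop.append(child)

        pop = next_pop
        best = max(pop, key=fitness)
        history.append(fitness(best))

    return best, history

best_mask, history = ga_select(8, toy_fitness, seed=1)
selected = [i for i, bit in enumerate(best_mask) if bit]
print("selected feature indices:", selected)
print("best fitness per generation:", history)
```

Because the best mask of each generation is copied unchanged into the next, the best fitness never decreases from one generation to the next. Replacing `toy_fitness` with a cross-validated classifier score would turn the sketch into a wrapper-style feature selector, at the cost of one classifier training run per fitness evaluation.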

