
Word Extraction from Arabic Handwritten Documents Based on Statistical Measures




DOI: https://doi.org/10.15866/irecos.v11i5.9384

Abstract


Word extraction is particularly challenging in Arabic because words are often divided into sub-words: a few letters do not connect to the letter that follows them. In this paper, we present an efficient method for extracting words from Arabic handwritten documents. The proposed method is based on two groups of spatial measures, the lengths of connected components (CCs) and the gaps between these CCs, which differentiate successive CCs in a text line. Lengths are clustered into three distinct clusters to identify an optimal threshold for separating isolated letters, sub-words, and words. Gaps are clustered into two clusters to indicate whether a gap occurs "between-words" or "within-a-word". Both clusterings are implemented using the Self-Organizing Map (SOM) algorithm. The method was evaluated on 35 pages of handwritten Arabic text taken from the benchmark Database for Arabic Handwritten Text Recognition Research (AHDB). The tests produced promising results, achieving a correct extraction rate of 86.3%.
Copyright © 2016 Praise Worthy Prize - All rights reserved.
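
The gap-clustering step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the SOM configuration, so the one-dimensional map, the Gaussian neighbourhood, the linear decay schedules, the function names train_som_1d and assign, and the sample gap values are all assumptions made for illustration.

```python
import math
import random


def train_som_1d(values, n_units, epochs=100, lr0=0.5, seed=0):
    """Train a tiny 1-D SOM on scalar values and return the sorted unit weights.

    Hypothetical helper: the schedule and neighbourhood are illustrative choices.
    """
    rng = random.Random(seed)
    lo, hi = min(values), max(values)
    # Spread the initial weights evenly over the data range.
    weights = [lo + (hi - lo) * (i + 0.5) / n_units for i in range(n_units)]
    sigma0 = n_units / 2.0
    data = list(values)
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)               # linearly decaying learning rate
        sigma = sigma0 * (1.0 - frac) + 1e-3  # shrinking neighbourhood width
        rng.shuffle(data)
        for x in data:
            # Best-matching unit: the unit whose weight is closest to x.
            bmu = min(range(n_units), key=lambda i: abs(weights[i] - x))
            for i in range(n_units):
                # Gaussian neighbourhood measured over the unit index.
                h = math.exp(-((i - bmu) ** 2) / (2.0 * sigma ** 2))
                weights[i] += lr * h * (x - weights[i])
    return sorted(weights)


def assign(values, weights):
    """Return, for each value, the index of its nearest unit weight."""
    return [min(range(len(weights)), key=lambda i: abs(weights[i] - v))
            for v in values]


# Made-up horizontal gaps (in pixels) between successive CCs of one text line.
gaps = [2, 3, 2, 4, 3, 20, 22, 3, 2, 25, 3, 21]
units = train_som_1d(gaps, n_units=2)
labels = assign(gaps, units)
# Index 1 is the unit with the larger weight, read here as "between-words".
kinds = ["between-words" if k == 1 else "within-a-word" for k in labels]
```

Under the same assumptions, the length measure would be handled by calling train_som_1d with n_units=3 and reading the three sorted units as isolated letters, sub-words, and words.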

Keywords


Arabic Handwriting; Word Extraction; SOM Clustering; Handwriting Recognition




