A Systematic Mapping Study on OCR Techniques
A. Zafar Mehmood, B. Dr. Muddesar Iqbal, C. Muhammad Ali, D. Zahid Iqbal, E. Naveed Anwar Butt
Faculty of Computer Science and Information T University of Gujrat, Pakistan
zafar.mehmood@uog.edu.pk
Faculty of Computer Science and Information T University of Gujrat, Pakistan
m.iqbal@uog.edu.pk
Faculty of Computer Science and Information T University of Gujrat, Pakistan
Muhammad.ali@uog.edu.pk
Faculty of Computer Science and Information T University of Gujrat, Pakistan
zahid.iqbal@uog.edu.pk
Faculty of Computer Science and Information T University of Gujrat, Pakistan
Naveed@uog.edu.pk
Abstract
This paper presents a systematic mapping study of optical character recognition (OCR) techniques. It gives an overview of the different techniques used in OCR and addresses the following points: which type of research is used for exploring OCR techniques (survey, case study, solution proposal, etc.); which publication channels carry papers on the different OCR techniques; which technique is most prominent among researchers; and how many relevant research papers appear in well-known data sources. We tried to cover all the areas belonging to OCR, its usage and its implementations. Our research found 56 categorized primary studies. We observe that most OCR tools are designed using neural network techniques, and that researchers most often follow the solution proposal and validation research types when carrying out their research. The academic environment is where most research about OCR is conducted. We believe that the findings of this systematic mapping study will help researchers learn about the different techniques used for OCR and the research types (methods) used for exploring these techniques.
Keywords: OCR Techniques, SMS on OCR, Systematic Mapping Study on OCR.
I. INTRODUCTION
Before the invention of the computer, all documents were handwritten. With the passage of time, such documents were destroyed by environmental hazards. Because computers can store and share documents, attention turned to saving documents electronically, which is a durable and cheap solution. Paper documents are converted into editable text using a scanner and intelligent software that converts the image data to editable text; such systems are called optical character recognition systems (Chan et al., 2005).
and which needs more research (Petersen et al., 2008). A systematic review is the core technique when a detailed study is to be conducted. To gain an in-depth understanding of a study area, a systematic review is used instead of a systematic mapping study. Because a systematic review examines every aspect in depth, it takes much longer to complete (Herrmann et al., 2009)(Davis et al., 2009)(Ahmed et al., 2006). By applying initial SMS techniques to OCR, we observed that neural networks are the most used technique and that most research is conducted in academia. The online data sources provide a brief review of such techniques.
This paper shows the results of a systematic mapping study to identify and classify all major studies covering every prominent OCR research technique and product currently being analysed by researchers. Our mapping study addresses the following research questions (RQ): 1) What research techniques of OCR are mostly explored? 2) In which study distribution settings are these OCR research techniques investigated? 3) What research types are investigated for OCR research techniques? and 4) Which research method was applied most in the assessment of the OCR research techniques?
In the next section we will explain our systematic mapping study process, and Section 3 addresses the limitations found and discusses inferences identified from this study.
II. THE SYSTEMATIC MAPPING PROCESS
We conduct our systematic mapping study in three stages, which are discussed below.
A. Stage 1: Defining Scope, search strategy and selection criteria
The scope of this study was as follows:
Population. All articles explaining studies from academia, industry or government.
Intervention. All articles connecting OCR and its methods, techniques and tools.
Outcomes. Frequency and category of facts relating to the discussion and implementation of OCR research techniques.
Study design. Solution proposal, validation research, evaluation research, review paper, experience reports.
The search strategy was based on the design of search terms and the selection of search resources. Our search terms include the search strings related to the area in which we conduct the study: (1) OCR, (2) OCR techniques, (3) OCR methods, (4) OCR strategies, (5) OCR approaches, (6) optical character recognition techniques, (7) optical text recognition strategies, (8) optical number recognition techniques, (9) optical character matching techniques, (10) optical character detection techniques, (11) optical character identification techniques. We then used Boolean OR and Boolean AND to combine our terms: Boolean OR to combine alternate terms and synonyms, and Boolean AND to combine the main parts of a term. As an example, the complete search string that we wrote for CiteseerX, following its string-making rules, is: (OCR) OR (OCR AND (Techniques OR Strategies OR Methods OR Approaches)) OR (Optical AND (Character OR Text OR Number) AND (Recognition OR Matching OR Detection OR Identification) AND (Techniques OR Strategies OR Methods OR Approaches)) OR (Optical OR (Character OR Text OR Number) AND (Recognition OR Matching OR Detection OR Identification) AND (Techniques OR Strategies OR Methods OR Approaches)).
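The way such a string is assembled can be illustrated with a minimal Python sketch. It is not the tooling used in this study; the helper names (any_of, all_of, build_query) and the grouping of words into SUBJECTS, ACTIONS and KINDS are assumptions made for illustration. The sketch reproduces only the first three OR-ed clauses of the CiteseerX string quoted above and leaves out its fourth clause.

```python
def any_of(terms):
    """Combine alternate terms and synonyms with Boolean OR."""
    return "(" + " OR ".join(terms) + ")"


def all_of(parts):
    """Combine the main parts of a term with Boolean AND."""
    return "(" + " AND ".join(parts) + ")"


# Word groups taken from the search terms listed in the text
# (the grouping itself is an assumption of this sketch).
SUBJECTS = ["Character", "Text", "Number"]
ACTIONS = ["Recognition", "Matching", "Detection", "Identification"]
KINDS = ["Techniques", "Strategies", "Methods", "Approaches"]


def build_query():
    """Assemble the query from the bare, short and long forms of the term."""
    bare = "(OCR)"
    short_form = all_of(["OCR", any_of(KINDS)])
    long_form = all_of(["Optical", any_of(SUBJECTS), any_of(ACTIONS), any_of(KINDS)])
    return " OR ".join([bare, short_form, long_form])


query = build_query()
print(query)
```

Keeping the synonym groups in lists makes it easy to adapt the same terms to the string-making rules of another search resource.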
We chose three search resources: IEEE Digital Library, ACM Digital Library, and CiteseerX. We ran the search strings on these resources to get the results. To choose papers from the results generated by the search, we used the following inclusion criteria to select relevant papers:
I1: The paper discusses one or more OCR research techniques, and the discussion or study was conducted by academia, industry, government, or a collaboration among them.
I2: The paper contains a comparison study of two or more OCR research techniques.
I3: If a paper contains a study that is also presented in other papers, we considered all of them.
I4: If a paper reports more than one technique, we treat each technique as a separate paper.
I5: We include papers for which we could access only the abstract.
We also considered the following exclusion criteria:
E1: We excluded papers that did not report the technique used for OCR.
E2: Any paper whose title, keywords, abstract are not accessible is excluded.
E3: We excluded material in the form of posters, summaries of articles, tutorials, panels and slides.
E4: We excluded books and magazines.
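The screening step can be sketched as a small filter. This is an illustrative sketch, not the procedure actually run for the study: the Record fields, the publication-kind labels and the order in which the criteria are checked are all assumptions, and E2 is read here as "none of title, keywords and abstract is accessible" so that it stays compatible with I5.

```python
from dataclasses import dataclass

# Publication kinds ruled out by E3 and E4 (the labels are hypothetical).
EXCLUDED_KINDS = {"poster", "summary", "tutorial", "panel", "slides", "book", "magazine"}


@dataclass
class Record:
    title: str
    keywords: str
    abstract: str
    kind: str          # e.g. "journal", "conference", "poster"
    techniques: tuple  # OCR techniques the paper reports; may be empty


def exclusion_reason(rec):
    """Return the first exclusion criterion that applies, or None."""
    if not (rec.title or rec.keywords or rec.abstract):
        return "E2"      # nothing accessible to judge the paper by
    if rec.kind in EXCLUDED_KINDS:
        return "E3/E4"   # posters, summaries, tutorials, panels, slides, books, magazines
    if not rec.techniques:
        return "E1"      # no OCR technique reported
    return None


def screen(records):
    """Keep the records that pass, with one entry per reported technique (I4)."""
    kept = []
    for rec in records:
        if exclusion_reason(rec) is None:
            kept.extend((tech, rec.title) for tech in rec.techniques)
    return kept


sample = [
    Record("Paper A", "ocr", "text", "journal", ("Neural network", "Fuzzy model")),
    Record("Paper B", "ocr", "text", "poster", ("Ad hoc",)),
    Record("Paper C", "ocr", "text", "conference", ()),
    Record("", "", "", "journal", ("Ad hoc",)),
]
entries = screen(sample)
print(entries)
```

Because of I4, the number of entries produced by such a filter can exceed the number of papers that pass it.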
B. Stage 2: Selecting primary studies/Initial findings
TABLE I
PUBLICATION CHANNELS (1993-2012)
Acronym               Type of publication   Percent
Anonymous             Anonymous                39.7
TPAMI                 Journal                   8.6
Pattern Recognition   Journal                   8.6
ICDAR                 Conference                6.9
IVC                   Journal                   5.2
SDIUT                 Conference                3.4
VCIR                  Conference                1.7
ACT                   Conference                1.7
AlRD                  Conference                1.7
Digital Libraries     Conference                1.7
Multimedia            Conference                1.7
SMCV BC               Conference                1.7
DAS                   Conference                1.7
Electronic Imaging    Journal                   1.7
RIDL                  Workshop                  1.7
VLSI SPS              Journal                   1.7
TALIP                 Journal                   1.7
TITSJ                 Journal                   1.7
ANUTD                 Workshop                  1.7
Multilingual OCR      Workshop                  1.7
PUI                   Workshop                  1.7
C. Stage 3: Classification of selected studies
Our classification criteria consist of the OCR research techniques studied, the type of research method, and the study distribution setting:
• OCR research techniques studied: this refers to the research techniques used in the study or discussion of OCR, such as neural network, genetic programming, fuzzy model, ad hoc code based, etc.
• Type of research methods study: We consider seven types of research methods, partly based on (Fernandez et al., 2009)(Petersen et al., 2008)(Tonella et al., 2009): solution proposal; validation research; evaluation research; review paper; solution proposal and validation research combined; solution proposal and evaluation research combined; and solution proposal, validation research and evaluation research combined. The classification is done according to these criteria.
• Study distribution setting: this refers to the area where the study takes place, such as the academic environment, the industry environment or government. We also considered the different combinations of academic and industry.
According to the above-mentioned criteria, our four RQs were addressed; they are analyzed below.
1) Which are the largely explored OCR research techniques?
42 OCR research techniques were found in this systematic mapping study. Table 2 reports the most studied OCR research techniques, led by Neural network, Anonymous, Ad hoc and Word Shape Coding. 19.0% of the studies focus on neural networks, and if we combine all other neural network types, such as hidden Markov, the percentage for neural networks increases to 29.3%. We also found 38 OCR research techniques that occur only once. Based on this finding, we suggest that more research is needed to understand these OCR research techniques. Next, we analyse the relation between the OCR research techniques and the other criteria considered in this study.
2) In which study distribution settings are these OCR research techniques investigated?
TABLE II
OCR TECHNIQUES MOST INVESTIGATED
Research Technique                      Frequency  Percent
Neural network                                 11     19.0
Anonymous                                       5      8.6
Ad hoc                                          2      3.4
Word Shape Coding                               2      3.4
DT–CNNs and classifier combining                1      1.7
higher-level parsing routines                   1      1.7
Generalized Hough Transform                     1      1.7
Fuzzy Hough Transform and MLP                   1      1.7
dissection, spatial, hybrid, holistic           1      1.7
design principles                               1      1.7
Database-driven                                 1      1.7
clustering algorithm                            1      1.7
character shape coding                          1      1.7
Brick Wall Coding (BWC)                         1      1.7
machine learning                                1      1.7
Perspective Invariants                          1      1.7
hybrid architecture                             1      1.7
TABLE III
DISTRIBUTION OF STUDY SETTING
Study Setting         Frequency  Percent
Academic                     44     75.9
Industry                      8     13.8
Anonymous                     3      5.2
Academic, Industry            2      3.4
Government                    1      1.7
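The Percent column of Table III follows directly from the Frequency column. The short sketch below recomputes it; the function and variable names are illustrative assumptions, while the counts are the ones reported in the table.

```python
# Frequencies as reported in Table III.
setting_counts = {
    "Academic": 44,
    "Industry": 8,
    "Anonymous": 3,
    "Academic, Industry": 2,
    "Government": 1,
}


def percent_column(counts):
    """Express each frequency as a percentage of the total, to one decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


total_entries = sum(setting_counts.values())
percents = percent_column(setting_counts)
for label, pct in percents.items():
    print(f"{label:<20}{setting_counts[label]:>4}{pct:>7.1f}")
```

The frequencies sum to 58 entries, slightly more than the 56 primary studies; this is presumably an effect of criterion I4, under which a paper reporting several techniques is counted once per technique.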
In Fig. 1 we take OCR research techniques on the X-axis and study distribution settings on the Y-axis. The numbering starts from 5 on both axes, and every technique and every distribution setting takes a value distinct from the other members of its axis. In Fig. 1, the size of each circle shows the percentage at the intersection of a research technique and a study distribution setting.
3) What research types are investigated for OCR research techniques?
Table 5 shows that the leading research type is solution proposal combined with validation research, followed by solution proposal alone.
Figure 1. Bubble plot for OCR techniques studied - types of distribution settings (X-axis: Research Technique; Y-axis: Distribution Study)
TABLE IV
OCR TECHNIQUES STUDIED-TYPES OF DISTRIBUTION SETTINGS
Research technique (values in %)                           Academic  Academic,  Anonymous Government   Industry
                                                                      Industry
Ad hoc                                                           50          0         50          0          0
Anonymous                                                        60          0          0          0         40
Approximate Stroke Sequence                                     100          0          0          0          0
Brick Wall Coding (BWC)                                           0        100          0          0          0
Cascade Ensemble Classifier System and Hybrid Features          100          0          0          0          0
character shape coding                                          100          0          0          0          0
clustering algorithm                                            100          0          0          0          0
Database-driven                                                 100          0          0          0          0
design principles                                                 0          0          0          0        100
dissection, spatial, hybrid, holistic
DT–CNNs and classifier combining                                100          0          0          0          0
Fuzzy Hough Transform and an MLP                                  0          0        100          0          0
Fuzzy model                                                     100          0          0          0          0
Generalized Hausdorff Image Comparison                          100          0          0          0          0
Generalized Hough Transform                                     100          0          0          0          0
Hidden Markov                                                     0        100          0          0          0
hidden Markov model                                               0          0          0          0        100
hidden Markov Neural Network                                    100          0          0          0          0
higher-level parsing routines                                   100          0          0          0          0
higher-level parsing routines, Lisp Tex Expressions             100          0          0          0          0
hybrid architecture                                             100          0          0          0          0
hybrid wavelet/neural network                                   100          0          0          0          0
Lexicon-driven segmentation and recognition                       0          0          0          0        100
machine learning                                                100          0          0          0          0
multi-plane approach for text segmentation                      100          0          0          0          0
multi-staged technique                                          100          0          0          0          0
Neural network                                                   89          0         11          0          0
neural network models, machine learning methods                   0          0          0        100          0
Neural Networks                                                 100          0          0          0          0
octal graph conversion and the metrics based on ranks           100          0          0          0          0
Operator Context Scanning algorithm                             100          0          0          0          0
Perspective Invariants                                          100          0          0          0          0
Recurrent Neural Networks                                         0          0          0          0        100
semi-automatic and adaptive                                     100          0          0          0          0
Sensor network                                                  100          0          0          0          0
Sparse Pixel Character Vectorisation algorithm (SPCV)           100          0          0          0          0
Staff and graphical primitive segmentation                      100          0          0          0          0
text indexing and retrieval                                     100          0          0          0          0
Three Stage Technique (Segmentation, Feature Extraction,
  Classification)                                               100          0          0          0          0
two-stage scheme                                                100          0          0          0
user-centered approach                                          100          0          0          0
Word Shape Coding                                               100          0          0          0          0
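Each row of Table IV gives, for one technique, the share of its entries in each study setting. A minimal sketch of that row-percentage cross-tabulation is shown below. The function name is an assumption, and the entry counts in the example (8 and 1 for Neural network, 1 and 1 for Ad hoc, 3 and 2 for Anonymous) are illustrative values chosen to be consistent with the reported percentages, not counts taken from the study data.

```python
from collections import Counter, defaultdict


def row_percentages(pairs):
    """Cross-tabulate (technique, setting) pairs into per-technique percentages."""
    table = defaultdict(Counter)
    for technique, setting in pairs:
        table[technique][setting] += 1
    result = {}
    for technique, row in table.items():
        n = sum(row.values())
        # Rounded to whole percent, as in Table IV.
        result[technique] = {setting: round(100 * c / n) for setting, c in row.items()}
    return result


# Illustrative entries only (assumed counts, see the note above).
pairs = (
    [("Neural network", "Academic")] * 8
    + [("Neural network", "Anonymous")]
    + [("Ad hoc", "Academic"), ("Ad hoc", "Anonymous")]
    + [("Anonymous", "Academic")] * 3
    + [("Anonymous", "Industry")] * 2
)
shares = row_percentages(pairs)
print(shares)
```

Settings that never occur for a technique are simply absent from its row and correspond to the 0 cells of the table.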
We also provide a bubble chart to report the percentages, shown in Fig. 2. In Fig. 2 we take OCR research techniques on the X-axis and research types on the Y-axis. The numbering starts from 5 on both axes, and every technique and every research type takes a value distinct from the other members of its axis. In Fig. 2, the size of each circle shows the percentage at the intersection of a research technique and a research type.
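The mapping from table rows to bubbles can be sketched as follows. This is an assumption-laden illustration rather than the code used to draw the figures: the helper names are hypothetical, the axis step of 1 is a guess (the text only states that numbering starts from 5 and that values are unique), and the two example rows are taken from the Ad hoc and Anonymous rows of Table V.

```python
def axis_positions(labels, start=5, step=1):
    """Give every label a unique axis value, numbering from `start`."""
    return {label: start + i * step for i, label in enumerate(labels)}


def bubbles(percentages, x_pos, y_pos, scale=1.0):
    """Return (x, y, size) triples, one per non-zero cell of the table."""
    points = []
    for technique, row in percentages.items():
        for category, pct in row.items():
            if pct > 0:
                points.append((x_pos[technique], y_pos[category], pct * scale))
    return points


# Two rows of Table V, zero cells omitted.
percentages = {
    "Ad hoc": {"Solution Proposal": 50, "Solution, Validation": 50},
    "Anonymous": {"Review Paper": 80, "Validation Research": 20},
}
x_pos = axis_positions(list(percentages))
y_pos = axis_positions(
    ["Review Paper", "Solution Proposal", "Solution, Validation", "Validation Research"]
)
points = bubbles(percentages, x_pos, y_pos)
print(points)
```

The resulting triples can be handed to any scatter-plot routine that accepts a per-point marker size.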
4) Which research method was applied mostly in the assessment of the OCR research techniques?
Figure 2. Bubble plot for OCR techniques studied - types of research
TABLE V
OCR RESEARCH TECHNIQUES STUDIED-TYPES OF RESEARCH
Columns: ER = Evaluation Research; RP = Review Paper; SP = Solution Proposal; S,E = Solution, Evaluation; S,V = Solution, Validation; S,V,E = Solution, Validation, Evaluation; VR = Validation Research.
Research technique (values in %)                             ER     RP     SP    S,E    S,V  S,V,E     VR
Ad hoc                                                        0      0     50      0     50      0      0
Anonymous                                                     0     80      0      0      0      0     20
Approximate Stroke Sequence                                   0      0    100      0      0      0      0
Brick Wall Coding (BWC)                                       0      0      0      0    100      0      0
Cascade Ensemble Classifier System and Hybrid Features        0      0    100      0      0      0      0
character shape coding                                        0      0      0      0    100      0      0
clustering algorithm                                          0      0    100      0      0      0      0
Database-driven                                               0      0      0      0    100      0      0
design principles                                             0      0      0      0    100      0      0
dissection, spatial features, hybrid, holistic                0    100      0      0      0      0      0
DT–CNNs and classifier combining                              0      0      0      0      0    100      0
Fuzzy Hough Transform and an MLP                              0      0      0      0    100      0      0
Fuzzy model                                                   0      0      0      0    100      0      0
Generalized Hausdorff Image Comparison                        0      0      0      0    100      0      0
Generalized Hough Transform
Hidden Markov                                                 0      0    100      0      0      0      0
hidden Markov model                                         100      0      0      0      0      0      0
hidden Markov Neural Network                                  0      0      0      0    100      0      0
higher-level parsing routines                                 0      0    100      0      0      0      0
higher-level parsing routines, Lisp Tex Expressions           0      0    100      0      0      0      0
hybrid architecture                                           0      0    100      0      0      0      0
hybrid wavelet/neural network                                 0      0      0      0    100      0      0
Lexicon-driven segmentation and recognition                   0      0      0      0    100      0      0
machine learning                                              0      0      0      0      0    100      0
multi-plane approach for text segmentation
multi-staged technique                                        0      0    100      0      0      0      0
Neural network                                                0      0   44.4      0   55.6      0      0
neural network models, machine learning methods               0      0      0      0    100      0      0
Neural Networks                                               0     50     50      0      0      0      0
octal graph conversion and the metrics based on ranks         0      0      0      0    100      0      0
Old Techniques                                                0    100      0      0      0      0      0
Operator Context Scanning algorithm                           0      0      0      0    100      0      0
Perspective Invariants
Recurrent Neural Networks                                     0      0      0      0    100      0      0
semi-automatic and adaptive                                   0      0      0      0    100      0      0
Sensor network                                                0      0    100      0      0      0      0
Sparse Pixel Character Vectorisation algorithm (SPCV)         0      0      0      0    100      0      0
Staff and graphical primitive segmentation                    0      0      0      0    100      0      0
text indexing and retrieval                                   0      0      0    100      0      0      0
Three Stage Technique (Segmentation, Feature Extraction,
  Classification)                                             0      0      0      0    100      0      0
two-stage scheme
user-centered approach                                        0      0    100      0      0      0      0
Word Shape Coding                                             0      0      0      0    100      0      0
TABLE VI
DISTRIBUTION OF RESEARCH METHODS
Research Type                       Frequency  Percent
Solution, Validation                       27     46.6
Solution Proposal                          17     29.3
Review Paper                                7     12.1
Solution, Validation, Evaluation            3      5.2
Solution, Evaluation                        2      3.4
Validation Research                         1      1.7
Evaluation Research                         1      1.7
III. DISCUSSION
This mapping study shows which OCR research techniques are being explored with respect to the implementation of OCR, the study distribution settings in which the study of OCR is conducted, and the research types applied in such studies.
References
i. R. Pretorius, D. Budgen, "A mapping study on empirical evidence related to the models and forms used in the UML", Proceedings of the Second International Symposium on Empirical Software Engineering and Measurement (ESEM), October 9-10, 2008, Kaiserslautern, Germany, pp. 342-344.
ii. J. Bailey, D. Budgen, M. Turner, B. Kitchenham, P. Brereton, S. Linkman, "Evidence relating to Object-Oriented software design: A survey", First International Symposium on Empirical Software Engineering and Measurement (ESEM), 2007, pp. 482-484.
iii. Y.-K. Chan, Y.-A. Ho, H.-C. Wu, Y.-P. Chu, "A Duplicate Chinese Document Image Retrieval System", 2005, pp. 1-6.
iv. N. Condori-Fernandez, M. Daneva, K. Sikkel, R. Wieringa, O. Dieste, O. Pastor, "A Systematic Mapping Study on Empirical Evaluation of Software Requirements Specifications Techniques", Third International Symposium on Empirical Software Engineering and Measurement, 2009, pp. 502-505.
v. K. Petersen, R. Feldt, S. Mujtaba, M. Mattsson, "Systematic mapping studies in software engineering", 12th International Conference on Evaluation and Assessment in Software Engineering, vol. 17, p. 1, 2008.
vi. A. Herrmann, M. Daneva, "Requirements Prioritization Based on Benefit and Cost Prediction: An Agenda for Future Research", 16th International Requirements Engineering Conference, IEEE Computer Society, 8-12 September 2008, Barcelona, Spain, pp. 125-134.
vii. M. Davis, Ó. Dieste, A. M. Hickey, N. Juristo, A. Moreno, "Effectiveness of Requirements Elicitation Techniques: Empirical Results Derived from a Systematic Review", 14th International Conference on Requirements Engineering (RE 2006), IEEE Computer Society, 11-15 September 2006, Minneapolis, USA, pp. 176-185.
viii. P. Tonella, M. Torchiano, D. Du Bois, T. Systa, "Empirical studies in reverse engineering: state of the art and future trends", Empirical Software Engineering Journal, Springer, 2007.
ix. J. Calmon, P. Gomes, A. Cruz, T. Uchôa, G. Travassos, "Scientific research ontology to support systematic review in software engineering", Advanced Engineering Informatics 21(2), 2007, pp. 133-151.
x. K. Petersen, R. Feldt, M. Shahid, M. Mattsson, "Systematic Mapping Studies in Software Engineering", 12th International Conference on Evaluation and Assessment in Software Engineering (EASE), Department of Informatics, University of Bari, Italy, June 2008.
xi. K. Ahmed, "A Systematic Review of Software Requirements Prioritization", Master Thesis in
Zafar Mehmood Khattak received his BS degree from Peshawar University, his MCS degree from Kohat University in 2006, and his MSCS degree from University of Gujrat in 2013. Currently he is working as Lecturer in Computer Science and Deputy Manager ORIC at University of Gujrat. He completed his MSCS in Wireless Networking; his research interests include computational intelligence and wireless networks.
Dr. Muddesar Iqbal received his PhD from Kingston University, UK, in the area of Wireless Mesh Networks, with a thesis entitled "Design, Development and Implementation of a High Performance Wireless Mesh Network for Application in Emergency and Disaster Recovery". He has undertaken research projects in the areas of both healthcare and disaster recovery. He won another Award of Appreciation from ABE UK for tutoring the prize winner (from 63 countries and 541 Institutes/Colleges) in the Information System Project Management module in 2010. He has published 13 papers, all in international journals and proceedings, in the area of wireless networks, targeting its application in healthcare and in emergency and disaster recovery.
Zahid Iqbal received the B.S. degree in Computer Science from University of Punjab and the M.S. degree in Computer Science from NUCES, FAST University, Pakistan, in 2010 and 2012 respectively. He was a lecturer with the Department of IT, University of Punjab, PK. Since 2012, he has been a Lecturer with the Faculty of Computing and IT, University of Gujrat, Punjab, PK. His research interests include evolutionary algorithms, swarm intelligence, artificial neural networks, computational intelligence in dynamic and uncertain environments, and real-world applications.
Naveed Anwar Butt received the B.S. degree in Computer Science from University of Punjab and the M.S. degree in Computer Science from Islamic International University Islamabad, and is currently doing his PhD in the field of Data Mining. He is an Assistant Professor in the Faculty of Computing and IT, University of Gujrat, Punjab, PK. His research interests include data warehousing, network security and data mining.