
COMPARATIVE STUDY OF

DIFFERENT CLASSIFIERS FOR

DEVANAGARI HANDWRITTEN

CHARACTER RECOGNITION

ANILKUMAR N. HOLAMBE (Research Student)
Department of Information Technology, College of Engineering, Osmanabad-413501 (M.S.), India
e-mail: anholambe@yahoo.com

DR. RAVINDER C. THOOL
Department of Information Technology, Shri Guru Gobind Singhji Institute of Engg & Technology, Vishnupuri, Nanded-431606 (M.S.), India
e-mail: rcthool@yahoo.com

Abstract:

This paper deals with the recognition of off-line handwritten Devanagari characters. It presents an experimental assessment of the efficiency of various classifiers in terms of recognition accuracy. We use one feature set and 21 different classifiers in our experiments. The paper reports the recognition results of the different classifiers and thus provides a comparative study of Devanagari handwritten character recognition that can serve as a benchmark for future research.

Keywords: classifiers; recognition; handwritten character; Devanagari; feature

1. Introduction

Handwriting recognition is the ability of a computer to translate human writing into text. It can be approached in two ways. The first of these techniques, known as optical character recognition (OCR), is the most successful; OCR is used, for example, to convert large quantities of handwritten documents into searchable, easily accessible digital form. The second technique is often referred to as on-line recognition. Methods and recognition rates depend on the constraints placed on the handwriting, which are mainly characterized by the type of handwriting, the number of writers, the size of the dataset and the spatial layout. For many years, academic laboratories and companies have been involved in research on handwriting recognition. The increase in accuracy of handwriting processing results from a combination of several elements: the use of complex systems integrating several kinds of information, the choice of relevant application domains, and new technologies such as high-quality high-speed scanners and inexpensive powerful CPUs. A character recognition system requires two things: preprocessing of the data set and decision-making algorithms. Preprocessing techniques fall into three categories: global transforms, local comparison, and geometrical or topological characteristics. Various kinds of decision methods have been used, such as statistical methods, neural networks, structural matching and stochastic processing. Many recent methods combine several existing techniques in order to provide better reliability and to compensate for the great variability of handwriting.

2. Devanagari script and dataset


rich set of conjuncts. A syllable ('akshar') is formed by a vowel alone or by any combination of consonants with a vowel. Optical character recognition for Devanagari is highly complex due to its rich set of conjuncts; there are about 280 compound characters in Devanagari [22]. Due to the lack of uniform data sets for Devanagari script OCR, very little research is going on; a good survey of research on Devanagari is given in [1,22]. We have implemented different classifiers and report their results so that they may serve as a benchmark for future research. In the present work we have developed a handwritten database. We collected data from people of different age groups and different professions, visiting a primary school, a secondary school, a high school, government offices and an adult education night school. The data were scanned at 300 dpi using an HP flatbed scanner and stored as gray-level images. The preprocessing steps performed in this work are rectification of distorted images, improvement of image quality to ensure better edges in the subsequent edge determination step, and size normalization.

Fig. 1. Datasheet used for data collection

3. Feature Extraction

In this feature extraction technique the binary image of a character is partitioned into a fixed number of sub-blocks, as follows.

Step 1: Start.
Step 2: If the image is gray scale, convert it into a binary image.
Step 3: Partition the binary image into 9 sub-blocks.
Step 4: If an end point of the character (or a part of it) lies in one of these sub-blocks, assign 1 to that sub-block, else 0.
Step 5: If the character touches a partition line, assign 1, else 0.
Step 6: Combine Step 4 and Step 5 to get the feature vector.
Step 7: End.


Fig. 3. Example of a character in the sub-blocks

From the above algorithm, the feature vector for each character is V = {SB1, SB2, SB3, SB4, SB5, SB6, SB7, SB8, SB9, L1, L2, L3, L4, L5, L6, L7, L8, L9, L10, L11, L12}. For the character 'tha' shown in Fig. 2 the vector is {1,1,1,1,1,0,1,1,0,1,1,0,1,1,1,0,0,1,1,1,0}. Similarly, we have calculated the vector for all Devanagari characters and numerals.
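The procedure can be sketched in a few lines of code. The following is a minimal illustration only, assuming the character is available as a binary NumPy array; the paper marks a sub-block when an end point of the character falls in it, which we approximate here by the simpler presence of any foreground pixel, and we assume the twelve line features L1-L12 correspond to the twelve segments of the two horizontal and two vertical partition lines (all function names are ours).

```python
import numpy as np

def subblock_features(img):
    """Sketch of the 9 + 12 binary feature vector of Section 3.

    `img` is a binary (0/1) NumPy array of the character. Foreground presence
    is used as a stand-in for the end-point test of Step 4.
    """
    h, w = img.shape
    rows = [0, h // 3, 2 * h // 3, h]
    cols = [0, w // 3, 2 * w // 3, w]

    # SB1..SB9: one bit per 3x3 sub-block.
    sb = []
    for r in range(3):
        for c in range(3):
            block = img[rows[r]:rows[r + 1], cols[c]:cols[c + 1]]
            sb.append(1 if block.any() else 0)

    # L1..L12: one bit per partition-line segment touched by the character.
    line_feats = []
    for r in rows[1:3]:                      # two horizontal partition lines
        for c in range(3):
            seg = img[r, cols[c]:cols[c + 1]]
            line_feats.append(1 if seg.any() else 0)
    for c in cols[1:3]:                      # two vertical partition lines
        for r in range(3):
            seg = img[rows[r]:rows[r + 1], c]
            line_feats.append(1 if seg.any() else 0)

    return np.array(sb + line_feats)         # 21-dimensional feature vector
```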

4. Brief Description of Classifiers

In this paper we determine the classification accuracy of different classifiers and also calculate the average error rate of each classifier. A brief description of each classifier is given below; for details please refer to the cited references.

4.1. Euclidean Distance (ED)

The Euclidean distance between the input pattern and the mean vector is defined by

$g(x) = \lVert X - M_l \rVert^2 \qquad (1)$

where X is the input feature vector of size (dimensionality) n and $M_l$ is the mean vector of class l. The input vector is classified to the class l* that minimizes the Euclidean distance. Hereafter the subscript l denoting the class is omitted for the sake of simplicity [2].
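As an illustration, a minimal sketch of this minimum-distance rule (helper names are ours, assuming NumPy arrays of feature vectors):

```python
import numpy as np

def train_means(X, y):
    """Class mean vectors M_l from training features X (n_samples x n) and labels y."""
    classes = np.unique(y)
    return classes, np.array([X[y == c].mean(axis=0) for c in classes])

def classify_euclidean(x, classes, means):
    """Assign x to the class whose mean minimizes ||x - M_l||^2 (Eq. 1)."""
    d = ((means - x) ** 2).sum(axis=1)
    return classes[np.argmin(d)]
```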

4.2.Projection Distance (PD)

The projection distance is defined by

$g_{pd}(x) = \lVert X - M \rVert^2 - \sum_{i=1}^{k} \{\phi_i^T (X - M)\}^2 \qquad (2)$

and gives the distance from the input pattern X to the minimum mean square error hyperplane that approximates the distribution of the samples, where $\phi_i$ denotes the ith eigenvector of the covariance matrix and k is the dimensionality of the hyperplane as well as the number of dominant eigenvectors (k < n). When k = 0 the projection distance reduces to the Euclidean distance [6].

4.3.Subspace method (SM):

For a bipolar distribution on a spherical surface with ||x|| = 1, the mean vector M is a zero vector (M = 0) because the distribution is symmetric with respect to the origin. Then the projection distance for the distribution is given by

$g(x) = 1 - \sum_{i=1}^{k} (\phi_i^T X)^2 \qquad (3)$

where $\phi_i$ is the ith eigenvector of the autocorrelation matrix. The second term of the above expression is used as the similarity measure of CLAFIC (CLAss-Featuring Information Compression) and the subspace method [7].

4.4. Modified Projection Distance (MPD)

The modified projection distance is defined by

$g(x) = \lVert X - M \rVert^2 - (1 - \alpha) \sum_{i=1}^{k} \{\phi_i^T (X - M)\}^2 \qquad (4)$

where $\alpha$ is a parameter which takes values in [0, 1]. When $\alpha = 0$ this classifier gives the same value as the projection distance, and when $\alpha = 1$ it gives the same value as the Euclidean distance. The value of $\alpha$ used here was decided by a preliminary experiment [2].

4.5.Parzen Window classifier (PWC)

The goal is to describe the Parzen window classifier. The class-conditional density is estimated by

$\hat{p}(x \mid y) = \frac{1}{|y|} \sum_{i : y_i = y} K(x, x_i) \qquad (5)$

and, by Bayes rule,

$\hat{p}(y \mid x) = \frac{\sum_{i : y_i = y} K(x, x_i)}{\sum_{i} K(x, x_i)} \qquad (6)$

where K is typically a Gaussian kernel of the form (up to an irrelevant multiplicative constant)

$K(x, x_i) = \exp\left(-\lVert x - x_i \rVert^2 / 2\sigma^2\right).$

The reason for considering this simple classifier with active learning is that equation (6) directly gives an estimate of the posterior probability and that it does not need costly retraining when a point is added to the training set [3].
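A compact sketch of equations (5)-(6) with the Gaussian kernel; the bandwidth $\sigma$ and the helper names are our own choices:

```python
import numpy as np

def parzen_posterior(x, X_train, y_train, sigma=1.0):
    """Posterior estimate of Eq. (6) using the Gaussian kernel."""
    k = np.exp(-((X_train - x) ** 2).sum(axis=1) / (2 * sigma ** 2))
    classes = np.unique(y_train)
    scores = np.array([k[y_train == c].sum() for c in classes])
    return classes, scores / scores.sum()

def classify_parzen(x, X_train, y_train, sigma=1.0):
    classes, post = parzen_posterior(x, X_train, y_train, sigma)
    return classes[np.argmax(post)]
```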

4.6.A modified Quadratic Discriminant function (MQDF):

The modified quadratic discriminant function (MQDF) used by the quadratic classifier [4] is defined by

$g(x) = (N + N_0)\,\ln\!\left[1 + \frac{1}{N_0\,\sigma^2}\left\{\lVert X - M \rVert^2 - \sum_{i=1}^{k}\frac{N\lambda_i}{N\lambda_i + N_0\,\sigma^2}\,\{\phi_i^T (X - M)\}^2\right\}\right] + \sum_{i=1}^{k}\ln\!\left(\frac{N\lambda_i + N_0\,\sigma^2}{N_0\,\sigma^2}\right) \qquad (7)$

where x is the feature vector of an input character, M is the mean vector of the samples, $\phi_i$ is the ith eigenvector of the sample covariance matrix, $\lambda_i$ is the ith eigenvalue of the sample covariance matrix, n is the feature size, $\sigma^2$ is the average variance of all classes, N is the average sample size of all classes, and $N_0$ is selected experimentally.

4.7.GLVQ Algorithm(GLVQ):

The proximity of an input feature vector X to its own class is defined by (8) below, where d1 is the distance between X and the reference vector R(1) of the class to which X belongs, and d2 is the distance between X and the nearest reference vector R(2) of a class to which X does not belong.

$\mu(x) = \frac{d_1(x) - d_2(x)}{d_1(x) + d_2(x)} \qquad (8)$

The purpose of the learning is to reduce incorrect classification by reducing the value of $\mu$ for as many X as possible. GLVQ is formalized as a minimization problem of an evaluation function Q defined by

$Q = \sum_{i=1}^{N} f(\mu_i) \qquad (9)$

where N is the total number of input vectors and $f(\mu)$ is a monotonically increasing function of $\mu$.

The reference vectors are updated by

$R^{(1)}_{t+1} = R^{(1)}_t + \alpha_t \, \frac{\partial f}{\partial \mu} \, \frac{d_2}{(d_1 + d_2)^2} \, (x - R^{(1)}_t) \qquad (10)$

$R^{(2)}_{t+1} = R^{(2)}_t - \alpha_t \, \frac{\partial f}{\partial \mu} \, \frac{d_1}{(d_1 + d_2)^2} \, (x - R^{(2)}_t) \qquad (11)$

which are derived by the steepest descent method. In the experiment, the first derivative of the function f is defined by

$\frac{\partial f}{\partial \mu}(\mu, t) = f(\mu, t)\,\bigl(1 - f(\mu, t)\bigr) \qquad (12)$

where t is the number of iterations and $f(\mu, t)$ is the sigmoid function defined by

$f(\mu, t) = \frac{1}{1 + e^{-\mu t}} \qquad (13)$

As the initial reference vectors, the mean vector of each class is used [5].

4.8. Polynomial Classifier (PC)

The task of classifier adaptation is equivalent to finding the best function d according to a suitable optimization criterion. Common mean-square optimization leads to

$\min_{d} \; E\{\lVert y(x) - d(x) \rVert^2\} \qquad (15)$

where E{·} stands for the expectation operator (its minimum value is called the residual variance), d(x) is the result of the classification, and y(x) is the desired output for the feature vector x.

The approximation of the vector function d can be made, for example, by a functional classifier or a multilayer perceptron. The essential idea in the functional classifier approach is to connect the input variables in a non-linear way through functions $f_i(x)$, thus enhancing the features, but to use only a linear function for the output layer (matrix A):

$d(x) = A^T f(x), \qquad d_i(x) = \sum_{j} a_{ji} f_j(x) \qquad (16)$

Therefore, once the functional connections have been generated, the functional classifier becomes merely a linear classifier of the enhanced features $f_j(x)$, and the mathematics of solving linear problems can be applied. The only two steps required to determine the coefficients of A are, first, to compute the cross-correlation matrix $E\{f(x)\, y^T\}$ and the moment matrix of the enhanced features $E\{f(x)\, f(x)^T\}$ from the learning set, and second, to simply solve the set of linear equations

$E\{f(x)\, f(x)^T\} \, A = E\{f(x)\, y^T\} \qquad (17)$

The solution to this equation is found by matrix inversion alone, which is a well understood mathematical technique. Therefore, no iterative learning is necessary. Additionally, it is assured that the coefficients are optimal for the actual task or learning set [8].
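A small sketch of this procedure: a hypothetical quadratic feature enhancement f(x) followed by a linear least-squares solve of Eq. (17) in its sample form (all names are ours, and Y is assumed to be a one-of-C coded target matrix):

```python
import numpy as np

def expand(X):
    """Hypothetical quadratic feature enhancement f(x): [1, x, pairwise products]."""
    n = X.shape[1]
    quad = [X[:, i] * X[:, j] for i in range(n) for j in range(i, n)]
    return np.column_stack([np.ones(len(X)), X] + quad)

def fit_polynomial_classifier(X, Y):
    """Sample version of Eq. (17) solved by least squares.
    Y is a one-of-C target matrix (n_samples x n_classes)."""
    F = expand(X)
    A, *_ = np.linalg.lstsq(F, Y, rcond=None)
    return A

def classify_polynomial(x, A):
    scores = expand(x[None, :]) @ A
    return int(np.argmax(scores))
```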

4.9.Fisher Linear Discriminant(FLD):

Fisher Linear Discriminant (FLD) is an example of a class-specific subspace method that finds the optimal linear projection for classification. Rather than finding a projection that maximizes the projected variance as in principal component analysis, FLD determines a projection, $y = W_F^T X$, that maximizes the ratio between the between-class scatter and the within-class scatter. Consequently, classification is simplified in the projected space.

Consider a C-class problem, with the between-class scatter matrix given by

$S_B = \sum_{i=1}^{c} N_i (\mu_i - \mu)(\mu_i - \mu)^T \qquad (18)$

and the within-class scatter matrix by

$S_W = \sum_{i=1}^{c} \sum_{x_k \in X_i} (x_k - \mu_i)(x_k - \mu_i)^T \qquad (19)$

(6)

where $\mu$ is the mean of all samples, $\mu_i$ is the mean of class i, and $N_i$ is the number of samples in class i. The optimal projection $W_F$ is the projection matrix which maximizes the ratio of the determinant of the between-class scatter of the projections to the determinant of their within-class scatter,

$W_F = \arg\max_{W} \frac{|W^T S_B W|}{|W^T S_W W|} = [\,w_1 \; w_2 \; \dots \; w_m\,] \qquad (20)$

where $\{w_i \mid i = 1, 2, \dots, m\}$ is the set of generalized eigenvectors of $S_B$ and $S_W$ corresponding to the m largest generalized eigenvalues $\{\lambda_i \mid i = 1, 2, \dots, m\}$ [9].

4.10. Linear and quadratic Classifiers(LC,QC):

A quadratic form in x defines the decision boundary of a quadratic classifier, derived through Bayesian error minimization. Assuming that the distribution of each class is Gaussian, the classifier output is given by

$f(x) = \tfrac{1}{2}(x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) - \tfrac{1}{2}(x - \mu_2)^T \Sigma_2^{-1} (x - \mu_2) + \tfrac{1}{2}\ln\frac{|\Sigma_1|}{|\Sigma_2|} \qquad (21)$

where $\mu_i$ and $\Sigma_i$ (i = 1, 2) are the mean and covariance matrix of the respective Gaussian distributions. A linear classifier is a special case of the quadratic form, based on the assumption that $\Sigma_1 = \Sigma_2 = \Sigma$, which simplifies the discriminant to

$f(x) = (\mu_2 - \mu_1)^T \Sigma^{-1} x + \tfrac{1}{2}\left(\mu_1^T \Sigma^{-1} \mu_1 - \mu_2^T \Sigma^{-1} \mu_2\right) \qquad (22)$

For both classifiers, the sign of f(x) determines class membership and is equivalent to a likelihood ratio test [2].
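The two discriminants (21) and (22) translate directly into code; a minimal sketch (ours) for the two-class case:

```python
import numpy as np

def quadratic_discriminant(x, mu1, cov1, mu2, cov2):
    """Eq. (21): the sign of f(x) decides between the two Gaussian classes."""
    d1, d2 = x - mu1, x - mu2
    inv1, inv2 = np.linalg.inv(cov1), np.linalg.inv(cov2)
    return (0.5 * d1 @ inv1 @ d1 - 0.5 * d2 @ inv2 @ d2
            + 0.5 * np.log(np.linalg.det(cov1) / np.linalg.det(cov2)))

def linear_discriminant(x, mu1, mu2, cov):
    """Eq. (22): special case with a shared covariance matrix."""
    inv = np.linalg.inv(cov)
    return (mu2 - mu1) @ inv @ x + 0.5 * (mu1 @ inv @ mu1 - mu2 @ inv @ mu2)
```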

4.11. Nearest Neighbor Rule Based Classification Methods

Nearest Neighbor Classification (NNC):

To classify a new vector x, given a training data set $\{(x^y, c^y)\}$, y = 1, ..., M:

1. At each of the stored points calculate the dissimilarity to the test point x, $d_y = d(x, x^y)$.
2. Find the training point $x^{y^*}$ which is closest to x, i.e. find $y^*$ such that $d_{y^*} \le d_y$ for all y = 1, ..., M.
3. Assign $c(x) = c^{y^*}$, so that x belongs to the category of its nearest neighbor $x^{y^*}$.

If there are two or more equidistant points with different class labels, the most numerous class is taken; if two or more classes are equally numerous, the k-nearest neighbor rule is used [10][11]. Here we use several K-NN based methods in our experiment: Euclidean Distance-based K-NN Classification (EDKNN), Cosine Similarity-based K-NN Classification (CSKNN), Pearson Correlation-based K-NN Classification (PCKNN), Condensed Nearest Neighbor Classification (CNN), Reduced Nearest Neighbor Classification (RNN), Farthest-like Neighbor (FNN) and Nearest Unlike Neighbor Classification (NUN) [12][13][14][15][16].
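A minimal sketch of the basic rule (k = 1 gives plain NNC; larger k gives the majority-vote variant described above). The squared Euclidean distance is used here; the cosine and Pearson variants simply swap the dissimilarity measure:

```python
import numpy as np

def knn_classify(x, X_train, y_train, k=1):
    """Nearest-neighbour rule of Section 4.11 with majority vote over the
    k nearest stored points (k=1 reduces to the plain NNC rule)."""
    d = ((X_train - x) ** 2).sum(axis=1)
    nearest = np.argsort(d)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```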

4.12. Support Vector Machines

A Support Vector Machine is a learning algorithm for pattern classification and regression [17]. Given a labeled set of M training samples $(x_i, y_i)$, $i = 1, \dots, M$, where $x_i \in \mathbb{R}^N$ and $y_i \in \{-1, +1\}$, the SVM discriminant function is

$f(x) = \sum_{i=1}^{M} \alpha_i y_i \, k(x, x_i) + b \qquad (23)$

where $k(\cdot, \cdot)$ is a kernel function and the sign of f(x) determines the membership of x. Constructing an optimal hyperplane is equivalent to finding all the nonzero $\alpha_i$. Any vector $x_i$ that corresponds to a nonzero $\alpha_i$ is a support vector (SV) of the optimal hyperplane. A desirable feature of SVMs is that the number of training points which are retained as support vectors is usually quite small, thus providing a compact classifier.

For a linear SVM, the kernel function is just a simple dot product in the input space, while the kernel function in a nonlinear SVM effectively projects the samples to a feature space of higher (possibly infinite) dimension via a nonlinear mapping function

$\Phi : \mathbb{R}^N \rightarrow F^M, \quad M \geq N, \qquad (24)$

and then constructs a hyperplane in F. The motivation behind this mapping is that it is more likely to find a linear hyperplane in the high-dimensional feature space. Using Mercer's theorem [8], the expensive calculations required to project samples into the high-dimensional feature space can be replaced by much simpler kernel functions satisfying the condition

$k(x, x_i) = \Phi(x) \cdot \Phi(x_i) \qquad (25)$

where $\Phi$ is the nonlinear projection function. Several kernel functions, such as polynomials and radial basis functions, have been shown to satisfy Mercer's theorem and have been used successfully in nonlinear SVMs. In fact, by using different kernel functions, SVMs can implement a variety of learning machines, some of which coincide with classical architectures.

4.13. Radial Basis Function Networks

A radial basis function (RBF) network is also a kernel-based technique for improved generalization, but it is based instead on regularization theory [18]. A typical RBF network with K Gaussian basis functions is given by

$f(x) = \sum_{i=1}^{K} w_i \, g(x; c_i, \sigma_i^2) + b \qquad (26)$

where $g(x; c_i, \sigma_i^2)$ is the ith Gaussian basis function with center $c_i$ and variance $\sigma_i^2$. The weight coefficients $w_i$ combine the basis functions into a single scalar output value, with b as a bias term. Training a Gaussian RBF network for a given learning task involves determining the total number of Gaussian basis functions, locating their centers, computing their corresponding variances, and solving for the weight coefficients and bias. Judicious choice of K, $c_i$ and $\sigma_i^2$ can yield RBF networks which are quite powerful in classification and regression tasks. The number of radial bases in a conventional RBF network is predetermined before training, whereas the number for a large ensemble RBF network is iteratively increased until the error falls below a set threshold. The RBF centers in both cases are usually determined by k-means clustering.
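A sketch of such a network following Eq. (26): a crude k-means pass to place K Gaussian centres, followed by a linear solve for the weights and bias. All names, the fixed shared $\sigma$, and the one-of-C coding of Y are our simplifying assumptions:

```python
import numpy as np

def fit_rbf_network(X, Y, K, sigma):
    """Sketch of Eq. (26): K Gaussian bases with k-means centres, then a
    linear least-squares solve for the weights w_i and bias b.
    Y is a one-of-C coded target matrix (n_samples x n_classes)."""
    rng = np.random.default_rng(0)
    centers = X[rng.choice(len(X), K, replace=False)].astype(float)
    for _ in range(10):                                   # crude k-means refinement
        assign = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(K):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    G = np.exp(-((X[:, None] - centers) ** 2).sum(-1) / (2 * sigma ** 2))
    G = np.column_stack([G, np.ones(len(X))])             # bias column
    W, *_ = np.linalg.lstsq(G, Y, rcond=None)
    return centers, W

def rbf_predict(x, centers, W, sigma):
    g = np.exp(-((centers - x) ** 2).sum(-1) / (2 * sigma ** 2))
    return int(np.argmax(np.append(g, 1.0) @ W))
```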

4.14. Hidden Markov Model (HMM)

Hidden Markov Models are suitable for handwriting recognition for a number of reasons [19]. HMMs have also been successfully applied to image pattern recognition problems such as shape classification [20], and they qualify as a suitable tool for cursive script recognition. Finally, there are standard algorithms known from the literature for both training and recognition using HMMs; these algorithms are fast and can be implemented with reasonable effort. Kundu and Bahl built an HMM for the English language [21]; here we have used an HMM for our experiment.

5. Results and discussion


these are some of the confusing characters. By analyzing the above results we find that SVM gives high accuracy and a low error rate. The results are shown in the tables below.

Table 1. Accuracy rate of different classifiers (%)

Classifier   Vowels ('svar')   Consonants ('vyanjan') without modifiers   Consonants ('vyanjan') with modifiers
ED           72                74                                         71
PD           77                79                                         75
SM           76                73                                         71
MPD          82                80                                         78
PWC          85                83                                         80
MQDF         90                90                                         89
GLVQ         89                84                                         81
PC           80                82                                         77
FLD          88                83                                         81
LC           89                85                                         79
QC           90                78                                         76
NNC          90                91                                         90
EDKNN        93                95                                         94
CSKNN        92                94                                         91
PCKNN        90                93                                         90
CNN          93                93                                         91
RNN          94                91                                         91
FNN          95                93                                         93
NUN          96                95                                         95
SVM          95                93                                         94
RBF          92                93                                         90
HMM          94                91                                         91

Table 2. Average error of different classifiers (%)

Classifier   Error (%)
ED           6
PD           9
SM           8
MPD          4
PWC          6
MQDF         2
GLVQ         3
PC           2
FLD          2
LC           1
QC           3
NNC          0.23
EDKNN        0.16


6. Conclusion

In this paper we have carried out experiments on our own dataset, arranged in three parts: vowels ('svar'), consonants ('vyanjan') without modifiers, and consonants ('vyanjan') with modifiers. We obtained a highest accuracy of 96%.

References

[1] Veena Bansal, R.M.K.Sinha, “A Devanagari OCR and A Brief Overview of OCR Research for Indian Scripts”, Proc. Symposium on Translation Support System (STRANS-2001), Kanpur, India, 2001.

[2] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, second edition, Wiley Interscience, 2001.

[3] D. A. Cohn, Z. Ghahramani, and M. I. Jordan, "Active learning with statistical models," Advances in Neural Information Processing Systems, pp. 705-712, MIT Press, 1995.

[4] T. Wakabayashi, S. Tsuruoka, F. Kimura and Y. Miyake, “Increasing the Feature size in handwritten Numeral Recognition to improve accuracy,” System and Computers in Japan, Vol.26, No.8, pp.35-44, 1995.

[5] A. Sato and K. Yamada, "Generalized Learning Vector Quantization," Advances in Neural Information Processing Systems 8, Proc. of the 1995 conference, pp. 223-229, MIT Press, Cambridge, MA, USA, 1996.

[6] M. Ikeda, H. Tanaka and T. Motooka, "Projection distance method for recognition of handwritten characters (in Japanese)," Trans. IPS Japan, Vol. 24, No. 1, pp. 106-112, 1983.

[7] T. Wakabayashi, M. Shi, W. Ohyama, and F. Kimura: "A Comparative Study on Mirror Image Learning and ALSM":In Proc. 8th IWFHR, pp. 151-156, 2002.

[8] M. Cheriet, N. Kharma, C. lin Liu, and C. Suen. Character Recognition Systems: A Guide for Students and Practitioners. Wiley-Interscience, 2007

[9] D.F. Morrison. Multivariate Statistical Methods. McGraw-Hill, 1990.

[10] T.M. Cover and P. E. Hart, “Nearest neighbor pattern classification,” IEEE Trans. Inform. Theory, vol. IT-13, pp. 21-27, Jan. 1967.

[11] P. E. Hart, "An asymptotic analysis of the nearest-neighbor decision rule," Stanford Electron. Lab., Stanford, Calif., Tech. Rep. 1828-2, SEL-66-016, May 1966.

[12] E. Han and G. Karypis. “Centroid based document classification: Analysis & results.”, In Principles of Data Mining and Knowledge Discovery: fourth European Conference, pages 424-431, 2000.

[13] M. P. S. Brown, W. N. Grundy, D. Lin, N. Cristianini, C. W. Sugnet, T. S. Furey, M. A. Jr., and D. Haussler.”Knowledge-based analysis of microarray gene expression data by using support vector machines.”, Proceedings of the National Academy of Science, 97:262-267, 2000.

[14] P. E. Hart, "The Condensed Nearest Neighbor Rule," IEEE Trans. on Information Theory, Vol. IT-14, No. 3, pp. 515-516, May 1968.
[15] G. W. Gates, "The Reduced Nearest Neighbor Rule," IEEE Trans. on Information Theory, Vol. IT-18, No. 3, pp. 431-433, May 1972.
[16] Tao Hong, Stephen W. Lam, Jonathan J. Hull and Sargur N. Srihari, "The Design of a Nearest-Neighbor Classifier and Its Use for Japanese Character Recognition," in Proc. of the Third Int. Conf. on Document Analysis and Recognition, Aug. 1995, pp. 270-273.
[17] V. Vapnik, The Nature of Statistical Learning Theory. Springer, 1995.

[18] T. Poggio and F. Girosi, "Networks for approximation and learning," Proceedings of the IEEE, 78(9):1481-1497, 1990.

[19] H. Bunke, M. Roth, and E. G. Schukat-Talamazzini. Offline Cursive Handwriting Recognition using Hidden Markov Models. Pattern Recognition, 28(9):1399–1413, 1995.

[20] Y. He and A. Kundu, "2-D shape classification using Hidden Markov Models," IEEE Trans. on PAMI, Vol. 13, 1991, pp. 1172-1184.

[21] A. Kundu, Y. He and P. Bahl, "Recognition of Handwritten Word: First and Second Order Hidden Markov Model Based Approach," Pattern Recognition, 22(3), 1989, pp. 283-297.
