An experimental investigation of letter identification and scribe predictability in medieval manuscripts


Federal University of Rio Grande do Norte Center of Exact and Earth Sciences

Postgraduate Program on Systems and Computing Research Master in Systems and Computing


Francimaria Rayanne dos Santos Nascimento

Natal-RN January 2020


Nascimento, Francimaria Rayanne dos Santos.

An experimental investigation of letter identification and scribe predictability in medieval manuscripts / Francimaria Rayanne dos Santos Nascimento. - 2020.

55 leaves: ill.

Dissertation (Master's) - Federal University of Rio Grande do Norte, Center of Exact and Earth Sciences, Postgraduate Program in Systems and Computing. Natal, 2020.

Supervisor: Márjory Cristiany Da Costa Abreu.

1. Computing - Dissertation. 2. Medieval manuscripts - Dissertation. 3. Diagonal feature extraction - Dissertation. 4. Tremulous Hand of Worcester - Dissertation. 5. Calligraphy - Dissertation. 6. Letter classification - Dissertation. 7. Scribe identification - Dissertation. I. Abreu, Márjory Cristiany Da Costa. II. Title.

RN/UF/CCET CDU 004

Federal University of Rio Grande do Norte - UFRN, Library System - SISBI

Cataloguing in Publication. UFRN - Biblioteca Setorial Prof. Ronaldo Xavier de Arruda - CCET



Master dissertation entitled An experimental investigation of letter identification and scribe predictability in medieval manuscripts, submitted to the Federal University of Rio Grande do Norte as a partial requirement for obtaining the degree of Research Master in Systems and Computing.

Research area: IMAGE PROCESSING AND COMPUTATIONAL INTELLIGENCE

Supervisor

Márjory Cristiany Da Costa Abreu

PPgSC – Postgraduate program in Systems and Computing CCET – Center for Exact and Earth Sciences

UFRN – Federal University of Rio Grande do Norte

Natal-RN

January 2020


Acknowledgements

First of all, I thank God for guiding me on this journey and for giving me the strength and courage to face every obstacle along the way.

To my family, and especially to my father (Josenildo Antônio Fernandes), for his valuable advice and for always believing in me. To my beloved João Batista de Oliveira Neto, for his dedication and companionship.

I thank my supervisor, Professor Márjory Cristiany da Costa Abreu, for her dedicated commitment and for being such an academic mother figure. Thank you for your readiness to help, for your encouragement and, above all, for your role as an educator. I could not fail to thank my academic siblings and friends Julliana and Valmiro for reading my dissertation and for their encouragement. More generally, I thank all my dear colleagues and friends at DIMAp, for everything.

To Deborah Thorpe and Stephen, for their important comments and contributions to this work.

To the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), for the financial support, which was of the utmost importance for the completion of this work.


Author: Francimaria Rayanne dos Santos Nascimento Supervisor: Márjory Cristiany Da Costa Abreu

ABSTRACT

Although handwriting might seem archaic today in comparison with typed communication, it is a long-established human activity that has survived into the 21st century. Accordingly, research interest in handwritten documents, both historical and modern, is significant. The way we write has changed considerably over the past centuries. For example, the texts of the Middle Ages were often written and copied by anonymous scribes. The writing of each scribe, known as his or her ‘scribal hand’, is unique and can be differentiated using a variety of consciously and unconsciously produced features. Distinguishing between these different scribal hands is a central focus of the humanities research field known as "palaeography". This process may be supported and/or enhanced using digital techniques, and thus digital writer identification from historical handwritten documents has also flourished. The automation of the process of recognising individual characters within each scribal hand has also posed an interesting challenge. Some issues make these digital processes difficult for medieval handwritten documents, including the degradation of the paper and the soiling of the manuscript page. Thus, in this dissertation, we propose an investigation from both perspectives, character recognition and writer identification, in medieval manuscripts, in an attempt to better understand the specific behaviour of two scribes who wrote about 800 years ago, based on their manuscripts, in comparison with a modern calligrapher. The experiments showed that the degradation, and the tremor (when present), can influence the analysis of old handwritten documents. Nevertheless, the results were satisfactory, with a higher accuracy rate in letter classification than in writer identification.

Keywords: medieval manuscripts, diagonal feature extraction, Tremulous Hand of Worcester,


List of Figures

Figure 1 – Examples of handwritten documents. (a) Page extract from the Tremulous Hand of Worcester. Source: Detail of Oxford, Bodleian Library, Manuscript Junius 121, Folio vi recto. (b) Page extract from the non-tremulous scribe. Source: London, British Library, Stowe MS 34 (formerly 240), f. 32r, Vices and virtues. (c) Page extract from the modern expert calligrapher.

Figure 2 – Steps to handwriting analysis.

Figure 3 – An example of a bimodal histogram with selected threshold T (ROGOWSKA, 2000).

Figure 4 – Classification and character recognition. (a) Image re-sized to 90 × 60. (b) Partitioning of the image into 54 zones of size 10 × 10. (c) Diagonal feature extraction. Adapted from (PRADEEP; SRINIVASAN; HIMAVATHI, 2011).

Figure 5 – Comparison between the Naive Bayes, SVM and MLP classifiers based on letters (‘a’, ‘e’, ‘h’ and ‘l’) for each scribe (TREMULOUS, NON-TREMULOUS and CALLIGRAPHER).

Figure 6 – Letter samples from the TREMULOUS scribe erroneously classified.

Figure 7 – Samples of ‘l’ confused with ‘e’ by the Naive Bayes, SVM and MLP classifiers from the NON-TREMULOUS scribe (Ref.: s3-l-06, s3-l-13, s3-l-22, s3-l-30).

Figure 8 – Comparison between the Naive Bayes, SVM and MLP classifiers based on the scribes (TREMULOUS, NON-TREMULOUS and CALLIGRAPHER) for each letter (‘a’, ‘e’, ‘h’ and ‘l’).

Figure 9 – Samples of the letter ‘a’ erroneously classified by more than one classifier.

Figure 10 – Samples of the letter ‘e’ erroneously classified by more than one classifier.

Figure 11 – Samples of the letter ‘h’ erroneously classified.

Figure 12 – Samples of the letter ‘l’ erroneously classified by more than one classifier.

Figure 13 – Comparison between the Naive Bayes, SVM and MLP classifiers based on scribe (TREMULOUS, NON-TREMULOUS and CALLIGRAPHER).

Figure 14 – The letter ‘a’ samples used to analyse character features.

Figure 15 – HPP and VPP metric values for samples of the letter ‘a’ from the scribes TREMULOUS, NON-TREMULOUS and CALLIGRAPHER. The x-axis represents the height (HPP) or width (VPP) of the character in pixels and the y-axis is the sum of pixel values belonging to the character.

Figure 17 – HPP and VPP metric values for samples of the letter ‘e’ from the scribes TREMULOUS, NON-TREMULOUS and CALLIGRAPHER. The x-axis represents the height (HPP) or width (VPP) of the character in pixels and the y-axis is the sum of pixel values belonging to the character.

Figure 18 – The letter ‘h’ samples used to analyse character features.

Figure 19 – HPP and VPP metric values for samples of the letter ‘h’ from the scribes TREMULOUS, NON-TREMULOUS and CALLIGRAPHER. The x-axis represents the height (HPP) or width (VPP) of the character in pixels and the y-axis is the sum of pixel values belonging to the character.

Figure 20 – The letter ‘l’ samples used to analyse character features.

Figure 21 – HPP and VPP metric values for samples of the letter ‘l’ from the scribes TREMULOUS, NON-TREMULOUS and CALLIGRAPHER. The x-axis represents the height (HPP) or width (VPP) of the character in pixels and the y-axis is the sum of pixel values belonging to the character.

Figure 22 – PDV metric values for the letters ‘a’, ‘e’, ‘h’ and ‘l’ from the TREMULOUS, NON-TREMULOUS and CALLIGRAPHER scribes. The x-axis represents the cells (K = 8) of the characters and the y-axis is the average pixel density, over five samples, of the cells belonging to the character.

Figure 23 – HPP - Correlation between samples of the letter A of each scribe (TREMULOUS, NON-TREMULOUS, CALLIGRAPHER).

Figure 24 – VPP - Linear correlation between samples of the letter A of each scribe (TREMULOUS, NON-TREMULOUS, CALLIGRAPHER).

Figure 25 – HPP - Linear correlation between samples of the letter E of each scribe (TREMULOUS, NON-TREMULOUS, CALLIGRAPHER).

Figure 26 – VPP - Linear correlation between samples of the letter E of each scribe (TREMULOUS, NON-TREMULOUS, CALLIGRAPHER).

Figure 27 – HPP - Linear correlation between samples of the letter H of each scribe (TREMULOUS, NON-TREMULOUS, CALLIGRAPHER).

Figure 28 – VPP - Linear correlation between samples of the letter H of each scribe (TREMULOUS, NON-TREMULOUS, CALLIGRAPHER).

Figure 29 – HPP - Linear correlation between samples of the letter L of each scribe (TREMULOUS, NON-TREMULOUS, CALLIGRAPHER).

Figure 30 – VPP - Linear correlation between samples of the letter L of each scribe (TREMULOUS, NON-TREMULOUS, CALLIGRAPHER).

Figure 31 – HPP - Linear correlation between the scribes (TREMULOUS, NON-TREMULOUS, CALLIGRAPHER).

Figure 32 – VPP - Linear correlation between the scribes (TREMULOUS, NON-TREMULOUS, CALLIGRAPHER).


List of Tables

Table 1 – Summary of related work that analyses historical documents using technological processes.

Table 2 – Summary of related work that analyses historical documents in the context of disease and disorders.

Table 3 – Confusion matrices for the TREMULOUS data set. The two confusion matrices at the top are from the Naive Bayes classifier (left) and the SVM classifier (right), and the one at the bottom is from the MLP classifier. The row headers indicate the predicted class (‘a’, ‘e’, ‘l’, ‘h’) and the column headers represent the class (‘a’, ‘e’, ‘l’, ‘h’) by the classifiers.

Table 4 – Confusion matrices for the NON-TREMULOUS data set. The two confusion matrices at the top are from the Naive Bayes classifier (left) and the SVM classifier (right), and the one at the bottom is from the MLP classifier. The row headers indicate the predicted class (‘a’, ‘e’, ‘l’, ‘h’) and the column headers represent the class (‘a’, ‘e’, ‘l’, ‘h’) by the classifiers.

Table 5 – Confusion matrices for the CALLIGRAPHER data set. The two confusion matrices at the top are from the Naive Bayes classifier (left) and the SVM classifier (right), and the last is from the MLP classifier. The row headers indicate the predicted class (‘a’, ‘e’, ‘l’, ‘h’) and the column headers represent the class (‘a’, ‘e’, ‘l’, ‘h’) by the classifiers.

Table 6 – Confusion matrices for the letter ‘a’. The two confusion matrices at the top are from the Naive Bayes classifier (left) and the SVM classifier (right), and the last is from the MLP classifier. Rows indicate the labels TREMULOUS, NON-TREMULOUS and CALLIGRAPHER from top to bottom and columns represent the labels TREMULOUS, NON-TREMULOUS and CALLIGRAPHER from left to right.

Table 7 – Confusion matrices for the letter ‘e’. The two confusion matrices at the top are from the Naive Bayes classifier (left) and the SVM classifier (right), and the last is from the MLP classifier. Rows indicate the labels TREMULOUS, NON-TREMULOUS and CALLIGRAPHER from top to bottom and columns represent the labels TREMULOUS, NON-TREMULOUS and CALLIGRAPHER from left to right.

Table 8 – Confusion matrices for the letter ‘h’. The two confusion matrices at the top are from the Naive Bayes classifier (left) and the SVM classifier (right), and the last is from the MLP classifier. Rows indicate the labels TREMULOUS, NON-TREMULOUS and CALLIGRAPHER from top to bottom and columns represent the labels TREMULOUS, NON-TREMULOUS and CALLIGRAPHER from left to right.

Table 9 – Confusion matrices for the letter ‘l’. The two confusion matrices at the top are from the Naive Bayes classifier (left) and the SVM classifier (right), and the last is from the MLP classifier. Rows indicate the labels TREMULOUS, NON-TREMULOUS and CALLIGRAPHER from top to bottom and columns represent the labels TREMULOUS, NON-TREMULOUS and CALLIGRAPHER from left to right.

Table 10 – Confusion matrices for writer identification. The two confusion matrices at the top are from the Naive Bayes classifier (left) and the SVM classifier (right), and the last is from the MLP classifier. Rows indicate the labels TREMULOUS, NON-TREMULOUS and CALLIGRAPHER from top to bottom and columns represent the labels TREMULOUS, NON-TREMULOUS and CALLIGRAPHER from left to right.


1 Introduction

The analysis of handwritten documents is a relevant problem in the machine learning, image processing and pattern recognition research areas. For such applications, it is important to consider the different types of handwritten documents that can be analysed, such as pre-modern and contemporary manuscripts (TOLEDO et al., 2017; ZAMORA-MARTINEZ et al., 2014).

One of the most important tasks in handwriting analysis is writer identification, which aims to determine the author of a specific text, given a training data set (CHRISTLEIN et al., 2017). The data extraction process can be categorised into two types: online and off-line (WANG et al., 2018; KAMBLE; HEGADI, 2015). In the online process, the data are collected as spatio-temporal information. The off-line process, on the other hand, obtains the information by processing only the document image data, without using any dynamic features.

Another relevant (and related) problem is Handwritten Text Recognition (HTR), in which handwriting is extracted from images and converted by a computer into a symbolic representation such as ASCII text (ROMERO; TOSELLI; VIDAL, 2012). However, distinct languages, writing styles and input contents may increase the complexity of an HTR system.

Analysing old manuscripts in this context may present particular problems, such as ink degradation, background noise, dye stains and writing failures. Thus, handwriting analysis systems may be less accurate than systems for machine-printed documents, especially when dealing with degraded historical documents (SALEEM et al., 2014).

An important consequence (or contribution, if found) of analysing old documents is to understand the characteristics of each scribe and to try to identify any relevant condition that the scribe might have had, such as high consumption of alcohol (HAUBENBERGER et al., 2011) or neurological conditions (THORPE, 2015). By finding the written characteristics of each of those conditions using only off-line handwriting analysis of the manuscripts, we can better understand modern conditions as well (SCHIEGG; THORPE, 2017; ZHI et al., 2017).

In this work, documents written in Middle English by three different scribes are investigated. Two of them were thirteenth-century scribes, and the third is a modern expert calligrapher writing with medieval tools and techniques. One of the medieval scribes shows no evident pathological writing features, and the other is an anonymous scribe known as ‘The Tremulous Hand of Worcester’, or simply ‘the Tremulous Hand’, due to his distinctive writing tremor. A previous study by Thorpe and Alty (2015), based on features of the scribe’s writing, indicated that he had essential tremor. The authors found that the handwriting demonstrated a tremor of regular amplitude: the individual letters in words presented the same degree of lateral deviation. This common neurological condition typically presents a ‘fine’ or ‘fine-moderate’ amplitude of tremor, with a relatively fast frequency.

Thus, this dissertation proposes an investigation of degraded documents, as well as of a contemporary text, in the context of diseases and disorders, ranging from identifying the impact of the tremor (when present) to recognising letters and the scribe him/herself. This is motivated by the fact that even medieval handwriting can contain recognisable signs of the writer's neurological conditions (THORPE; ALTY, 2015). The analysis of historical handwriting can therefore help to understand the impact of diseases and disorders in the medieval context.

Since our aim is to investigate individual letter images from the three writers, we have chosen a simple and efficient method for the extraction of image features, called diagonal feature extraction, which segments the image into zones. Experiments performed by Pradeep, Srinivasan and Himavathi (2011) on character recognition, using neural networks for character classification, showed that the method is effective.
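As an illustration of zone-based diagonal feature extraction, the following minimal sketch assumes a binarised character image of 90 rows by 60 columns split into 54 zones of 10 × 10 pixels, with each zone contributing the average of the foreground-pixel sums along its 19 diagonals, as we understand the method of Pradeep, Srinivasan and Himavathi (2011). The function names and the toy image are hypothetical, and the sketch is not the exact implementation used in this work.

```python
def zone_diagonal_feature(zone):
    """Average of the foreground sums along every diagonal of a square zone."""
    n = len(zone)
    sums = [0] * (2 * n - 1)  # a 10 x 10 zone has 19 diagonals
    for r in range(n):
        for c in range(n):
            sums[r + c] += zone[r][c]  # index r + c selects one diagonal
    return sum(sums) / len(sums)


def diagonal_features(image, zone_size=10):
    """Split a binary image (list of rows, 1 = ink) into zones, one feature per zone."""
    height, width = len(image), len(image[0])
    features = []
    for top in range(0, height, zone_size):
        for left in range(0, width, zone_size):
            zone = [row[left:left + zone_size] for row in image[top:top + zone_size]]
            features.append(zone_diagonal_feature(zone))
    return features


# Toy 90 x 60 image (assumption): one vertical stroke, loosely resembling an 'l'.
image = [[1 if 28 <= x < 32 else 0 for x in range(60)] for _ in range(90)]
features = diagonal_features(image)
print(len(features))  # 54 zones, hence 54 features
```

Because every pixel lies on exactly one diagonal, the per-zone value in this sketch equals the zone's ink count divided by 19; any further features (for example, averages over rows or columns of zones) would be built on top of these 54 values.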

1.1 Motivation

The analysis of manuscripts is an important research problem that can provide data not only about the text but also about the writer. Handwriting can give us useful information about the clinical condition of a person, since writing involves cognitive and psychomotor processes (IMPEDOVO; PIRLO, 2018; POON et al., 2019). Moreover, in cases of posthumous diagnosis, medical records, often incomplete or even discordant, and written documents are sometimes the only sources of information for diagnosis (BALESTRINO et al., 2012; FONTANA et al., 2008). Palaeography, the study of old or medieval manuscripts, also investigates characteristics of neurological diseases that can be found in these documents, such as "essential tremor" (THORPE, 2015).

The most famous anonymous writer with essential tremor, known as ‘the Tremulous Hand of Worcester’, is of great importance for investigating pathological evidence from writing, for he is the only widely known medieval writer with a tremor, and he is also notable for his particular interest in translating documents written centuries earlier (THORPE; ALTY, 2015). As the documents produced by the scribe contain no information about his tremor, his handwriting and limited clues on this subject are the only sources for the study of his clinical condition.

Moreover, some features, such as the frequency of the tremor, are hard to measure from historical documents because there is no information about the pen speed during writing. However, letters with long downstrokes of similar height, such as ‘l’, ‘h’, ‘k’ and ‘p’, can be used to estimate the frequency based on the number of crossings about the midline (THORPE; ALTY, 2015).
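A minimal sketch of such a crossing count is given below, assuming the downstroke has already been traced as a sequence of horizontal pen positions sampled from top to bottom. The tracing step, the use of the mean position as the midline, the function name and the sample values are all hypothetical. Converting the count into a frequency would still require an assumed pen speed, which the manuscripts do not provide.

```python
def count_midline_crossings(x_positions):
    """Count sign changes of the lateral deviation around the stroke's midline."""
    midline = sum(x_positions) / len(x_positions)  # assumption: mean x is the midline
    # Points lying exactly on the midline carry no sign and are skipped.
    deviations = [x - midline for x in x_positions if x != midline]
    crossings = 0
    for previous, current in zip(deviations, deviations[1:]):
        if previous * current < 0:  # opposite signs: the stroke crossed the midline
            crossings += 1
    return crossings


steady = [10, 10, 10, 10, 10, 10]      # hypothetical straight downstroke
tremulous = [9, 11, 9, 11, 9, 11]      # hypothetical oscillating downstroke
print(count_midline_crossings(steady))     # 0
print(count_midline_crossings(tremulous))  # 5
```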

In this way, the analysis of the handwriting of the Tremulous Hand of Worcester, in contrast to a handwritten document of a similar epoch produced by another scribe who had no pathological evidence in his writing and to a modern calligraphy written with medieval tools and techniques, may help to identify and measure some characteristics present in the writing that are necessary for the diagnosis of the scribe. Thus, the evidence extracted in this work may support the development of technological tools that analyse manuscript characteristics to identify the clinical condition of an individual.

1.2 Objectives

The main objective of this work is to investigate medieval and modern handwriting in order to understand the document characteristics and to compare the impact of the presence of disorders on recognising letters and the scribe him/herself. To this end, the following specific objectives are stipulated:

a) Extract and analyse character features in order to understand the pre-processing of the text;

b) Investigate how best to classify the characters of each individual writer;

c) Investigate how best to classify each individual character by writer.

Following this clear path, we believe we can better understand how to apply modern technology and associate our findings with what we know today about neurological conditions.

1.3 Overview

This work is organised as follows: Chapter 2 presents the related work. Chapter 3 explains the methodology adopted for character recognition, writer identification and the metrics for character analysis. Chapter 4 discusses the results obtained. Chapter 5 presents the conclusions and future work.


2 Related work

The literature investigates several approaches for the recognition and treatment of degraded historical manuscripts. Alongside technological approaches that automate the analysis of historical documents, palaeography also investigates characteristics of the neurological condition of medieval scribes from these documents. Some related work is described in this chapter.

2.1 Analysis of historical documents using technological process

The automatic analysis of historical documents is not trivial, especially because the number of features available in old manuscripts is limited. In order to better understand and explore this application, several approaches based on image processing, pattern recognition and machine learning have been proposed to address problems such as treatment of degradation, binarisation, character recognition, text segmentation and scribe identification. Some of the most recent works that explore these areas are presented below.

Gatos, Pratikakis and Perantonis (2006) proposed the use of an adaptive binarisation technique to enhance degraded historical handwritten documents, old newspapers, and poor quality modern documents using Wiener’s filter and Sauvola’s approach. Su, Lu and Tan (2010) used an approach based on local maximum and minimum for binarisation of handwritten historical documents. According to the authors, the method was efficient in the treatment of different types of degradation in documents such as smear and irregular illumination.
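To illustrate the kind of local thresholding that Sauvola's approach relies on, the sketch below computes the commonly stated Sauvola threshold T = m · (1 + k · (s / R − 1)) from the mean m and standard deviation s of a small window around each pixel. The parameter values (k = 0.5, R = 128), the 3 × 3 window, the function names and the toy image are assumptions for illustration; this is not the full adaptive method of Gatos, Pratikakis and Perantonis (2006), which also involves Wiener filtering and further steps.

```python
from statistics import mean, pstdev


def sauvola_threshold(window, k=0.5, R=128):
    """Sauvola threshold for one window of grey levels (0 = black, 255 = white)."""
    m = mean(window)
    s = pstdev(window)
    return m * (1 + k * (s / R - 1))


def binarise(image, radius=1, k=0.5, R=128):
    """Label each pixel as ink (1) or background (0) using a local threshold."""
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the neighbourhood, clipped at the image borders.
            window = [
                image[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            threshold = sauvola_threshold(window, k, R)
            row.append(1 if image[y][x] <= threshold else 0)
        result.append(row)
    return result


# Toy grey-level image (assumption): one dark pixel on a light background.
page = [[200, 200, 200], [200, 20, 200], [200, 200, 200]]
print(binarise(page))  # [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
```

In a flat region the standard deviation is zero, so the threshold drops to half of the local mean and background pixels are not mistaken for ink; near a stroke the higher deviation raises the threshold towards the mean.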

The use of local features and SVM to select an optimal global threshold for binarisation of degraded historical documents was presented by Xiong et al. (2018). The authors evaluated the performance of the method on 21 images of degraded manuscripts. The results showed superior performance in comparison with other state-of-the-art techniques.

An adaptive method based on character models for the segmentation of highly degraded historical documents was proposed by Bar-Yosef et al. (2009) and applied to an antique Jewish prayer book, written between the 11th and the 13th centuries. They used 500 degraded characters of 16 different Hebrew letters for identification. The idea was to construct a small set of shapes based on the training set, with variable shapes, so that the system could classify different forms of writing a given letter. The Normalised Cross-Correlation (NCC) was applied to identify the matching characters, yielding 98% correct recognition.
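As a rough illustration of how Normalised Cross-Correlation scores the match between a character model and an image patch, the sketch below implements the zero-mean form of NCC for two equally sized grey-level patches. The function name and the toy patches are hypothetical, and the sketch does not reproduce the shape-model construction of Bar-Yosef et al. (2009).

```python
from math import sqrt


def ncc(patch_a, patch_b):
    """Zero-mean normalised cross-correlation of two equally sized 2-D patches."""
    a = [v for row in patch_a for v in row]
    b = [v for row in patch_b for v in row]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    numerator = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    denominator = sqrt(
        sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b)
    )
    # A completely flat patch has no defined score; return 0.0 by convention here.
    return numerator / denominator if denominator else 0.0


template = [[0, 1, 0], [0, 1, 0], [0, 1, 1]]
rescaled = [[5 + 2 * v for v in row] for row in template]  # same shape, other contrast
inverted = [[1 - v for v in row] for row in template]

print(round(ncc(template, template), 6))  # 1.0
print(round(ncc(template, rescaled), 6))  # 1.0
print(round(ncc(template, inverted), 6))  # -1.0
```

The score is unchanged by a linear change of brightness or contrast, which is one reason this kind of measure is attractive when ink and background intensities vary across a degraded page.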

| References | Database | Objective | Feature extraction technique | Techniques | Results |
|---|---|---|---|---|---|
| Gatos, Pratikakis and Perantonis (2006) | Degraded and poor quality documents | Treatment of degraded historical documents | Wiener’s filter and Sauvola’s | Adaptive binarisation technique | – |
| Su, Lu and Tan (2010) | Handwritten historical documents | Treatment of degraded historical documents | Local maximum and minimum | Binarisation technique | – |
| Xiong et al. (2018) | 21 images of degraded document manuscripts | Select an optimal global threshold for binarisation of degraded historical documents | Global threshold | Local features and SVM | – |
| Bar-Yosef et al. (2009) | Antique Jewish prayer book, written between the 11th and the 13th centuries | Character recognition | A small set of shape models | Normalised Cross-Correlation (NCC) | 98% |
| Bukhari et al. (2012) | Arabic historical manuscripts | Text segmentation | Connected-component level | MLP classifier and voting system | 95% |
| Diem and Sablatnig (2010b) | Degraded ancient Slavonic manuscripts from the 11th century | Character recognition | Local descriptors | Multiple SVMs with RBF kernel | Nor.: 86.2%; Deg.: 53.9%; setB: 71.2% |
| Diem and Sablatnig (2010a) | Glagolitic characters from the St. Catherine’s Monastery | Character recognition | Local descriptor | SVM classifier and a voting system | Nor.: 86.2%; Deg.: 53.9%; setB: 71.2% |
| Saleem et al. (2014) | Glagolitic characters | Character recognition | SIFT and NNDM | SIFT and NNDM | Nor.: 95.9%; Deg.: 84.4% |
| Saleem, Hollaus and Sablatnig (2014) | Glagolitic characters written in the 11th century | Character recognition | Local minima | Dense SIFT and nearest neighbour distance maps | Nor.: 96.2%; Deg.: 83.7% |
| Gilliam, Wilson and Clark (2010) | Medieval English manuscripts of the 14th-15th century | Scribe identification | Grapheme code-book method | SMLR and K-NN | 78% |

Table 1 – Summary of related work that analyses historical documents using technological processes

A layout analysis of Arabic historical manuscript documents with machine learning was proposed by Bukhari et al. (2012). The authors extracted simple and discriminative features at the connected-component level, which were combined into a robust feature vector. An MLP neural network was used for component classification of the text, followed by a voting system as the final classifier.

Diem and Sablatnig (2010b) presented a methodology for character recognition in degraded ancient Slavonic manuscripts from the 11th century. The authors proposed an OCR system divided into two major steps: character classification and character localisation. In this approach, they explored local descriptors extracted directly from grey-scale images. Then, multiple SVMs with Radial Basis Function (RBF) kernels were used to classify the descriptors. The k-means algorithm was employed for character localisation based on interest points, using the local descriptors previously computed. The authors used a voting system for the final character classification.

Diem and Sablatnig (2010a) described an approach based on local descriptors in which characters were localised using clustering techniques. The authors tried to recognise and classify Glagolitic characters in degraded manuscript documents such as the ones from the St. Catherine’s Monastery. They used an SVM to classify the local descriptors and then a voting system for character recognition.

An approach for the recognition of Glagolitic characters in degraded historical documents was proposed by Saleem et al. (2014). Dense Scale Invariant Feature Transform (SIFT) and Nearest Neighbour Distance Maps (NNDM) were used to localise and classify Glagolitic characters. A new pre-processing method proposed by the authors and Total Variation (TV) were used for image restoration and noise reduction. The experiments showed that, for the analysed documents, the best results were obtained using image restoration as a pre-processing step to Dense SIFT.

Saleem, Hollaus and Sablatnig (2014) used Dense SIFT and nearest neighbour distance maps to recognise Glagolitic characters in degraded documents written in the 11th century. In this approach, the local minima of the NNDM output were used for character localisation and recognition in the documents.

An approach to scribe identification in medieval English manuscripts of the 14th-15th century was presented by Gilliam, Wilson and Clark (2010). The authors compared the performance of Sparse Multinomial Logistic Regression (SMLR) with the k-NN classifier, achieving a classification accuracy of over 78% using the grapheme code-book method.

2.2 Analysis of historical documents in the context of disease and disorders

In the context of palaeographic studies, diseases and disorders of ancient scribes are investigated based on their handwriting features. In this way, the researchers try to analyse and to understand the clinical condition of old writers, using some techniques and tools of modern medicine. Some of the most recent works will be discussed below.

An analysis of historical handwriting distortions from patients in an early 20th-century psychiatric hospital in southern Germany was presented by Schiegg and Thorpe (2017). The study demonstrated that combining historical handwriting analysis with modern medicine can help to re-contextualise individual writing disorders and could offer information about medical conditions that involve writing disorders.


Thorpe, Alty and Kempster (2019) proposed a retrospective diagnosis of John Ruskin (1819-1900). Based on a palaeographical study, the authors investigated the relationship between features of the writing in his letters and diary entries and Ruskin’s clinical condition. They concluded that he had an organic neurological disorder.

| References | Database | Objective |
|---|---|---|
| Schiegg and Thorpe (2017) | Writing tests and letters from patients in an early 20th-century psychiatric hospital in southern Germany (Irsee/Kaufbeuren) | Show through different case studies how the scrutiny of historical handwriting can re-contextualise individual writing disorders |
| Thorpe, Alty and Kempster (2019) | Letters and diary entries of John Ruskin (1819-1900) | A palaeographic investigation of the historical documents and a retrospective medical diagnosis of John Ruskin |

Table 2 – Summary of related work that analyses historical documents in the context of disease and disorders

2.3 Chapter summary

In this chapter, we have presented the analysis and treatment of degraded historical documents based on different approaches, such as binarisation, character recognition and scribe identification, showing that relevant investigations have been performed in this area and can serve as a basis for our research.

The treatment and analysis of degraded historical texts were shown to be relevant and challenging problems, owing to the particular issues of old documents investigated in the literature (see summary in Table 1). However, besides the common issues of medieval manuscripts, signs of the scribe’s clinical condition can be identified and may also influence (when present) handwriting analysis (see summary in Table 2).

Based on this, in our work, we will perform an experimental investigation of degraded documents of the Middle Ages and a modern text, in the context of disease and disorders, to understand the characteristics and to compare the impact of the presence of movement disorders on the handwriting, using a computational approach.


3 Understanding how to analyse user and letter identification from medieval data

From what we have observed in Chapter 1, the analysis of historical handwritten texts is a relevant research question, and for such related applications we have investigated different tasks. But first, we need to understand the problems that old manuscripts may present and that could influence the accuracy rate of the experiments.

Historical handwritten documents present problems such as degradation, ink stains, writing failures, background noise, and so on. In addition, particular characteristics caused by diseases and disorders (when present) can increase the difficulty of the handwriting analysis process.

Considering these problems, we have performed a series of steps to extract and understand the data of a medieval handwritten document by the scribe known as the "Tremulous Hand of Worcester", in contrast with texts written in Middle English by a non-tremulous scribe and by a modern calligrapher.

The analysed data sets consist of characters cropped manually from handwritten documents (see Figure 1). The characters were extracted from:

a) TREMULOUS - a sample from the 13th century written by an anonymous scribe known as "Tremulous Hand of Worcester" (171 character samples of TREMULOUS),

b) NON-TREMULOUS - a sample from the 13th century written by a non-tremulous scribe (171 character samples of NON-TREMULOUS) and

c) CALLIGRAPHER - a reproduction of the text copied by the tremulous writer by a modern expert calligrapher (80 character samples of CALLIGRAPHER).

Our data set contains a total of 422 letter images (‘a’, ‘e’, ‘h’, and ‘l’). We chose the letters ‘a’ and ‘e’ because they have a similar shape and are more frequent in the text, which allows us to explore a possible OCR system; the letters ‘h’ and ‘l’, on the other hand, were chosen because they have a long vertical stroke, in which the tremor in the writing is more evident. Our investigation was restricted to this limited data set because of resource limitations for data collection.

The TREMULOUS and NON-TREMULOUS data sets each have 30 examples of the letter ‘a’, 90 of the letter ‘e’, 20 of the letter ‘h’ and 31 of the letter ‘l’, resulting in 171 examples from NON-TREMULOUS and 171 from TREMULOUS, for a total of 342 samples. The CALLIGRAPHER data set is formed by 20 examples of each of the letters (‘a’, ‘e’, ‘h’, ‘l’), resulting in 80 samples. For the task of character recognition, the data are classified into the ‘a’, ‘e’, ‘h’ and ‘l’ classes. In the task of writer identification, on the other hand, the data are classified into the TREMULOUS, NON-TREMULOUS and CALLIGRAPHER classes.


Figure 1 – Examples of handwritten documents (a) Page extract from the Tremulous Hand of Worcester. Source: Detail of Oxford, Bodleian Library, Manuscript Junius 121, Folio vi recto. (b) Page extract from the non-tremulous scribe. Source: London, British Library, Stowe MS 34 (formerly 240), f. 32r, Vices and virtues (c) Page extract from the modern expert calligrapher.

In this work, the handwriting analysis process consists of the following steps: pre-processing, segmentation and feature extraction, classification and recognition, and post-processing (see Figure 2), which will be presented in the next sections.


3.1 Pre-processing

Pre-processing is the initial step, in which a series of essential operations is performed to treat the input images, which is particularly important in our context of degraded images (GATOS; PRATIKAKIS; PERANTONIS, 2006).

A global threshold was used to convert the grey-scale image into a binary image, in which pixels with value 0 represent the background and pixels with value 1 represent the object. This method assigns the data with intensity below the threshold to one phase and the remaining data to the other, achieving a good separation between the background and the foreground (SEZGIN; SANKUR, 2004; GATOS; PRATIKAKIS; PERANTONIS, 2006). Assume that we have an image 𝐼(𝑥, 𝑦) with the histogram presented in Figure 3.

Figure 3 – An example of a bimodal histogram with selected threshold T (ROGOWSKA, 2000).

Two dominant modes of grey levels are grouped, corresponding to the background and object pixels (ROGOWSKA, 2000). Thus, a threshold 𝑇 that separates these modes in the grey-scale image 𝐼 is selected using Otsu’s method (OTSU, 1979) in order to extract the object from the background. The thresholded image 𝑔′(𝑥, 𝑦) is defined in Equation 3.1.

$$g'(x, y) = \begin{cases} 1 & \text{if } g(x, y) > T \\ 0 & \text{if } g(x, y) \le T \end{cases} \qquad (3.1)$$
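As an illustration of this thresholding step, the following is a minimal Python sketch of Otsu's threshold selection followed by the rule in Equation 3.1. It is not the implementation used in this work: the function names `otsu_threshold` and `binarise`, the 256 grey levels, and the list-of-lists image representation are assumptions made for the example.

```python
def otsu_threshold(pixels, levels=256):
    """Return the grey level T that maximises the between-class variance (Otsu)."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with grey level <= t
    sum0 = 0  # sum of the grey levels of those pixels
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, so the variance is undefined
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def binarise(image, t):
    """Apply Equation 3.1: 1 where g(x, y) > T, 0 otherwise."""
    return [[1 if g > t else 0 for g in row] for row in image]


# Hypothetical 2 x 3 grey-scale patch, for illustration only.
patch = [[10, 12, 200], [11, 210, 205]]
T = otsu_threshold([g for row in patch for g in row])
binary_patch = binarise(patch, T)
```

When several thresholds give the same between-class variance, this sketch keeps the smallest one; other implementations may break such ties differently.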

Subsequently, the Sobel technique was used for edge detection in the binarised image. Compared to other edge detection techniques, Sobel has the advantages of smoothing image noise to some extent and of differentiating across two rows or two columns, so that the edges on both sides are highlighted (GUPTA; MAZUMDAR, 2013). Each image was segmented into individual characters, which were uniformly re-sized to 90×60 pixels.
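The Sobel operator mentioned above can be sketched as follows. This is an illustrative pure-Python version using the standard 3×3 Sobel kernels, not the code used in this work; the function name `sobel_magnitude` and the choice to leave border pixels at zero are assumptions of the example.

```python
def sobel_magnitude(img):
    """Gradient magnitude using the 3x3 Sobel kernels; border pixels stay at 0."""
    kx = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]    # horizontal derivative
    ky = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]    # vertical derivative
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0
            for dy in range(3):
                for dx in range(3):
                    v = img[y + dy - 1][x + dx - 1]
                    gx += kx[dy][dx] * v
                    gy += ky[dy][dx] * v
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out


# Hypothetical binary patch with a vertical edge between columns 1 and 2.
edge_patch = [[0, 0, 1, 1] for _ in range(4)]
edges = sobel_magnitude(edge_patch)
```

Each kernel weights the central row or column twice as much as its neighbours, which is the source of the smoothing effect described above.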

3.2 Feature extraction

At this stage, we have used the diagonal feature extraction scheme proposed by Pradeep, Srinivasan and Himavathi (2011). The method is based on three ways of extracting features: in the vertical direction, the horizontal direction, and the diagonal direction. The feature extraction process is simple and provides good recognition accuracy.

The images of size 90 × 60 were partitioned into 54 zones of equal dimensions, each of size 10 × 10 (an example can be seen in Figure 4c). The features are extracted from the pixels of each zone. Each zone has 19 diagonal lines. The values along each diagonal line are summed, resulting in 19 sub-features for each zone. The average of these 19 sub-feature values forms the single feature value of the corresponding zone (see Figure 4b). The process is repeated for all zones, resulting in 54 features.

Besides the diagonal features, we have also measured the averages over the rows and columns of the zones, resulting in 9 features corresponding to the rows and 6 to the columns. Each individual character was thereby represented by 69 features. The following algorithm describes the feature extraction process for each letter image.

STEPS FOR FEATURE EXTRACTION

Input: Character image

Output: 69 features

Step I: Split the image (size 90 × 60) into n (54) zones of equal dimensions, each of size 10 × 10;

Step II: Extract the features from the pixels of each zone;

Step III: Sum the pixel values along the diagonal lines, resulting in 19 sub-features;

Step IV: Average these 19 sub-features, resulting in one feature;

Step V: Repeat the process for all zones, resulting in 54 features;

Step VI: Compute the average of the rows (9 features) and columns (6 features) of the zones.
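The steps above can be sketched in Python as follows. This is a minimal illustration rather than the original implementation: the function name `diagonal_features` is hypothetical, the image is assumed to be a list of 90 rows of 60 pixel values, and the diagonals are taken in the anti-diagonal direction (the zone average is the same for either diagonal direction, since every pixel lies on exactly one diagonal).

```python
def diagonal_features(img, zone=10):
    """Return 54 zone features + 9 row averages + 6 column averages (69 values)."""
    n_zone_rows = len(img) // zone       # 9 for a 90 x 60 image
    n_zone_cols = len(img[0]) // zone    # 6 for a 90 x 60 image
    grid = []
    for zr in range(n_zone_rows):
        row_feats = []
        for zc in range(n_zone_cols):
            # Step III: sum the pixels along each of the 2*zone - 1 = 19 diagonals.
            diag_sums = [0.0] * (2 * zone - 1)
            for i in range(zone):
                for j in range(zone):
                    diag_sums[i + j] += img[zr * zone + i][zc * zone + j]
            # Step IV: average the 19 sub-features into one zone feature.
            row_feats.append(sum(diag_sums) / len(diag_sums))
        grid.append(row_feats)

    zone_feats = [v for row in grid for v in row]                    # Step V
    row_means = [sum(row) / n_zone_cols for row in grid]             # Step VI (rows)
    col_means = [sum(grid[r][c] for r in range(n_zone_rows)) / n_zone_rows
                 for c in range(n_zone_cols)]                        # Step VI (columns)
    return zone_feats + row_means + col_means


# Hypothetical all-ink image, used only to show the shape of the output.
features = diagonal_features([[1] * 60 for _ in range(90)])
```

For a 90 × 60 image, the returned vector has 54 + 9 + 6 = 69 entries, matching the feature count stated above.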

3.3 Classification and character recognition

After the feature extraction stage, character recognition and writer identification were performed to analyse the potential of the database. We have used the SVM and MLP classifiers because they performed well in related work (BUKHARI et al., 2012; DIEM; SABLATNIG, 2010b; DIEM; SABLATNIG, 2010a; XIONG et al., 2018).


Figure 4 – Classification and Characters Recognition (a) Image re-sized to 90 × 60 (b) Partitioning of the image into 54 zones of size 10 × 10 (c) Diagonal feature extraction. Adapted from (PRADEEP; SRINIVASAN; HIMAVATHI, 2011).

Naive Bayes was applied to classify the data from a probabilistic perspective. We did not use a Deep Learning approach because achieving good prediction accuracy would require a very large amount of training data (CHILIMBI et al., 2014), which is not feasible with the small amount of data available to us. The classifiers used are available in the Weka toolbox and can be described as follows:

∙ Naive Bayes (MITCHELL, 2005) is a probabilistic classifier based on the application of Bayes’ theorem. The method assumes that the features are conditionally independent given the hypothesis: instead of calculating the joint probability 𝑃(𝑥1, 𝑥2, 𝑥3|ℎ) of the attribute values 𝑥𝑖 given a hypothesis ℎ, it simplifies the calculation to the product 𝑃(𝑥1|ℎ) * 𝑃(𝑥2|ℎ) * 𝑃(𝑥3|ℎ). All experiments used WEKA’s implementation of Naive Bayes (WEKA, b) with the following configuration: useKernelEstimator false, useSupervisedDiscretization false.

∙ SVM is a classifier based on Statistical Learning Theory, proposed in Vapnik (1998). The algorithm finds the optimal separating hyperplane, such that the largest possible number of points belonging to the same class stays on the same side while the distance of each class to the hyperplane is maximised. The subsets of training points of the two classes that lie closest to the optimal separating hyperplane are called support vectors (SURINTA et al., 2015). WEKA’s implementation of the SVM algorithm (WEKA, c) is based on the implementations from Keerthi et al. (2001) and Platt (1998). In our experiments, the following configuration was used: C 1.0, Kernel Function PolyKernel.

∙ Neural network MLP (HAYKIN, 2007) is a generalisation of the simple Perceptron algorithm. An MLP is formed by an input layer, one or more hidden layers and one output layer. The training of the MLP is supervised and uses the error back-propagation algorithm. In WEKA’s implementation of the MLP (WEKA, a), all nodes are sigmoid. The classifier allows parameters such as the number of hidden layers, learning rate, training time, and momentum rate to be modified. All experiments were performed using the following configuration: hidden layers 𝑎 ((number of attributes + number of classes) / 2), learning rate 0.3, momentum 0.2, training time 500, threshold 20, validation set 0.
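For reference, the factorisation described in the Naive Bayes item above generalises from three attributes to n attributes, and the predicted class is the hypothesis with the highest resulting posterior score. The symbol h_NB is notation introduced here for illustration; in our setting n would correspond to the 69 extracted features.

```latex
P(x_1, x_2, \dots, x_n \mid h) = \prod_{i=1}^{n} P(x_i \mid h),
\qquad
h_{\mathrm{NB}} = \arg\max_{h} \; P(h) \prod_{i=1}^{n} P(x_i \mid h)
```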

Different simulations were performed to select the classifier configurations. The configurations that obtained the best results in the majority of the cases investigated were selected.

The 𝑘-fold cross-validation, with 𝑘 equal to ten, was used to estimate the accuracy of the classification algorithms, because extensive tests on different data sets with various machine learning techniques have shown that ten-fold cross-validation is a good compromise for obtaining the best estimate of error (MCLACHLAN; DO; AMBROISE, 2005; WITTEN et al., 2016). The method consists of partitioning the data set randomly into ten mutually exclusive subsets, where one subset is used for model validation and the other subsets for model estimation (training), with each part held out in turn (KOHAVI et al., 1995). The method is thus applied ten times, each time on a different training set. The error estimates are then averaged to produce an overall error estimate. Thus, all the confusion matrices we present show the overall estimate provided by the ten iterations.
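The partitioning described above can be sketched as follows. This is a simplified illustration and not the WEKA procedure itself: the function name `k_fold_indices` and the fixed seed are assumptions, and the folds here are purely random, whereas a toolkit may additionally balance the classes within each fold.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k mutually exclusive random folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # fold sizes differ by at most one
    for held_out in range(k):
        test = folds[held_out]
        train = [idx for f, fold in enumerate(folds) if f != held_out for idx in fold]
        yield train, test


# Each of the 422 samples is held out exactly once across the ten iterations.
splits = list(k_fold_indices(422, k=10))
```

The overall error estimate is then the mean of the ten per-fold error rates obtained by training on `train` and evaluating on `test`.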

3.4 Character analysis

In this work, we proposed to use five samples of each of the letters ‘a’, ‘e’, ‘h’ and ‘l’ from the TREMULOUS, NON-TREMULOUS and CALLIGRAPHER scribes. This analysis is intended to detect changes in document characteristics such as the quantity of ink deposited, size reduction, and pixel density within a writing sample.

As mentioned previously, we have chosen the letters ‘a’ and ‘e’ because they have a similar shape. On the other hand, we have chosen the letters ‘h’ and ‘l’ because they have a longer vertical stroke, where the tremor is more visible.

Prior to calculating the metrics, we have done some image pre-processing. The background of the grey-scale images was removed, but only the backgrounds of the CALLIGRAPHER data set images were removed automatically. We have made manual adjustments to the historical documents because of their degradation, writing failures, background noise, and dye stains, which were considered visual noise. This step was developed by the authors and performed in MATLAB to remove marks that are not relevant to the characters.

Subsequently, two types of metrics based on pixel values were used for character analysis: Global and Local metrics. The metrics will be described in the next sections.

3.4.1 Global metric

The global metrics are based on the function 𝑓(𝑥, 𝑦), which is defined on the plane (𝑥, 𝑦): at a point (𝑥0, 𝑦0) ∈ (𝑥, 𝑦), 𝑓 is zero whenever the pixel belongs to the background of the letter; otherwise 𝑓 is the grey-scale pixel value of the digitised image at that point. The grey-scale value is used because we work with historical documents and need the pixel value in each sample. The horizontal projection profile (HPP) is calculated by Equation 3.2, and the vertical projection profile (VPP) by Equation 3.3 (ZHI et al., 2017).

$$HPP(x) = \sum_{y} f(x, y) \qquad (3.2)$$

$$VPP(y) = \sum_{x} f(x, y) \qquad (3.3)$$

The HPP denotes the sum of all pixels along a row of the image. Similarly, the VPP is the sum along each column (MAMATHA; SRIKANTAMURTHY, 2012). The area under an HPP or VPP curve indicates the number of pixels accumulated in the sample, which corresponds to the quantity of ink used in the manuscript sample. We expect distortion in the writing to influence the shape of the HPP and VPP curves.
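As a minimal sketch of Equations 3.2 and 3.3, assuming the image is stored as a list of rows with background pixels already set to zero (the function name `projection_profiles` is hypothetical):

```python
def projection_profiles(f):
    """Return (HPP, VPP): the sums of f along each row and along each column."""
    hpp = [sum(row) for row in f]         # one value per row (Equation 3.2)
    vpp = [sum(col) for col in zip(*f)]   # one value per column (Equation 3.3)
    return hpp, vpp


# Hypothetical 2 x 3 sample: zeros are background, other values are ink intensities.
sample = [[0, 5, 0],
          [3, 4, 0]]
hpp, vpp = projection_profiles(sample)
```

Both profiles sum to the same total, which is the accumulated pixel value that the text relates to the quantity of ink in the sample.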

3.4.2 Local metric - Pixel Density Variation (PDV)

The PDV metric is used to evaluate features of compression/distortion in the samples, because some effects may arise progressively during writing and can be detected from characteristics that change locally from left to right within a sample (ZHI et al., 2017).

For the calculation of the PDV metric, each image is split into cells of identical width. The upper and lower boundaries of the ink determine the height of each cell. The area 𝐴𝑖𝑗𝑘 is then calculated as the product of the height and width of the cell, with 𝑖 and 𝑗 being the subject and sample number, respectively, and 𝑘 being the cell number. The pixel density 𝜌𝑖𝑗𝑘 is calculated as the number of ink pixels 𝑃𝑖𝑗𝑘 in each cell divided by the area 𝐴𝑖𝑗𝑘, as given by Equation 3.4.

$$\rho_{ijk} = \frac{P_{ijk}}{A_{ijk}} \qquad (3.4)$$


The density 𝜌𝑖𝑗𝑘 is plotted against the cell position 𝑘 = 1, ..., 𝐾 in the sample, counting from left to right, with 𝐾 being the total number of cells.
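The PDV computation for a single sample can be sketched as below. This is one possible reading of the description, not the original code: the function name `pixel_density_variation` is hypothetical, the sample is assumed to be a binary ink mask (1 = ink), columns left over after dividing the width by the number of cells are ignored, and a cell without ink is given density 0.

```python
def pixel_density_variation(ink, n_cells):
    """Return the pixel density of each cell, from left to right (Equation 3.4)."""
    n_rows, n_cols = len(ink), len(ink[0])
    width = n_cols // n_cells  # identical width for every cell
    densities = []
    for k in range(n_cells):
        cols = range(k * width, (k + 1) * width)
        # Rows containing ink define the upper and lower boundaries of the cell.
        ink_rows = [r for r in range(n_rows) if any(ink[r][c] for c in cols)]
        if not ink_rows:
            densities.append(0.0)  # assumption: an empty cell has zero density
            continue
        height = ink_rows[-1] - ink_rows[0] + 1
        p = sum(ink[r][c] for r in ink_rows for c in cols)   # ink pixels P
        densities.append(p / (height * width))               # rho = P / A
    return densities


# Hypothetical 4 x 4 ink mask split into two cells of width 2.
mask = [[1, 1, 0, 0],
        [0, 1, 0, 0],
        [1, 0, 0, 1],
        [0, 0, 0, 0]]
pdv = pixel_density_variation(mask, n_cells=2)
```

Plotting the returned list against the cell index gives the left-to-right density curve described above.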

3.4.3 Statistical analysis of metrics

To perform a statistical analysis of the Global metrics (HPP and VPP), we have used Pearson’s correlation coefficient, which measures the linear relationship between two variables (TAYLOR, 1997). The coefficient is defined by Equation 3.5 in terms of the covariance, given a pair of random variables (𝐴, 𝐵):

$$\rho(A, B) = \frac{\mathrm{cov}(A, B)}{\sigma_A \sigma_B} \qquad (3.5)$$

where, 𝑐𝑜𝑣 is the covariance, and 𝜎𝐴 and 𝜎𝐵 are the standard deviations of A and B, respectively. The correlation coefficient matrix is defined by Equation 3.6, which is the matrix of correlation coefficients for each pairwise variable combination.

$$R = \begin{pmatrix} \rho(A, A) & \rho(A, B) \\ \rho(B, A) & \rho(B, B) \end{pmatrix} \qquad (3.6)$$

The value 𝜌(𝐴, 𝐵) lies in the range from −1 to 1. Values near −1 indicate a strong negative linear correlation; values near 0 indicate little or no linear correlation; and values near 1 indicate a strong positive linear correlation (BODDY; SMITH, 2009). Since each sample is always perfectly correlated with itself, the diagonal entries are 1. We have hidden the diagonal values (1) to give a better view of the variation of the other values.
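Equations 3.5 and 3.6 can be illustrated with the short sketch below. The function names `pearson` and `correlation_matrix` are hypothetical, and the profiles being compared are assumed to have the same length and not to be constant (a constant profile has zero standard deviation, which makes the coefficient undefined).

```python
def pearson(a, b):
    """Pearson's correlation coefficient of two equal-length sequences (Equation 3.5)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    # The 1/n factors of the covariance and standard deviations cancel out.
    return cov / (var_a * var_b) ** 0.5


def correlation_matrix(profiles):
    """Matrix of coefficients for every pair of profiles (Equation 3.6)."""
    return [[pearson(p, q) for q in profiles] for p in profiles]


# Hypothetical profiles: the second rises with the first, the third falls.
R = correlation_matrix([[1, 2, 3], [2, 4, 6], [3, 2, 1]])
```

The resulting matrix is symmetric with ones on the diagonal, which is why the diagonal can be hidden without losing information.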

3.5 Chapter summary

In this chapter, we have described the data set used and the steps necessary to reach the objective of this work, exploring some concepts from the literature that are required to better understand our proposal.

In the process of handwriting analysis, we used pre-processing techniques, including binarisation, to treat the input images, since ancient manuscripts present particular problems. For feature extraction, we chose a simple and efficient diagonal feature extraction scheme. Feature extraction describes the relevant aspects or attributes that characterise a character, so that the task of classifying it is made easier. We also used global (HPP and VPP) and local (PDV) metrics for character analysis, intended to detect changes in the pixel density of characters. Finally, the statistical analysis of the metrics was used to investigate the statistical relationship between two variables, in order to understand the correlation between different scribes as well as between different letters.


4 Experiments and results

The techniques we have presented for character recognition have the primary aim of evaluating degraded historical documents written in Middle English and a manuscript written by a modern calligrapher. Experiments with different perspectives were carried out on the data sets and will be presented in the following sections.

Section 4.1 will present the classification of letters dependent on the writer, Section 4.2 the classification of letters independent of the writer, Section 4.3 the writer identification, and Section 4.4 the analysis of pixel density, with Subsection 4.4.1 presenting the statistical analysis of correlation between the letter samples and Subsection 4.4.2 the analysis of correlation between the scribes.

4.1 Classification of letters

Figure 5 shows the accuracy obtained by the Naive Bayes, SVM and MLP classifiers in the letter classification task (‘a’, ‘e’, ‘h’ and ‘l’) for each individual scribe. The experiments show good classification accuracy for each of the three writers: the accuracy of every classifier for each scribe is greater than 92.9%.

Figure 5 – Comparison between the Naive Bayes, SVM and MLP classifiers based on letters (‘a’, ‘e’, ‘h’ and ‘l’) for each scribe (TREMULOUS, NON-TREMULOUS, and CALLIGRAPHER).

The confusion matrices for the TREMULOUS data set (Table 3) show accuracy rates of approximately 96% or higher. However, all classifiers confuse ‘l’ with ‘e’ and ‘e’ with ‘l’ (see the samples erroneously classified by all classifiers in Figure 6a, by only MLP and SVM in Figure 6b, and by Naive Bayes in Figure 6c). The letter ‘l’ has irregularities in its shape, with a rounded stroke at the top in some cases, background noise, and tremor in the writing. Also, the letter ‘e’ has degraded ink. These problems may have influenced the erroneous classification.

Naive Bayes

a e h l

a 29 1 0 0

e 0 88 0 2

h 1 0 19 0

l 0 3 0 28

SVM

a e h l

a 30 0 0 0

e 0 89 0 1

h 0 0 20 0

l 0 2 0 29

MLP

a e h l

a 30 0 0 0

e 0 89 0 1

h 0 0 20 0

l 0 1 0 30

Table 3 – Confusion matrices for the TREMULOUS data set, for the Naive Bayes, SVM and MLP classifiers. Rows indicate the actual class (‘a’, ‘e’, ‘h’, ‘l’) and columns the class assigned by the classifiers (‘a’, ‘e’, ‘h’, ‘l’).
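The overall accuracy of each classifier can be recomputed from the matrices in Table 3 as the sum of the diagonal divided by the total number of samples. The sketch below assumes that rows are actual classes and columns are assigned classes, which is consistent with the row sums (30, 90, 20 and 31) matching the number of samples per letter; the function name `overall_accuracy` is hypothetical.

```python
def overall_accuracy(cm):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


# Values copied from Table 3 (class order: a, e, h, l).
tremulous = {
    "Naive Bayes": [[29, 1, 0, 0], [0, 88, 0, 2], [1, 0, 19, 0], [0, 3, 0, 28]],
    "SVM": [[30, 0, 0, 0], [0, 89, 0, 1], [0, 0, 20, 0], [0, 2, 0, 29]],
    "MLP": [[30, 0, 0, 0], [0, 89, 0, 1], [0, 0, 20, 0], [0, 1, 0, 30]],
}
accuracies = {name: overall_accuracy(cm) for name, cm in tremulous.items()}
```

With these values, 164, 168 and 169 of the 171 samples are correctly classified by Naive Bayes, SVM and MLP, respectively.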

(a) Letter ‘e’ confounded with ‘l’ by all classifiers (Ref.: s1-e-56)

(b) Letter ‘l’ confounded with ‘e’ by SVM and MLP classifiers (Ref.: s1-l-21)

(c) Letter ‘l’ confounded with ‘e’ by Naive Bayes (Ref.: s1-l-10, s1-l-10, s1-l-19)

Figure 6 – Letter samples from the TREMULOUS scribe erroneously classified.

For the NON-TREMULOUS data set, the confusion matrices of the classifiers (Table 4) show that the letter ‘l’ presents the most significant error rate. In most cases, the classifiers confuse ‘l’ with ‘e’ (see samples in Figure 7); the letter ‘h’, on the other hand, is the best classified in this data set by the SVM and MLP. The classifiers obtained better accuracy on the TREMULOUS data set than on the NON-TREMULOUS data set. This may have been caused by dense noise in the test data set, since some data samples include parts of adjoining letters.

The CALLIGRAPHER data set (Table 5) presents better results in comparison with the confusion matrices of the other two data sets analysed. The absence of background noise may have improved the results obtained, with good results achieved for all three classifiers analysed.

Naive Bayes

a e h l

a 26 1 1 2

e 0 89 0 1

h 2 0 18 0

l 0 0 0 31

SVM

a e h l

a 26 2 1 1

e 0 88 0 2

h 0 0 20 0

l 1 5 0 25

MLP

a e h l

a 27 1 1 1

e 1 89 0 0

h 0 0 20 0

l 0 4 0 27

Table 4 – Confusion matrices for the NON-TREMULOUS data set, for the Naive Bayes, SVM and MLP classifiers. Rows indicate the actual class (‘a’, ‘e’, ‘h’, ‘l’) and columns the class assigned by the classifiers (‘a’, ‘e’, ‘h’, ‘l’).

Figure 7 – Samples of ‘l’ confounded with ‘e’ by the Naive Bayes, SVM, and MLP classifiers from the NON-TREMULOUS scribe (Ref.: s3-l-06, s3-l-13, s3-l-22, s3-l-30).

4.2 Letter classification by writer

For the letter classification by writer, we have used subsets of the letters ‘a’, ‘e’, ‘h’ and ‘l’, each formed by the samples from the TREMULOUS, NON-TREMULOUS and CALLIGRAPHER writers. Figure 8 shows the accuracy rates obtained with the classifiers analysed.

The task of differentiating between the writers TREMULOUS, NON-TREMULOUS and CALLIGRAPHER presents encouraging results; the subset ‘a’ shows lower accuracy rates of 93.8%, 88.8% and 88.8% when using the Naive Bayes, SVM and MLP classifiers, respectively.

Naive Bayes

a e h l

a 20 0 0 0

e 0 20 0 0

h 0 0 20 0

l 0 0 1 19

SVM

a e h l

a 20 0 0 0

e 0 20 0 0

h 0 0 20 0

l 0 0 0 20

MLP

a e h l

a 20 0 0 0

e 0 20 0 0

h 0 0 20 0

l 0 0 0 20

Table 5 – Confusion matrices for the CALLIGRAPHER data set, for the Naive Bayes, SVM and MLP classifiers. Rows indicate the actual class (‘a’, ‘e’, ‘h’, ‘l’) and columns the class assigned by the classifiers (‘a’, ‘e’, ‘h’, ‘l’).

Figure 8 – Comparison between the Naive Bayes, SVM and MLP classifiers based on the scribes (TREMULOUS, NON-TREMULOUS and CALLIGRAPHER) for each letter (‘a’, ‘e’, ‘h’ and ‘l’).

For the letter ‘a’ (Table 6), the classifiers confuse CALLIGRAPHER with NON-TREMULOUS (see samples incorrectly classified in Figures 9a and 9b). On the other hand, only the SVM and MLP classifiers assigned samples of the NON-TREMULOUS writer to CALLIGRAPHER (see Figure 9c). We believe this is due to the similarity between the letter shapes produced by the two scribes.

On the other hand, the classification results for the letter ‘e’ presented in the confusion matrices (see Table 7) show that most of the erroneous classifications are of TREMULOUS instead of NON-TREMULOUS and vice versa (see Figure 10). Problems in the texts of both scribes, such as background noise, may be responsible for the erroneous classification of the letter. In addition, the NON-TREMULOUS data set has dye stains, while the TREMULOUS data set has notable ink degradation.

Naive Bayes

TRE NON CAL

TRE 29 1 0

NON 1 29 0

CAL 0 3 17

SVM

TRE NON CAL

TRE 29 1 0

NON 1 25 4

CAL 0 3 17

MLP

TRE NON CAL

TRE 29 1 0

NON 2 25 3

CAL 0 3 17

Table 6 – Confusion matrices for the ‘a’ letter, for the Naive Bayes, SVM and MLP classifiers. Rows indicate the labels TREMULOUS, NON-TREMULOUS and CALLIGRAPHER from top to bottom and the columns represent the labels TREMULOUS, NON-TREMULOUS and CALLIGRAPHER from left to right.

(a) Samples of the letter ‘a’ from CALLIGRAPHER incorrectly classified as NON-TREMULOUS by all classifiers (Ref.: s4-a-01, s4-a-09)

(b) Sample of the letter ‘a’ from CALLIGRAPHER incorrectly classified as NON-TREMULOUS by SVM and MLP classifiers (Ref.: s4-a-20)

(c) Samples of the letter ‘a’ from NON-TREMULOUS incorrectly classified as CALLIGRAPHER by SVM and MLP classifiers (Ref.: s3-a-19, s3-a-22, s3-a-26)

Figure 9 – Samples of the letter ‘a’ erroneously classified by more than one classifier.

The confusion matrices of the letter ‘h’ (Table 8) show similar results for the classifiers, with encouraging accuracy levels. The distinctions in the shape of the letter may have motivated the results obtained (see Figure 11), although the documents of both the TREMULOUS and NON-TREMULOUS scribes are degraded. Besides, the TREMULOUS writer presents a visible tremor in the writing and the NON-TREMULOUS writer has a slight inclination in the writing, whereas the CALLIGRAPHER has no background noise.

Naive Bayes

TRE NON CAL

TRE 84 6 0

NON 3 87 0

CAL 0 1 19

SVM

TRE NON CAL

TRE 85 6 0

NON 5 85 1

CAL 0 0 20

MLP

TRE NON CAL

TRE 86 5 0

NON 4 87 0

CAL 0 0 20

Table 7 – Confusion matrices for the ‘e’ letter, for the Naive Bayes, SVM and MLP classifiers. Rows indicate the labels TREMULOUS, NON-TREMULOUS and CALLIGRAPHER from top to bottom and the columns represent the labels TREMULOUS, NON-TREMULOUS and CALLIGRAPHER from left to right.

(a) Sample of the letter ‘e’ from TREMULOUS incorrectly classified as NON-TREMULOUS by all classifiers (Ref.: s1-e-16)

(b) Sample of the letter ‘e’ from TREMULOUS incorrectly classified as NON-TREMULOUS by the Naive Bayes and MLP classifiers (Ref.: s1-e-04)

(c) Samples of the letter ‘e’ from TREMULOUS incorrectly classified as NON-TREMULOUS by SVM and MLP classifiers (Ref.: s1-e-10, s1-e-16, s1-e-17, s1-e-74)

(d) Sample of the letter ‘e’ from NON-TREMULOUS incorrectly classified as TREMULOUS by all classifiers (Ref.: e-02, e-04, s3-e-69)

Naive Bayes

TRE NON CAL

TRE 20 0 0

NON 1 19 0

CAL 0 1 19

SVM

TRE NON CAL

TRE 19 1 0

NON 0 20 0

CAL 0 0 20

MLP

TRE NON CAL

TRE 20 0 0

NON 0 20 0

CAL 0 0 20

Table 8 – Confusion matrices for the ‘h’ letter, for the Naive Bayes, SVM and MLP classifiers. Rows indicate the labels TREMULOUS, NON-TREMULOUS and CALLIGRAPHER from top to bottom and the columns represent the labels TREMULOUS, NON-TREMULOUS and CALLIGRAPHER from left to right.

(a) Sample of the letter ‘h’ from NON-TREMULOUS incorrectly classified as TREMULOUS by Naive Bayes classifier (Ref.: s3-h-09)

(b) Sample of the letter ‘h’ from CALLIGRAPHER incorrectly classified as NON-TREMULOUS by Naive Bayes classifier (Ref.: s4-h-05)

(c) Sample of the letter ‘h’ from TREMULOUS incorrectly classified as NON-TREMULOUS by SVM classifier (Ref.: s1-h-04)

Figure 11 – Samples of the letter ‘h’ erroneously classified

The confusion matrices for the letter ‘l’ (Table 9) present small variations between the results of the classifiers. All classifiers confuse NON-TREMULOUS with TREMULOUS (see Figures 12a, 12b and 12c), and CALLIGRAPHER with NON-TREMULOUS (see Figures 12d and 12e), and vice versa. This letter has a single stroke with a similar form in the writing of the scribes. Also, the dense noise present in the NON-TREMULOUS samples may have contributed to the results obtained.

Naive Bayes

TRE NON CAL

TRE 28 3 0

NON 2 28 1

CAL 0 1 19

SVM

TRE NON CAL

TRE 29 2 0

NON 3 26 2

CAL 0 1 19

MLP

TRE NON CAL

TRE 27 3 1

NON 0 29 2

CAL 0 1 19

Table 9 – Confusion matrices for the ‘l’ letter, for the Naive Bayes, SVM and MLP classifiers. Rows indicate the labels TREMULOUS, NON-TREMULOUS and CALLIGRAPHER from top to bottom and the columns represent the labels TREMULOUS, NON-TREMULOUS and CALLIGRAPHER from left to right.

From these results, it is possible to see that the letters present quite different behaviours: in some cases, writers are more often erroneously classified than in others, mainly NON-TREMULOUS as TREMULOUS and TREMULOUS as NON-TREMULOUS. The similarity in the writing period and the noise present in the handwritten documents of the two authors (TREMULOUS and NON-TREMULOUS) may have caused the erroneous classification between the scribes.

4.3 Classification of writers

Writer identification is a relevant task in the analysis of historical handwritten documents, since the manuscripts of the Middle Ages were usually written and copied by anonymous scribes. Thus, automating the identification of the writer of historical manuscripts can be an important contribution.

Figure 13 presents the results obtained for the writer identification. The Naive Bayes classifier has the worst performance when compared with the other classifiers analysed. On the other hand, the MLP classifier obtains the best accuracy in the identification of the writer, with an average of 89% accuracy.

The confusion matrices for the writer identification (Table 10) show that samples of all writers are classified erroneously. However, it is possible to observe a relationship between the pairs of writers TREMULOUS with NON-TREMULOUS and NON-TREMULOUS with CALLIGRAPHER. This relationship is due to some features of the writing styles of each scribe and of the documents, mainly background problems of the historical manuscripts caused by different factors.

(a) Sample of the letter ‘l’ from TREMULOUS incorrectly classified as NON-TREMULOUS by all classifiers (Ref.: s1-l-06)

(b) Sample of the letter ‘l’ from TREMULOUS incorrectly classified as NON-TREMULOUS by Naive Bayes and MLP classifiers (Ref.: s1-l-21) and by Naive Bayes and SVM classifiers (Ref.: s1-l-22)

(c) Sample of the letter ‘l’ from NON-TREMULOUS incorrectly classified as TREMULOUS by Naive Bayes classifier (Ref.: s3-l-17, s3-l-19) and by SVM classifier (Ref.: s3-l-05, s3-l-15, s3-l-26)

(d) Sample of the letter ‘l’ from CALLIGRAPHER incorrectly classified as NON-TREMULOUS by all classifiers (Ref.: s4-l-15)

(e) Sample of the letter ‘l’ from NON-TREMULOUS incorrectly classified as CALLIGRAPHER by all classifiers (Ref.: s3-l-22)

Figure 12 – Samples of the letter ‘l’ erroneously classified by more than one classifier


Figure 13 – Comparison between the Naive Bayes, SVM and MLP classifiers based on scribe (TREMULOUS, NON-TREMULOUS, and CALLIGRAPHER).

Naive Bayes

TRE NON CAL

TRE 138 25 8

NON 15 132 24

CAL 6 17 57

SVM

TRE NON CAL

TRE 151 16 4

NON 14 147 10

CAL 3 4 73

MLP

TRE NON CAL

TRE 149 18 4

NON 16 147 8

CAL 1 4 75

Table 10 – Confusion matrices for the writer identification, for the Naive Bayes, SVM and MLP classifiers. Rows indicate the labels TREMULOUS, NON-TREMULOUS and CALLIGRAPHER from top to bottom and the columns represent the labels TREMULOUS, NON-TREMULOUS and CALLIGRAPHER from left to right.
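Because the three writers contribute different numbers of samples (171, 171 and 80), it can be useful to look at the accuracy of each writer separately. The sketch below computes the per-writer rate from the MLP matrix in Table 10, assuming that rows are the actual writers (the row sums match the sample counts) and using the hypothetical function name `per_class_recall`. Reading the reported average accuracy as the mean of these per-writer rates is only one possible interpretation, and the overall fraction of correctly classified samples is computed alongside it for comparison.

```python
def per_class_recall(cm):
    """For each row (actual class), the fraction of its samples on the diagonal."""
    return [row[i] / sum(row) for i, row in enumerate(cm)]


# MLP values copied from Table 10 (order: TREMULOUS, NON-TREMULOUS, CALLIGRAPHER).
writer_mlp = [[149, 18, 4],
              [16, 147, 8],
              [1, 4, 75]]

recalls = per_class_recall(writer_mlp)
mean_recall = sum(recalls) / len(recalls)
overall = sum(writer_mlp[i][i] for i in range(3)) / sum(map(sum, writer_mlp))
```

The mean of the per-writer rates weights each writer equally, whereas the overall fraction is dominated by the two larger data sets.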

4.4 Analysis of metrics on characters samples

The Global metrics (HPP and VPP) measure the pixel density of the characters. The metrics indicate the amount of ink deposited in the samples, based on the number of pixels accumulated. The HPP denotes the sum of the pixels along the rows of the image and the VPP the sum along the columns.

For the letter ‘a’, the samples used are shown in Figures 14a, 14b, and 14c for TREMULOUS, NON-TREMULOUS, and CALLIGRAPHER, respectively. We can observe that the curves of the HPP and VPP metrics have notably distinct shapes for the different writers (see Figure 15). It is also important to notice that TREMULOUS has greater dissimilarity between the curves of each sample in HPP and VPP than the other scribes. NON-TREMULOUS has small irregularities between rows 60 and 100, around the letter centre, due to changes in the stroke thickness and shape of the samples. CALLIGRAPHER behaves uniformly, with irregular peaks in the VPP metric but with a similar curve shape.

(a) Letter samples from the TREMULOUS scribe (Ref.: s1-a-1, s1-a-2, s1-a-3, s1-a-4, s1-a-5)

(b) Letter samples from the NON-TREMULOUS scribe (Ref.: s3-a-1, s3-a-2, s3-a-3, s3-a-4, s3-a-5)

(c) Letter samples from the CALLIGRAPHER scribe (Ref.: s4-a-1, s4-a-2, s4-a-3, s4-a-4, s4-a-5)

Figure 14 – The letter ‘a’ samples used to analyse character features

For the letter ‘e’, the results are shown in Figure 17, and the samples used are presented in Figures 16a, 16b, and 16c for TREMULOUS, NON-TREMULOUS, and CALLIGRAPHER, respectively. The samples of the TREMULOUS data set show great dissimilarity amongst themselves, with small discrepancies in the curves, mainly in the VPP results, due to writing failures and degradation in the documents. The NON-TREMULOUS and TREMULOUS have peaks in the VPP results at the end of the curve, because the letter ‘e’ from NON-TREMULOUS has a stroke on the upper right side in some samples and the CALLIGRAPHER has a long stroke at the bottom of some letters.

The samples of the letter ‘h’ are presented in Figures 18a, 18b, and 18c, and those of the letter ‘l’ in Figures 20a, 20b, and 20c, for TREMULOUS, NON-TREMULOUS, and CALLIGRAPHER, respectively. In comparison with the letters ‘a’ and ‘e’, the letters ‘h’ and ‘l’ show less dissimilarity between samples from TREMULOUS with the HPP metric (see Figures 19 and 21). However, the VPP metric has highly irregular curves for all writers, mainly the TREMULOUS and NON-TREMULOUS scribes.

In general, the character samples from each writer behave similarly to one another under the HPP metric. On the other hand, the results

[Figure 15 plots, six panels: HPP and VPP curves for the TREMULOUS, NON-TREMULOUS and CALLIGRAPHER samples of the letter ‘a’.]

Figure 15 – HPP and VPP metrics values for samples of the scribes TREMULOUS, NON-TREMULOUS and CALLIGRAPHER of the letter ‘a’. The x-axis represents the height in HPP and width in VPP of the character in pixels and y-axis is the sum of pixel values belonging to the character.
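As a minimal sketch of how such curves can be obtained, the code below assumes that HPP and VPP are the row-wise and column-wise sums described in the caption, and that background pixels are zero so that only character pixels contribute. The function name `projection_profiles` and the toy image are hypothetical and are not taken from the experiments.

```python
import numpy as np


def projection_profiles(image):
    """Return (hpp, vpp) for a 2-D array of character pixel values.

    Assumption: background pixels are 0, so each sum only counts pixels
    belonging to the character.
    """
    pixels = np.asarray(image, dtype=np.int64)
    if pixels.ndim != 2:
        raise ValueError("expected a 2-D grayscale or binary image")
    hpp = pixels.sum(axis=1)  # one value per row, length = image height
    vpp = pixels.sum(axis=0)  # one value per column, length = image width
    return hpp, vpp


# Toy 4x3 binary "character" (a vertical stroke with a crossbar).
toy = np.array([
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
    [0, 1, 0],
])
hpp, vpp = projection_profiles(toy)
```

For the toy image, `hpp` has one entry per row and `vpp` one entry per column, which matches the x-axes of the plots (height for HPP, width for VPP).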

(a) Letter samples from the TREMULOUS scribe (Ref.: s1-e-1,s1-e-2,s1-e-3,s1-e-4,s1-e-5)

(b) Letter samples from the NON-TREMULOUS scribe (Ref.: s3-e-1,s3-e-2,s3-e-3,s3-e-4,s3-e-5)

(c) Letter samples from the CALLIGRAPHER scribe (Ref.: s4-e-1,s4-e-2,s4-e-3,s4-e-4,s4-e-5)

Figure 16 – The letter ‘e’ samples used to analyse character features

[Figure 17 plots, six panels: HPP and VPP curves for the TREMULOUS, NON-TREMULOUS and CALLIGRAPHER samples of the letter ‘e’.]

Figure 17 – HPP and VPP metrics values for samples of the scribes TREMULOUS, NON-TREMULOUS and CALLIGRAPHER of the letter ‘e’. The x-axis represents the height in HPP and width in VPP of the character in pixels and y-axis is the sum of pixel values belonging to the character.

(a) Letter samples from the TREMULOUS scribe (Ref.: s1-h-1,s1-h-2,s1-h-3,s1-h-4,s1-h-5)

(b) Letter samples from the NON-TREMULOUS scribe (Ref.: s3-h-1,s3-h-2,s3-h-3,s3-h-4,s3-h-5)

(c) Letter samples from the CALLIGRAPHER scribe (Ref.: s4-h-1,s4-h-2,s4-h-3,s4-h-4,s4-h-5)

Figure 18 – The letter ‘h’ samples used to analyse character features

[Figure 19 plots, six panels: HPP and VPP curves for the TREMULOUS, NON-TREMULOUS and CALLIGRAPHER samples of the letter ‘h’.]

Figure 19 – HPP and VPP metrics values for samples of the scribes TREMULOUS, NON-TREMULOUS and CALLIGRAPHER of the letter ‘h’. The x-axis represents the height in HPP and width in VPP of the character in pixels and y-axis is the sum of pixel values belonging to the character.

(a) Letter samples from the TREMULOUS scribe (Ref.: s1-l-1,s1-l-2,s1-l-3,s1-l-4,s1-l-5)

(b) Letter samples from the NON-TREMULOUS scribe (Ref.: s3-l-1,s3-l-2,s3-l-3,s3-l-4,s3-l-5)

(c) Letter samples from the CALLIGRAPHER scribe (Ref.: s4-l-1,s4-l-2,s4-l-3,s4-l-4,s4-l-5)

Figure 20 – The letter ‘l’ samples used to analyse character features

