Visual Representation Learning for Document Image Recognition
[From the List of Figures, Figure 4.8: Overview of the proposed mPOG descriptor extraction on an image part. Recognition is performed on the Seq2Seq branch and QbS is performed using the proposed autoencoder module.]

Document Analysis and Recognition

Text recognition can rightly be considered the most crucial task of the document processing field, and the existence of many different writing styles (between different people or even between eras) and languages adds significant complexity to the problem. A keyword spotting system returns a ranked list of word images according to their degree of similarity to the query.
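
For illustration, a minimal sketch of this ranking step is given below, assuming the word images have already been encoded as fixed-length descriptors (here replaced by random vectors):

```python
import numpy as np

def rank_by_similarity(query_desc, word_descs):
    """Return word-image indices sorted by ascending Euclidean distance to the query."""
    dists = np.linalg.norm(word_descs - query_desc, axis=1)
    return np.argsort(dists)

# Toy usage: 5 word images with 16-dimensional descriptors (illustrative values).
rng = np.random.default_rng(0)
words = rng.normal(size=(5, 16))
query = words[2] + 0.01 * rng.normal(size=16)   # a near-duplicate of word 2
print(rank_by_similarity(query, words))          # word 2 should be ranked first
```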

Contributions and Thesis Structure

More importantly, we develop a deep learning approach that combines word recognition and word spotting, achieving state-of-the-art results in both tasks. This approach performs well across different settings and datasets, including the compression of the keyword spotting network presented in the previous chapter.

Neural Networks

In practice, RNNs are stacked in several layers, i.e. the output sequence of the first RNN is fed as the input sequence to the second RNN, and so on. We define a loss function L(ŷi, yi), which measures the closeness of the prediction ŷi to the required target yi.
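
A minimal sketch of these two ideas, assuming PyTorch and an illustrative per-timestep classification task (the sizes and class count are not taken from the thesis):

```python
import torch
import torch.nn as nn

# Stacked (two-layer) RNN: the output sequence of the first layer is fed as the
# input sequence of the second layer (num_layers=2).
rnn = nn.LSTM(input_size=32, hidden_size=64, num_layers=2, batch_first=True)
head = nn.Linear(64, 10)            # per-timestep class scores (10 classes, illustrative)
loss_fn = nn.CrossEntropyLoss()     # the loss L(y_hat, y) comparing prediction and target

x = torch.randn(8, 20, 32)          # batch of 8 sequences, 20 steps, 32 features
y = torch.randint(0, 10, (8, 20))   # per-timestep targets
out, _ = rnn(x)                     # out: (8, 20, 64)
y_hat = head(out)                   # (8, 20, 10)
loss = loss_fn(y_hat.reshape(-1, 10), y.reshape(-1))
loss.backward()
```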

Handwritten Text Recognition

Final recognition is performed by a Viterbi-like decoding process called CTC decoding, which takes into account the properties of CTC (e.g., an extra blank character used to separate consecutive identical characters). The simplest form of a language model is the frequency of occurrence of character n-grams in specific corpora, which acts as a prior on the n-grams proposed by the visual recognition step.
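
The greedy (best-path) variant of CTC decoding can be sketched as follows; a full decoder would add beam search and, optionally, the character n-gram language model mentioned above:

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Greedy CTC decoding: collapse consecutive repeats, then drop the blank symbol.

    frame_labels: per-frame argmax class indices from the network output.
    """
    decoded = []
    prev = None
    for lbl in frame_labels:
        if lbl != prev and lbl != blank:
            decoded.append(lbl)
        prev = lbl
    return decoded

# Example: frames "a a - a b b" (blank = 0) decode to "a a b", not "a b".
print(ctc_greedy_decode([1, 1, 0, 1, 2, 2]))   # -> [1, 1, 2]
```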

Keyword Spotting

A different taxonomy of keyword spotting methods relates to the existence of a training phase. The most characteristic method of the second category (feature-set construction) is the work of Rath and Manmatha [141].

Projections of Oriented Gradients (POGs)

The calculation of the directional gradients Gx and Gy of the image I(x, y), along the x-axis (horizontal) and the y-axis (vertical), respectively, is performed using the following filter kernels: [−1 0 1] and [−1 0 1]T. (xc, yc) indicates the center of the image and θ indicates the chosen projection angle. To evaluate the efficiency of the descriptor while changing the number of complex Fourier coefficients (nc), a linear SVM classifier is used for the CEDAR (only for the merged scenario), CIL and GRPOLY-DB (only SC-2) datasets.
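
The gradient computation step can be sketched as follows, assuming a grayscale image array and scipy convolution as the tooling (not the thesis implementation):

```python
import numpy as np
from scipy.ndimage import convolve

def directional_gradients(image):
    """Compute Gx and Gy with the 1-D kernels [-1 0 1] and [-1 0 1]^T."""
    kx = np.array([[-1, 0, 1]], dtype=float)        # horizontal kernel
    ky = kx.T                                        # vertical kernel
    gx = convolve(image.astype(float), kx, mode="nearest")
    gy = convolve(image.astype(float), ky, mode="nearest")
    magnitude = np.hypot(gx, gy)                     # gradient magnitude
    orientation = np.arctan2(gy, gx)                 # gradient orientation
    return gx, gy, magnitude, orientation
```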

The classification results for the two scenarios of the GRPOLY-DB database are shown in Table 3.3.

Discovering Character Classes: A User-in-the-Loop Case

The workflow of the proposed methodology is presented in Figure 3.6, and its components are described in detail in the following subsections. Specifically, we define a set of simple user-selectable actions to help improve the existing clustering. This improvement takes effect in the next iteration of the proposed system, after the user's actions have been registered.

The final stage of the proposed system is character recognition according to the generated character classes.

Word Image Representation using Horizontal Zoning

Normalization of the word image to a fixed size is not necessary, because the POG descriptor is size invariant. Moreover, we use a robust regression method to find the main zone of the word. Height normalization: we place the main zone at the center of the generated normalized image (along the y-axis), which benefits the global nature of the proposed feature extraction scheme.

The final descriptor of each word image is the concatenation of the k generated POG feature vectors, one per horizontal zone.
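
A minimal sketch of this zoning-and-concatenation idea follows; pog_descriptor() is a hypothetical stand-in for the actual POG extraction, here replaced by a crude column profile:

```python
import numpy as np

def pog_descriptor(zone):
    """Hypothetical stand-in for the POG descriptor of one horizontal zone."""
    return zone.mean(axis=0)   # placeholder column-wise profile, not the real mPOG features

def zoned_descriptor(word_image, k=3):
    """Split the word image into k horizontal zones and concatenate their feature vectors."""
    zones = np.array_split(word_image, k, axis=0)       # split along image height
    return np.concatenate([pog_descriptor(z) for z in zones])

word = np.random.rand(40, 120)       # toy word image (height x width)
desc = zoned_descriptor(word, k=3)
print(desc.shape)                    # (360,) = 3 zones x 120-dim placeholder vectors
```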

Sequential Matching

The angle with the maximum value of Jθ defines the slope of the main zone. On the contrary, the proposed matching algorithm (SM) corresponds to the formulation and constraints of the specific problem. It is clear that the retrieval time depends linearly on the percentage of the selected subset.

Nevertheless, the retrieval time of the proposed method is still sufficiently low for a real-time KWS application.
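
As a concrete illustration of the slope-selection step described above, the sketch below chooses the angle that maximizes a projection-based criterion Jθ; here Jθ is assumed to be the variance of the row-wise projection profile of the rotated image, which may differ from the exact criterion used in the thesis:

```python
import numpy as np
from scipy.ndimage import rotate

def estimate_slope(binary_word, angles=np.arange(-10, 10.5, 0.5)):
    """Pick the angle maximizing an assumed projection criterion J_theta."""
    best_angle, best_j = 0.0, -np.inf
    for theta in angles:
        rotated = rotate(binary_word.astype(float), theta, reshape=False, order=1)
        profile = rotated.sum(axis=1)          # horizontal (row-wise) projection profile
        j_theta = profile.var()                # assumed criterion: profile variance
        if j_theta > best_j:
            best_angle, best_j = theta, j_theta
    return best_angle
```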

Manifold Embedding of Word Representations

Manifold embedding of word representation datasets provides high-quality embeddings, even when the dimensionality of the embedding is lower than the intrinsic dimensionality of the data. Most manifold embedding methods do not support adding a new sample to an already learned embedding. Parametric approaches assume that the learned embedding can be modeled as a (non-linear) combination of the initial data together with a set of parameters.

Although one can appreciate the effectiveness of manifold embedding in reducing the descriptor size without a significant impact on retrieval performance, the presented system has some notable drawbacks that mainly derive from the Isomap embedding, such as its sensitivity to the choice of embedding dimensionality.

Out-of-sample (OOS) Extension of t-SNE

Retrieval is performed by sorting the Euclidean distances between the generated query feature vector and the already embedded word features. We assume a set of points xi which correspond to the descriptors of the word images for the KWS task, and whose embedding is the result of the t-SNE optimization. In order to further promote the simplicity of the update equation and the rate of convergence, we choose to omit the corresponding terms in Eq.

The above formulation only requires the distances of the new point x from the existing points xi, i.e.
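
A minimal sketch of such an out-of-sample embedding step is given below. It assumes a fixed Gaussian bandwidth instead of a per-point perplexity search and plain gradient descent on the KL divergence; the thesis derivation may use a simplified closed-form update instead:

```python
import numpy as np

def oos_tsne_embed(x_new, X, Y, sigma=1.0, lr=0.1, iters=200):
    """Embed a new point into an existing t-SNE embedding (sketch).

    X: high-dimensional descriptors (n x D); Y: their t-SNE embedding (n x d).
    """
    # High-dimensional affinities of the new point to the existing points.
    d2 = np.sum((X - x_new) ** 2, axis=1)
    p = np.exp(-d2 / (2 * sigma ** 2))
    p /= p.sum()

    # Initialize at the embedding of the nearest neighbour, then refine.
    y = Y[np.argmin(d2)].copy()
    for _ in range(iters):
        diff = y - Y                                       # (n x d)
        q_num = 1.0 / (1.0 + np.sum(diff ** 2, axis=1))    # Student-t kernel
        q = q_num / q_num.sum()
        grad = 4.0 * np.sum(((p - q) * q_num)[:, None] * diff, axis=0)
        y -= lr * grad
    return y
```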

Experimental Evaluation

This approach assumes that the local structure of the embedding space is defined only by pairwise similarities between the initial points [153]. To highlight this behavior, 90% of the points are randomly selected to estimate the parameter matrix A. The proposed OOS extension is used to obtain the low-dimensional embedding of the query image.

In addition, the performance of state-of-the-art KWS methods is given for comparison.

Deep Features: Extracting Features from CNNs

In other computer vision tasks, deep features have often led to superior results compared to the standard use of the network output [96], as they are able to capture more abstract features of the input [193]. For example, the dimensionality of the features generated by the spp layer (closest to the network input) is 10,752. The evaluated parameters were (a) extracting deep features from different layers of the network (we compare spp, fc1, fc2), (b) applying (or not) t-SNE to produce low-dimensional embeddings, and (c) using different dissimilarity measures (we compare BC, L2 and L2-normalized).
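
A minimal sketch of harvesting deep features from an intermediate layer via a forward hook is shown below; a torchvision ResNet-18 stands in for the actual spotting network, and the layer name is illustrative:

```python
import torch
from torchvision import models

model = models.resnet18(weights=None).eval()
features = {}

def save_output(name):
    def hook(module, inputs, output):
        features[name] = output.detach().flatten(1)   # (batch, dim) descriptor
    return hook

model.avgpool.register_forward_hook(save_output("penultimate"))

with torch.no_grad():
    model(torch.randn(4, 3, 224, 224))                # dummy batch of word images
print(features["penultimate"].shape)                  # torch.Size([4, 512])
```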

In Table 6.2, we list the best-performing deep features and compare their performance against descriptors extracted from the output layer.

An Exploration on Architecture and Training Strategies

The reasoning behind these architectural variants is that different designs will perform well for different types of input. Up to this point, we have examined the performance of various spatial pooling schemes and argued that using fixed-size images does not affect performance. Therefore, a further question arises: is an adaptive spatial pooling layer necessary when the input images are of the same size?

Baseline models, PHOC+bare architecture and PHOC+zoning, are used as an indication of the capabilities of the PHOCNet architecture.
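
To make the adaptive pooling question above concrete, the sketch below (PyTorch, illustrative grid size) shows why adaptive pooling matters only for variable-size inputs: it produces a fixed output grid regardless of the width of the feature map.

```python
import torch
import torch.nn as nn

adaptive_pool = nn.AdaptiveMaxPool2d(output_size=(3, 6))   # fixed 3x6 output grid

small = torch.randn(1, 64, 12, 30)    # feature map of a short word image
large = torch.randn(1, 64, 12, 90)    # feature map of a long word image
print(adaptive_pool(small).shape)     # torch.Size([1, 64, 3, 6])
print(adaptive_pool(large).shape)     # torch.Size([1, 64, 3, 6]) -- same grid
```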

Compressed Deep Features

For this purpose, we use the response of the final convolutional layer as a deep feature. The N vectors produced as output by the modified TPP layer are fed to the fully connected layers and N output vectors are generated. An important observation is that training the compact model from random initialization converges to suboptimal solutions.

Both networks, extended and compact, achieve remarkable results that are on par with the state of the art (the extended network outperforms the reported methods).
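
As a rough illustration of the "N vectors through a shared fully connected layer" idea described above, the sketch below uses adaptive average pooling over the width as a stand-in for the modified TPP layer; channel counts and N are illustrative:

```python
import torch
import torch.nn as nn

class ColumnwiseDescriptors(nn.Module):
    """Pool the final conv feature map into N column vectors and pass each
    through a shared fully connected layer, yielding N output vectors."""

    def __init__(self, channels=256, n_vectors=5, out_dim=128):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d((1, n_vectors))   # collapse height, keep N columns
        self.fc = nn.Linear(channels, out_dim)              # shared across the N vectors

    def forward(self, feat_map):                             # feat_map: (B, C, H, W)
        pooled = self.pool(feat_map).squeeze(2)              # (B, C, N)
        return self.fc(pooled.transpose(1, 2))               # (B, N, out_dim)

out = ColumnwiseDescriptors()(torch.randn(2, 256, 8, 40))
print(out.shape)                                             # torch.Size([2, 5, 128])
```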

WSRNet: Joint Spotting and Recognition

The output of the 1-D CNN is of size w × nclasses, where nclasses is the number of possible character classes. The decoder network, given a hidden vector and the previous element of the sequence, predicts the next element. The Seq2Seq branch is combined with the rest of the network through a multi-task loss.
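
A minimal sketch of such a multi-task combination, pairing a CTC branch with a Seq2Seq (cross-entropy) branch, is given below; the weighting, padding convention and tensor shapes are illustrative assumptions, not the thesis values:

```python
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
ce_loss = nn.CrossEntropyLoss(ignore_index=-100)

def multi_task_loss(ctc_logits, seq_logits, targets, target_lens, lambda_seq=0.5):
    """Weighted sum of the CTC branch loss and the Seq2Seq branch loss.

    ctc_logits: (w, batch, nclasses) scores from the 1-D CNN branch.
    seq_logits: (batch, max_len, nclasses) scores from the Seq2Seq decoder.
    targets:    (batch, max_len) character indices, padded with -100.
    """
    w, batch, _ = ctc_logits.shape
    input_lens = torch.full((batch,), w, dtype=torch.long)
    ctc_targets = targets.clone()
    ctc_targets[ctc_targets == -100] = 0   # CTC wants a dense target tensor; padding is ignored
    l_ctc = ctc_loss(ctc_logits.log_softmax(2), ctc_targets, input_lens, target_lens)
    l_seq = ce_loss(seq_logits.transpose(1, 2), targets)
    return l_ctc + lambda_seq * l_seq
```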

If we constrain the output of the character encoder to be similar to the output of the visual encoder, e.g.

Feature-based Extension to Line-Level Spotting

We follow a simple procedure for assigning the average character width (as a label) to each image. The convolutional part of the PHOC estimator network (fc, see Section 4) is applied to the image transformed in step 1. We experimented on the impact of the pooling operation on system performance by evaluating three different strategies at the adaptive pooling layer.

Specifically, using the width estimator, we obtain segments of the convolutional feature map that correspond to the width of the query.
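
A minimal sketch of this segment extraction step is shown below: a window of the estimated query width slides along the line-level convolutional feature map; the window width and stride are illustrative:

```python
import torch

def sliding_segments(line_feat_map, query_width, stride=1):
    """Slide a query-width window along a line-level conv feature map.

    line_feat_map: (C, H, W) feature map of a whole text line.
    query_width:   expected width (in feature-map columns) from the width estimator.
    Returns a list of (start_column, segment) pairs; each segment is (C, H, query_width).
    """
    segments = []
    _, _, w = line_feat_map.shape
    for start in range(0, w - query_width + 1, stride):
        segments.append((start, line_feat_map[:, :, start:start + query_width]))
    return segments

segs = sliding_segments(torch.randn(256, 8, 120), query_width=20, stride=5)
print(len(segs), segs[0][1].shape)     # 21 torch.Size([256, 8, 20])
```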

Related Work

The output of the CRC layer is a concatenation of the generated hidden states, which can be considered as a sequence generated by the underlying RNN. Specifically, the output feature map of the CRC layer is formed by concatenating the hidden state segments {hi}. By default, we use 3 × 3 convolutional filters to process the input and the hidden state at each step of the CRC layer.

In this section, we first evaluate several options regarding the design of the CRC layer (see section 7.2.1).
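
As a generic illustration of the recurrence described above (3 × 3 convolutions on both the input and the hidden state, with the per-step hidden segments concatenated into an output map), the sketch below implements a plain convolutional-recurrent step; it is not the exact CRC formulation:

```python
import torch
import torch.nn as nn

class ConvRecurrentSketch(nn.Module):
    """Generic convolutional-recurrent step over width-wise segments."""

    def __init__(self, channels=64):
        super().__init__()
        self.conv_x = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv_h = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, segments):                      # list of (B, C, H, w_seg) slices
        h = torch.zeros_like(segments[0])
        outputs = []
        for x in segments:
            h = torch.tanh(self.conv_x(x) + self.conv_h(h))   # recurrent update
            outputs.append(h)
        return torch.cat(outputs, dim=3)              # concatenate hidden segments {h_i}

feat = torch.randn(2, 64, 8, 40)
segments = list(torch.split(feat, 8, dim=3))          # five width-wise segments
out = ConvRecurrentSketch()(segments)
print(out.shape)                                      # torch.Size([2, 64, 8, 40])
```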

Weight Pruning: Inducing Sparsity Over Weights

Therefore, we can define the overall sparsity of the network by combining the layer-wise sparsities into an additional loss function that is optimized along with the task-related loss. In practice, we only need to calculate the standard deviation σ of the weights to find the appropriate bound b. Nevertheless, to fully describe the training process of the proposed work, we should define the derivatives ∂w̃i/∂wi and ∂w̃i/∂b of the effective (pruned) weights w̃i.

The gradient is minimized when either the majority of the weights are unpruned (corresponding to a low bound b) or a well-performing model already exists and no further action is needed (∂L/∂w̃i → 0).
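
The sketch below illustrates the idea with hard magnitude pruning at a bound b tied to the standard deviation of each layer's weights, plus a sparsity term to be added to the task loss; the thesis uses a differentiable (soft) pruning function so that both wi and b receive gradients, which this simplified version omits:

```python
import torch
import torch.nn as nn

def prune_mask(weight, k=1.0):
    """Magnitude pruning mask with bound b = k * sigma (sigma: std of the weights)."""
    b = k * weight.std()
    return (weight.abs() >= b).float(), b

def sparsity_loss(model):
    """Sum over conv/linear layers of the fraction of surviving (unpruned) weights,
    to be added to the task-related loss."""
    total = 0.0
    for module in model.modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            mask, _ = prune_mask(module.weight)
            total = total + mask.mean()
    return total

model = nn.Sequential(nn.Conv2d(1, 8, 3), nn.Flatten(), nn.Linear(8 * 30 * 30, 10))
print(float(sparsity_loss(model)))   # sum of surviving-weight fractions over the two layers
```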

KWS using Compressed Models

The training scheme is exactly the same, with the addition of the extra sparsity loss. Experimental results are summarized in Tables 7.11 and 7.12, where sparse WSRNet instances are evaluated against state-of-the-art techniques for both word recognition and word spotting. The sparse version uses only 0.7 million parameters for the CNN backbone, achieving a total sparsity of 87.6%.

The results show little or no performance loss for both tasks (recognition and spotting), remaining competitive with state-of-the-art approaches despite the considerable compression.

Summary of Main Contributions

Exploring deep features: Moving on to the deep learning research area, we explored several aspects of deep learning (architecture, training strategies) under the lens of keyword spotting, in order to generate robust representations of word images from convolutional networks. This idea was the main motivation behind the deep learning methodology we developed, called WSRNet, which can simultaneously perform spotting and recognition and achieve state-of-the-art results in both tasks. This vector is then decoded by the Decoder module into the target character sequence, i.e.

Line-level KWS: Most existing deep learning KWS techniques assume that a word segmentation step has been performed beforehand.

Future Directions
