Robust Recognition of Chinese Text from Cellphone-acquired Low-quality Identity Card Images Using Convolutional Recurrent Neural Network

Automatically reading text from an identity (ID) card image has a wide range of social uses. In this paper, we propose a novel method for Chinese text recognition from ID card images taken by cellphone cameras. The paper has two main contributions: (1) A synthetic data engine based on a conditional generative adversarial network is designed to generate millions of synthetic ID card text line images, which retain the inherent template pattern of ID card images while preserving the diversity of the synthetic data. (2) An improved convolutional recurrent neural network (CRNN) is presented to increase Chinese text recognition accuracy, in which DenseNet replaces the VGGNet architecture to extract more sophisticated spatial features. The proposed method is evaluated on more than 7000 real ID card text line images. The experimental results demonstrate that the improved CRNN model trained only on the synthetic dataset can increase the recognition accuracy of Chinese text in cellphone-acquired low-quality images. Specifically, compared with the original CRNN, the average character recognition accuracy (CRA) is increased from 96.87 to 98.57% and the line recognition accuracy (LRA) is increased from 65.92 to 90.10%.


Introduction
Identity (ID) cards are a kind of legal certificate to prove the residential ID of the holder in China and are widely used in all aspects of modern social life. It is necessary to input ID card information when doing business involving the government, public security, banking, securities, insurance, taxation, and so forth. Manually inputting ID card information is inefficient and prone to errors, and it is not possible to input unknown words. It will greatly improve the work efficiency and service level if ID card information can be read automatically.
The most commonly used device for reading ID card information is an ID card reader, which is based on the induction principle of magnetic cards and RFID technology. The device can quickly and accurately read the information stored in a second-generation ID card chip, but this technology needs close contact between the ID card and the card reader. In recent years, with the integration of internet technology and traditional industries in China, doing business online is becoming increasingly popular. To read customer information automatically from uploaded ID card images, optical character recognition (OCR) technology is needed.
OCR refers to using optical equipment to obtain images containing text, and then converting the text in the image into computer-readable and editable character codes through digital image processing and pattern recognition methods. ID card images can be obtained from a scanner or a camera. The OCR technology applied to scanned ID card images is mature, and the recognition accuracy has reached over 99%. (1) However, recognizing text from camera-acquired low-quality ID card images is still a challenging task.
OCR can be divided into two stages: text detection and text recognition. The goal of text detection is to produce segmentations or bounding boxes of text in the whole image, while text recognition aims at converting a cropped text image into a text string. This paper focuses only on text recognition. An ID card contains Chinese characters, English letters, numbers, and punctuation marks, which are printed horizontally in lines. A convolutional recurrent neural network (CRNN) is the most popular model for recognizing regular text owing to its capability of achieving competitive results with relatively few parameters. (2) The architecture of a CRNN consists of three components: convolutional layers, recurrent layers, and a transcription layer. The convolutional layers automatically extract a sequence of features from each input image, the recurrent layers predict a label distribution for each frame in the feature sequence, and the transcription layer translates the per-frame predictions into the final label sequence. (3) However, the CRNN was originally designed for English character recognition. Compared with the 52 English characters (i.e., 26 lower-case and 26 upper-case letters), there are thousands of Chinese characters, including more than 6000 commonly used ones. Furthermore, many Chinese characters appear similar, e.g., "日" and "曰", "土" and "士", and "治" and "冶". These differences call for a more complicated model that can extract more sophisticated structural features to recognize Chinese characters. The first contribution of this paper is that a novel convolutional neural network (CNN) is introduced into a CRNN model to replace the original convolutional layers for the extraction of more sophisticated structural features in Chinese characters.
The supervised training of a large model such as a CRNN, which contains millions of parameters, requires a very large amount of labeled training data. Owing to the privacy associated with ID cards, it is impossible to build a large-scale training dataset consisting of real ID card images except for public security organizations. Synthetic datasets provide detailed ground-truth annotations, which are cheap and scalable alternatives to annotating images manually. They have been widely used to learn scene text recognition models (4,5) and scene text detection models. (6) The second contribution of this paper is that a novel ID card text image generator (G) based on a conditional generative adversarial network (cGAN) named pix2pix (7) is proposed, which is capable of emulating ID card text images in a natural environment in the case of a small number of real ID card images.

Related Works
Text recognition methods can be broadly divided into three categories: character-based, word-based, and sequence-based methods.
Character-based recognition methods generally consist of three steps: character detection, character recognition, and character combination. Wang et al. used random ferns and a histogram of oriented gradient (HOG) features to detect characters, then found an optimal configuration of a particular word via a pictorial structure. (8) Mishra et al. detected character candidates using sliding windows and integrated both bottom-up and top-down cues in a unified conditional random field (CRF) model. (9) Bissacco et al. used a neural network classifier acting on the HOG features of the segments as scores to find the best combination of segments using beam search. (10) Jaderberg et al. used a combination of a binary text/no-text classifier, a character classifier, and a bigram classifier densely computed across the word image as cues to a Viterbi scoring function in the context of a fixed lexicon. (11) Character-based recognition methods require robust and accurate character detection and recognition, otherwise the word alignment will lead to incorrect results due to error accumulation from lower to higher levels.
Word-based recognition methods treat each word image as a whole without requiring character detection and recognition. Goel et al. converted the word recognition task into a problem of retrieving the best match from a lexicon image set with a weighted dynamic time warping approach. (12) Almazán et al. embedded word images and word labels into a common Euclidean space, and used embedding vectors to match images and labels. (13) Jaderberg et al. treated text recognition as an image classification problem. Each class corresponded to one English word in a pre-defined large dictionary composed of around 90k words. (14) However, lexicon-driven word recognition methods lack flexibility and cannot recognize a rarely occurring word that is not included in the lexicon.
Sequence-based recognition methods regard text recognition as an image-based sequence recognition problem, where images and texts are separately encoded as patch and character sequences. Su and Lu extracted a sequential image representation, which is a sequence of HOG descriptors, and predicted the corresponding character sequence with a recurrent neural network (RNN). (15) Shi et al. proposed an end-to-end neural network architecture that combined CNN and RNN for visual feature representation, then the connectionist temporal classification (CTC) loss (16) was combined with the RNN outputs to calculate the conditional probability between the predicted and target sequences. (3) Inspired by the sequence-to-sequence framework for machine translation, (17) Lee and Osindero used a recursive RNN to learn broader contextual information and applied an attention-based decoder for sequence generation. (18) Cheng et al. proposed a focus mechanism to eliminate the attention drift to improve the recognition performance of regular text. (19) Bai et al. proposed an edit probability metric to handle the misalignment between the ground-truth string and the attention's output sequence of a probability distribution. (20) Both CTC and encoder-decoder frameworks were originally designed for 1D sequential input data, and therefore applied to the recognition of straight and horizontal text, which can be encoded into a sequence of feature frames without losing important information. In contrast to CTC, the decoder module of the encoder-decoder framework is an implicit language model, so it can incorporate more linguistic priors. For the same reason, the encoder-decoder framework requires a larger training dataset with a larger vocabulary. Otherwise, the model may degenerate when reading words that are not seen during training. In contrast, CTC is less dependent on language models and has a better character-to-pixel alignment. Therefore, it is potentially better on languages such as Chinese and Japanese that have a large character set. (2)

Synthetic ID Card Text Line Image
There is a standard template for Chinese ID cards, such as the font, size, spacing, and color. We construct a corpus based on the content of Chinese ID cards. So that the corpus is similar to the Chinese ID card text distribution, the text of names is randomly selected from a Chinese name corpus, (21) the text of addresses comes from a random combination of the different levels of administrative divisions in a China area corpus, (22) and the texts of gender, nationality, date of birth, and ID card number are randomly selected from their value domains. A punctuation mark is inserted between texts of different contents. Some uncommon characters are supplemented to mitigate the problem of imbalanced samples.
The process of generating a synthetic ID card text line image is shown in Fig. 1. Ten consecutive characters are extracted from any position in the corpus to generate a binary text line image with size 32 × 280. Next, the binary text line image is distorted with a random, full perspective transformation, simulating the 3D world. Because the input image size of G is fixed at 256 × 256, the binary text line image and its seven duplicates are mosaicked into one image and resized to 256 × 256. The synthetic ID card text image output from G is split into eight identical sub-images from top to bottom, one of which is selected as the synthetic ID card text line image and resized to 32 × 280. Finally, Gaussian noise, out-of-focus blur, and so forth, are added to the synthetic ID card text line image with random intensity.
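The size bookkeeping in this pipeline can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: images are represented as nested lists of pixel rows, the function names (`sample_text`, `mosaic`, `split`) are hypothetical, and the perspective distortion, resizing, and G itself are omitted. It shows why eight copies are used: 8 × 32 = 256 rows, which matches the fixed input height of G.

```python
import random

LINE_H, LINE_W = 32, 280    # binary text line image size (height x width, from the text)
G_SIZE = 256                # fixed input size of G (from the text)
COPIES = G_SIZE // LINE_H   # 8: the line image plus its seven duplicates


def sample_text(corpus, length=10, rng=random):
    """Pick `length` consecutive characters from a random position in the corpus."""
    start = rng.randrange(len(corpus) - length + 1)
    return corpus[start:start + length]


def mosaic(line):
    """Stack COPIES copies of one text line image vertically (256 x 280 before resizing)."""
    assert len(line) == LINE_H and all(len(row) == LINE_W for row in line)
    return [row[:] for _ in range(COPIES) for row in line]


def split(image):
    """Split an image into COPIES equal-height sub-images, from top to bottom."""
    h = len(image) // COPIES
    return [image[i * h:(i + 1) * h] for i in range(COPIES)]
```

In the described pipeline, the mosaic would be resized to 256 × 256 before being passed to G, and one of the eight sub-images of the output would be resized back to 32 × 280.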
We use a cGAN named pix2pix (7) to train G to learn mapping from binary text images to ID card text images. The process of training G is shown in Fig. 2. G learns to translate binary text images x to synthetic ID card text images G(x) that cannot be distinguished from the corresponding real ID card text images y by an adversarially trained discriminator (D), while simultaneously D learns to classify between fake {G(x), x} and real {y, x}.
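The objective of the pix2pix network can be expressed as

$$G^{*} = \arg\min_{G}\max_{D}\ \mathcal{L}_{cGAN}(G, D) + \lambda\,\mathcal{L}_{L1}(G), \tag{1}$$

where, following pix2pix, (7) the adversarial term is $\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x}[\log(1 - D(x, G(x)))]$ and the reconstruction term is $\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y}[\lVert y - G(x) \rVert_{1}]$. λ in Eq. (1) controls the relative importance of the two objectives. We set λ to 100 to encourage the output of G to be less blurry. Figure 3 illustrates some samples generated by G, which is trained with 613 real ID card text images and their corresponding binary text images.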
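To make the role of λ concrete, the following sketch evaluates the generator side of a pix2pix-style objective (an adversarial term plus λ times an L1 term) on toy numbers. It is an illustration only, not the training code used in this work: the function names are hypothetical, images are flattened into lists of pixel values, the discriminator outputs are assumed to be probabilities in (0, 1), and the L1 term is averaged over pixels, which is an assumption about normalization.

```python
import math


def l1_loss(y, g_x):
    """Mean absolute difference between a real image y and a generated image G(x)."""
    return sum(abs(a - b) for a, b in zip(y, g_x)) / len(y)


def generator_objective(d_fake, y, g_x, lam=100.0):
    """Generator-side objective: mean log(1 - D(x, G(x))) + lam * L1(y, G(x)).

    d_fake holds the discriminator outputs for the fake pairs {G(x), x}.
    Only this part of the adversarial loss depends on G.
    """
    adversarial = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return adversarial + lam * l1_loss(y, g_x)
```

With λ = 100, a mean pixel error of 0.5 contributes 50 to the objective, whereas the adversarial term for an undecided discriminator (output 0.5) contributes only log 0.5 ≈ −0.69, so the L1 term dominates early in training.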

CRNN architecture (3)
A CRNN is an end-to-end trainable neural network for image-based sequence recognition, whose architecture consists of three components: convolutional layers, recurrent layers, and a transcription layer from bottom to top, as shown in Fig. 4. The convolutional layers automatically extract a feature sequence $x = x_1, \ldots, x_T$ from each input image, where T is the sequence length. The recurrent layers predict a label distribution $y_t$ for each frame $x_t$. The transcription layer converts the per-frame predictions $y = y_1, \ldots, y_T$ into a label sequence l. Mathematically, transcription is finding the label sequence l that maximizes P(l|y), where P(l|y) is defined in the CTC layer proposed by Graves et al. (16) We denote the training dataset by $\mathcal{X} = \{I_i, l_i\}_i$, where $I_i$ is the training image and $l_i$ is the ground-truth label sequence. The objective is to minimize the negative log-likelihood of the conditional probability of the ground truth:

$$\mathcal{O} = -\sum_{(I_i,\, l_i) \in \mathcal{X}} \log P(l_i \mid y_i),$$

where $y_i$ is the sequence produced from $I_i$ by the recurrent and convolutional layers. The objective function calculates a cost value directly from an image and its ground-truth label sequence, so the network can be trained end-to-end on pairs of images and sequences.
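The transcription step can be illustrated with greedy (best-path) CTC decoding, a common approximation to finding the label sequence that maximizes P(l|y): take the most probable label in each frame, collapse consecutive repeats, and drop the blank label. The sketch below is a minimal illustration; the function name, the toy alphabet, and the choice of index 0 for the blank are assumptions, and the text does not state which decoding strategy was actually used.

```python
def ctc_greedy_decode(frame_probs, alphabet, blank=0):
    """Best-path CTC decoding.

    frame_probs: list of per-frame label distributions y_1, ..., y_T.
    alphabet:    list mapping a label index to a character; alphabet[blank] is unused.
    """
    # Most probable label index in each frame.
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]

    decoded = []
    previous = blank
    for k in best_path:
        # Emit a character only when it is not blank and not a repeat of the previous frame.
        if k != blank and k != previous:
            decoded.append(alphabet[k])
        previous = k
    return "".join(decoded)
```

A blank between two identical labels is what allows a doubled character to survive the collapsing step, so the best path 日, 日, blank, 日, 曰, 曰, blank decodes to "日日曰".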

Improved feature sequence extraction module
The CRNN was originally designed for English character recognition, and its architecture of convolutional layers is based on the VGG-VeryDeep architecture, (23) which is prone to losing fine spatial features. Compared with English characters, Chinese characters have a more sophisticated spatial structure and a more similar appearance. To improve Chinese text recognition accuracy, a novel feature sequence extraction module based on a dense convolutional network (DenseNet) architecture (24) is proposed in this paper as shown in Fig. 5. Figure 6 illustrates a five-layer dense block with a growth rate of k = 4. We assume that the network comprises L layers, each of which implements a nonlinear transformation $H_\ell(\cdot)$, where ℓ indexes the layer and $H_\ell(\cdot)$ is a composite function of three consecutive operations: batch normalization (BN), (25) a rectified linear unit (ReLU), (26) and a 3 × 3 convolution (Conv). We denote the input feature map as $x_0$ and the output of the ℓth layer as $x_\ell$. Then,

$$x_\ell = H_\ell([x_0, x_1, \ldots, x_{\ell-1}]),$$

where $[x_0, x_1, \ldots, x_{\ell-1}]$ refers to the concatenation of the feature maps produced in layers 0, 1, ..., ℓ − 1.
If each function $H_\ell(\cdot)$ produces k feature maps, it follows that the ℓth layer has $\kappa_0 + k \times (\ell - 1)$ input feature maps, where $\kappa_0$ is the number of channels in the input layer. The proposed feature sequence extraction module has three dense blocks, with each block having eight layers. Before entering the first dense block, a convolution with 64 output channels is performed on input images. For convolutional layers with kernel size 3 × 3, each side of the inputs is zero-padded by one pixel to keep the feature map size fixed. The transition layer between two contiguous dense blocks consists of a BN layer and a 1 × 1 convolutional layer followed by a 2 × 2 average pooling layer. At the end of the last dense block, 4 × 1 average pooling is performed to extract the feature sequence. We set the text line image size to 32 × 280 and the growth rate to k = 8. The exact network configuration is shown in Table 1.
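The channel arithmetic inside a dense block can be checked with a short sketch. The helper names are hypothetical, and the example covers only a single block: the output channel counts of the transition layers are given in Table 1 and are not assumed here.

```python
def layer_input_channels(k0, k, layer):
    """Number of input feature maps of the given layer (1-indexed) in a dense block.

    k0 is the number of channels entering the block and k is the growth rate.
    """
    return k0 + k * (layer - 1)


def block_output_channels(k0, k, num_layers):
    """Channels after concatenating the block input with the outputs of all layers."""
    return k0 + k * num_layers


# First dense block of the proposed module: 64 input channels, growth rate 8, 8 layers.
print(layer_input_channels(64, 8, 8))   # 120
print(block_output_channels(64, 8, 8))  # 128
```

For the first dense block, which receives the 64 channels of the initial convolution, the eighth layer therefore sees 120 input feature maps and the block emits 128 channels.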

Implementation details
Approximately 2000 real ID card images taken by cellphone camera are provided by a construction company under a privacy agreement that prohibited us from revealing the full information of any individual. These images are taken from diverse angles and distances under various lighting conditions by different cellphone brands and models. We cut them into text line images, 613 of which are used to train the pix2pix network and 7824 are used to evaluate the performance of the improved CRNN. Experiments are carried out on a workstation with a 3.4 GHz Intel i7-330 CPU, 16 GB RAM, and an 8 GB NVIDIA GTX 1080 GPU.
The pix2pix network is implemented in TensorFlow 1.2.0. The optimization method is Adam with a learning rate of 0.0002 and momentum parameters β1 = 0.5 and β2 = 0.999. The batch size is set to 1. The maximum number of iterations is set to 100k. After 90k iterations, $\mathcal{L}_{L1}(G)$ becomes less than 0.1. When the training is done, the method proposed in Sect. 2 is used to generate synthetic ID card text line images. The synthetic dataset contains 6.6 million images covering 7265 types of characters in total.
The improved CRNN is implemented in Caffe with CUDA 8.0 and cuDNN 5.6. The optimization method is Nesterov with a learning rate of 0.0001, a momentum of 0.9, and γ of 0.5. The batch size is set to 64. To verify that the synthetic data is sufficiently realistic to substitute for real data, we use only the synthetic data for training and only real data for testing. The training process takes about 20k iterations to reach convergence. Test images are scaled to a height of 32, and the image width is scaled proportionally with the height. Images that are narrower than 280 pixels after scaling are zero-padded to a width of 280 pixels. Figure 7 shows samples correctly recognized in the test dataset. It can be seen from Fig. 7 that even if the text line images are affected by noise, blur, uneven illumination, perspective distortion, a complex background, and so forth, the improved CRNN can still accurately recognize the text in the images and maintain good robustness.
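The test-time size rule can be written as a small helper. This is a sketch under assumptions: the function name is hypothetical, the scaled width is rounded to the nearest integer, and only the amount of padding is computed, since the text does not state on which side the padding is applied.

```python
def scaled_size(height, width, target_height=32, min_width=280):
    """Return (new_height, new_width, pad) for a test text line image.

    The image is scaled to target_height with the aspect ratio preserved;
    pad is the number of zero columns needed to reach min_width.
    """
    new_width = max(1, round(width * target_height / height))
    pad = max(0, min_width - new_width)
    return target_height, new_width, pad


print(scaled_size(64, 400))  # (32, 200, 80)
print(scaled_size(48, 600))  # (32, 400, 0)
```

A 64 × 400 crop is scaled to 32 × 200 and needs 80 columns of padding, while a 48 × 600 crop becomes 32 × 400 and is used without padding.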

Results
To further verify the effectiveness of the improved CRNN, we compare it with the original CRNN in quantitative and qualitative analyses. We use two metrics to quantitatively evaluate the recognition performance: (1) the average character recognition accuracy (CRA), which is computed from the longest common subsequence (LCS) of the predicted label sequence and the ground-truth label sequence; (2) the line recognition accuracy (LRA), i.e., the percentage of text line images correctly recognized, where a text line image is correctly recognized if no character is misidentified.
Table 2 shows the text recognition accuracies of the improved and original CRNNs. Compared with the original CRNN, CRA is increased from 96.87 to 98.57% and LRA is increased from 65.92 to 90.10% for the improved CRNN. Table 3 lists some images with different recognition results. From the qualitative perspective, the improved CRNN can correctly recognize easily confused Chinese characters in the case of a complex background (a and b in Table 3) and a slanting text line (c and d) owing to its higher feature extraction capability than the original CRNN.
To analyze the shortcomings of the improved CRNN, we list some incorrectly recognized samples in Table 4. The images in Table 4 show that recognition errors are mainly caused by low resolution (a and b), out-of-focus blur (c), and character interference (d). Thus, we still need to design a much finer network structure that can extract fine-grained features.
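The two metrics can be sketched as follows. The normalization of CRA is an assumption made for illustration: here each line contributes its LCS length divided by the ground-truth length, and the per-line values are averaged, which may differ from the exact definition used in this work. LRA is taken as the fraction of lines whose prediction matches the ground truth exactly. The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def cra(predictions, truths):
    """Average per-line LCS length divided by ground-truth length (assumed normalization)."""
    return sum(lcs_length(p, t) / len(t) for p, t in zip(predictions, truths)) / len(truths)


def lra(predictions, truths):
    """Fraction of lines recognized without any error (exact string match)."""
    return sum(p == t for p, t in zip(predictions, truths)) / len(truths)
```

For the predictions "北京市" and "土地" against the ground truths "北京市" and "士地", this sketch gives a CRA of 0.75 and an LRA of 0.5, which illustrates why a single confused character lowers LRA much more than CRA.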

Conclusions
In this paper, we propose a novel CRNN for Chinese character recognition from ID card images taken by cellphone cameras that integrates the advantages of both the CRNN architecture and the DenseNet architecture. The CRNN is capable of taking input images of various dimensions and producing predictions of different lengths. It runs directly on coarse-level labels, requiring no detailed annotations for each individual element in the training phase. DenseNet allows feature reuse throughout the network and can consequently learn more compact and accurate internal representations. We have also designed a synthetic data engine based on a conditional generative adversarial network to generate millions of synthetic ID card text line images, which retain the inherent template pattern of ID card images while preserving the diversity of the synthetic data. We evaluate the performance of the proposed method on more than 7000 real ID card text line images, and the experimental results demonstrate that the improved CRNN model trained only on the synthetic dataset can increase the recognition accuracy of Chinese text in cellphone-acquired low-quality images. Specifically, compared with the original CRNN, the average CRA is increased from 96.87 to 98.57% and the LRA is increased from 65.92 to 90.10%. The proposed Chinese text recognition method has been used to read personal information from cellphone-acquired ID card images in an employee management system of a construction company, which adopts manual interaction to ensure the accuracy of the input information. ID card images whose quality does not meet the requirements are returned by the administrator to the users for resubmission.