Fine-grained Vehicle Classification Technology Based on Fusion of Multi-convolutional Neural Networks

With the development of cities, the rapid growth of vehicle ownership has given rise to traffic violations and traffic safety problems resulting in casualties. Therefore, with the rise of intelligent transportation in smart cities, intelligent traffic video monitoring systems have attracted considerable attention, and industry and academia have begun to consider how to add intelligent functions to video monitoring systems. The intelligent functions of current intelligent traffic video monitoring systems focus on, for example, object detection and tracking and abnormal-situation alarms. Because the core functions of intelligent traffic video monitoring systems involve vehicle detection and fine-grained classification, research in this area is very difficult; few mature products offer a fine-grained vehicle classification function, and little practical research has addressed this issue. In this paper, we propose an approach based on the fusion of convolutional neural networks (CNNs) to solve the problem of vehicle detection and fine-grained classification. In fine-grained vehicle classification, the differences within a class are often greater than the differences between classes; as a result, the accuracy of conventional classifiers is not sufficiently high to achieve efficient fine-grained vehicle classification.


Introduction
As an application of object detection, vehicle detection has been widely used in many industries. After a vehicle is detected, fine-grained classification is required to identify it. Owing to the small differences between fine-grained classes, and because the differences within a class are often greater than those between classes, fine-grained vehicle classification is difficult to study. At present, there are no practical research results on this challenging topic.
Fine-grained classification is a subdomain of object identification, and its main purpose is to distinguish various subclasses under the same basic category. Different from the general coarse classification of objects, fine-grained classification distinguishes objects that are visually very similar. Take fine-grained dog classification as an example: poodles and bichons are very similar in appearance, and fine-grained classification aims to accurately identify the different dog breeds.
At present, research on fine-grained classification has moved from image-level representation to mining semantic component-level representation, looking for clues in details. Training methods range from fully end-to-end methods to the integration of a variety of methods, all of which make full use of prior knowledge in the training process. However, a large number of studies have focused on birds, (1)(2)(3)(4)(5)(6)(7)(8) cats, (4) flowers, (4) aircraft, (8) dogs, (1,4,6,9) pedestrians, (10) and actions. (11) Relevant studies have also been carried out on the fine-grained classification of vehicles, (4,5,8,9) but there has been no specific research on fine-grained vehicle classification for traffic applications. Few studies have addressed fine-grained vehicle classification owing to the lack of relevant standard data sets. Although in some early studies (12) vehicles were classified in a fine-grained way, those studies were mainly limited to the front and back views of vehicles: after the license plate was detected, the region of interest (ROI) was extracted to generate a feature vector for classification. Stark et al. (13) also achieved good results in fine-grained vehicle classification using the deformable part model (DPM). Prokaj and Medioni (14) used a 3D model of a vehicle to estimate its pose, projected it onto the 2D plane, and then used the scale-invariant feature transform (SIFT) operator to compare different vehicles so as to classify them at a fine granularity; their approach can mitigate the problem of inaccurate classification. Krause et al. (15) used 3D CAD models to train shape classifiers, further improving on the classification results of Prokaj and Medioni. (14) Lin et al. (16) proposed the use of a 3D active shape model to capture vehicle trademark components so as to achieve fine-grained vehicle classification, and obtained better classification results than other methods on their FG3DCar data set. Krause et al. (17) proposed the use of the convolutional neural network (CNN) method to identify the distinctive components of a vehicle, using these salient parts to achieve fine-grained classification. All of the above works are based on the premise that the incoming vehicle images are "pure" images without complex backgrounds. Recently, Yang et al. (18) proposed the use of a CNN for fine-grained vehicle classification followed by regression of parameters. Krause et al. (19) also used the R-CNN method, combined with joint segmentation and automatic part localization, to address the problem of some components being unmarked.
In fact, vehicles are rigid and have their own characteristics, such as a unified structure. Each vehicle is composed of several fixed types of components, and the relative position between components is fixed. In addition, vehicles have symmetrical characteristics that can be applied to fine-grained classification models. Therefore, designing a fine-grained classification model applicable to vehicles by studying the existing fine-grained classification model and combining the characteristics of vehicles is theoretically and practically supported. So far, no fine-grained vehicle classification methods are applicable to practical transportation. Therefore, it is necessary to study this topic. We attempt to solve the problem of fine-grained vehicle classification through deep learning.

Analytic Structure
Fine-grained vehicle classification is defined as classifying vehicles by brand, car series, model, and year, for example, identifying a vehicle as a "Buick-Weilang-Sedan-2018." Because classification is carried out within subdivided categories, the differences between objects are often very small, and most of the time the differences within a fine-grained class are larger than those between fine-grained classes. Thus, to achieve fine-grained classification, the core consideration is to find the significant characteristics of each small class and then to identify rich semantic components. (20)(21)(22)(23) The problem of fine-grained vehicle classification is to identify the various components of the vehicle and distinguish them in accordance with their differences. Therefore, as long as a component detector can be trained for each of the different vehicle components, the positions of the components and the confidence of the vehicle model for each component can be detected from the input image, and the results of the component detectors can then be combined to determine which fine-grained class the vehicle image most likely belongs to. In this way, the fine-grained vehicle classification problem with high inter-class similarity is transformed into a vehicle component classification problem with larger differences.

Vehicle component detection model
A vehicle is a rigid body with a fixed structure that can be divided into 13 components: ceiling, headlight, inlet gate, hood, front windshield, tail lights, rear windshield, rearview mirror, front side door, back side door, trunk, front logo, and back logo. The 13 different components of the vehicle require 13 different component detectors to be trained. When designing the different component detection models, the improved Faster R-CNN model is still adopted.
Taking the headlight detection model as an example, processing proceeds in the following order: vehicle image, headlight-detection deep learning model, and headlight subimage, as seen in Fig. 1.
The model shown in Fig. 1 can detect the positions of vehicle components in the vehicle image. Each part of the vehicle is represented by a rectangular area. Owing to the different sizes of the parts, the proportion of each area in the whole vehicle image differs. Therefore, different components adopt different anchor sizes in the candidate area extraction network, i.e., the region proposal network (RPN). Table 1 shows the average percentage of the rectangular area representing each component in the whole vehicle rectangular area and the length-to-width ratio of the rectangular area representing each part. The purpose of these statistics on the average area percentage and aspect ratio of the different components is to optimize the anchor settings in the RPN, so that the RPN can scan the whole image in accordance with these statistical data when extracting candidate areas, reducing the number of invalid candidate areas. Only in this way can the RPN achieve the best detection accuracy and speed.
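The idea of deriving anchor settings from the statistics in Table 1 can be sketched as follows. This is an illustrative sketch, not the paper's actual configuration: the image size, scale brackets, and the example headlight statistics (about 2% of the vehicle box, roughly 2:1 wide) are assumed values for demonstration.

```python
import math

def anchors_for_component(area_fraction, aspect_ratio, image_area=600 * 600):
    """Derive (width, height) anchor sizes for one component's RPN.

    area_fraction: the component's average share of the vehicle bounding box
    aspect_ratio:  the component's average width-to-height ratio
    """
    expected_area = area_fraction * image_area
    anchors = []
    for scale in (0.5, 1.0, 2.0):  # bracket the mean component size
        area = expected_area * scale
        h = math.sqrt(area / aspect_ratio)
        w = aspect_ratio * h
        anchors.append((round(w), round(h)))
    return anchors

# Hypothetical headlight statistics: ~2% of the vehicle box, about 2:1 wide.
headlight_anchors = anchors_for_component(0.02, 2.0)
```

Matching the anchor sizes to each component's measured size and shape means the RPN proposes far fewer boxes that could never contain that component.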
After a component is detected by a component detector, it is necessary to judge which vehicle model the component belongs to. For example, when a headlight is detected, it is necessary to further determine the type of vehicle and the corresponding confidence level. Here, a perceptual hash algorithm (24) is used to compare the detected component with template component images, as follows. (1) Shrink the component image to 8 × 8 pixels. (2) Convert the shrunken image to grayscale. (3) Calculate the average gray value of the 64 pixels. (4) Compare each pixel's gray value with the average: those smaller than the average value are denoted as 0, and those greater than or equal to the average value are denoted as 1. (5) Combine the results of the comparison in step (4) into a 64-bit string in order from top to bottom and left to right; this 64-bit string is the fingerprint of the image. (6) Compare the two image fingerprints position by position, count the number of positions in which the characters are the same, and divide this count by 64 to obtain the similarity of the two images (equivalently, 64 minus the Hamming distance, divided by 64). The final result of the headlight test described in Fig. 1 indicates that the headlight belongs to a Buick Weilang Sedan 2018, with a confidence level of 0.9.
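The fingerprint and similarity steps above can be sketched in a few lines. This minimal sketch assumes the component image has already been reduced to an 8 × 8 grid of grayscale values (the resizing and grayscale conversion of steps (1) and (2) would be done with an imaging library in a real pipeline).

```python
def fingerprint(gray8x8):
    """Build the 64-bit fingerprint from 64 grayscale values listed
    row by row (top to bottom, left to right): pixels at or above
    the mean become '1', the rest '0'."""
    avg = sum(gray8x8) / 64
    return "".join("1" if p >= avg else "0" for p in gray8x8)

def similarity(fp_a, fp_b):
    """Fraction of the 64 bit positions where two fingerprints agree,
    i.e., (64 - Hamming distance) / 64."""
    matches = sum(a == b for a, b in zip(fp_a, fp_b))
    return matches / 64

# Two identical half-dark, half-bright grids produce identical fingerprints.
fp = fingerprint([0] * 32 + [255] * 32)
```

A detected component can then be compared against stored template fingerprints, with the best similarity giving the candidate vehicle model and its confidence.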

Fine-grained vehicle classification model based on fusion of multi-CNNs
For the quick and efficient detection of the 13 vehicle components using CNNs, the input image is the output of an improved Faster R-CNN, which retains the vehicle region after the background has been removed. A test image after Faster R-CNN detection is shown in Fig. 2. The whole vehicle image is input into the 13 vehicle component detectors to obtain 13 confidence detection results. The confidence data obtained from the voting for the five vehicle models are shown in Table 2. As can be seen from the voting results in Table 2, the total votes for the 5 types of vehicles are, respectively, 6.7, 0.6, 0.7, 1.5, and 0.8, and the maximum voting total of 6.7 determines the fine-grained class of the image. Therefore, in this case, the fine-grained class of the image is Toyota Corolla Sedan 2017. Using Eq. (1), the corresponding confidence level is found to be 6.7/8 = 0.8375.
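The voting step can be sketched as follows. Eq. (1) itself is not reproduced here; on one plausible reading of the worked example (a winning total of 6.7 divided by 8), the final confidence is the winning model's vote total divided by the number of components actually detected, and that is the reading this sketch implements. The vote values below are assumed for illustration.

```python
from collections import defaultdict

def fuse_votes(component_votes):
    """component_votes: one (model_label, confidence) pair per detected
    component. Sums the confidence-weighted votes per model, picks the
    model with the highest total, and normalizes that total by the
    number of detected components (the reading of Eq. (1) above)."""
    totals = defaultdict(float)
    for model, conf in component_votes:
        totals[model] += conf
    winner = max(totals, key=totals.get)
    return winner, totals[winner] / len(component_votes)

# Hypothetical example: 8 components detected, 7 voting Corolla (total 6.7).
votes = [("Toyota-Corolla-Sedan-2017", 0.95)] * 6 + [
    ("Toyota-Corolla-Sedan-2017", 1.0),
    ("Toyota-Camry-Sedan-2018", 0.6),
]
winner, confidence = fuse_votes(votes)
```

With these assumed votes the winner is the Corolla with confidence 6.7/8 = 0.8375, mirroring the worked example above.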

Experiment
Because there are thousands of vehicle models in use, a huge amount of data would be needed to identify the 13 vehicle components across all of them. In this experiment, we therefore train on a series of five models: Buick Regal Sedan 2017, Buick Weilang Sedan 2015, Toyota Corolla Sedan 2017, Toyota Camry Sedan 2018, and Porsche Macan SUV 2017.
For the vehicle components, about 2000 images were collected in total, an average of about 400 images per model, with 300 images per model used to train the component detectors and 100 images used as the verification set. In the fine-grained classification test stage, 96 images of the above 5 models and 4 images of the Zhongtai SR9 were input for classification testing, and the accuracy was 68%. Correctly classified examples are shown in Fig. 3.
As seen in Fig. 3, the fine-grained classification model successfully identified the Toyota Corolla and Camry and the Buick Regal and Weilang. During the experiment, some typical errors arose, as shown in Fig. 4.
In the results of the experiment shown in Fig. 4, the left image correctly classifies the Porsche Macan, but the right image mistakenly classifies the Zhongtai SR9 as a Porsche Macan. The Zhongtai SR9 imitates the appearance of the Porsche Macan, differing in only some minor details, and the resolution of the image is not sufficiently high. These details are therefore difficult to learn, leading to a classification error.

Improvement
Although this method can classify vehicles with 68% accuracy, it is still far from being a practical application owing to the following problems.
(1) So far, the experiment has only been conducted on five models; how the accuracy changes as new models are added has not yet been established. (2) Classification is too slow. The average classification speed is about three images per second, which is far from the requirement of real-time performance. The main reason is that 13 independent CNNs are used for component detection; the operations of the different CNNs are not shared, so there are many repeated calculations.
To further verify the performance of the fine-grained classification model, experiments were carried out on the CompCars (25) data set. Since the vehicle component images in CompCars only include headlights, tail lights, and air inlet gates, only a three-component detection network could be used. The fine-grained classification results of the experiment are shown in Table 3.
It can be seen from Table 3, comparing the top-1 and top-5 classification accuracies of the two models, that the fine-grained classification model adopted in this study is superior to the CompCars classification model. We also expanded the CompCars data set. Since CompCars does not include images of vehicle ceilings, hoods, front windshields, rear windshields, rearview mirrors, front side doors, back side doors, trunks, front logos, and back logos, we added 21,458 images to the CompCars data set and then used the 13 component detectors in the fine-grained classification model to conduct our experiment. The classification results are shown in Table 4.
It can be seen from Table 4 that, after the expansion of the CompCars data set, the fine-grained classification performance of the original CompCars classification model on the new data set changed minimally (both top-1 and top-5 accuracies decreased slightly). In contrast, the top-1 and top-5 classification accuracies of the model in this work improved greatly. The main reason is that the increased number of component detectors in the fine-grained classification method allows more details to be compared.
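The top-1 and top-5 metrics reported in Tables 3 and 4 follow the standard definition: a test image counts as correct at top-k if its true label is among the k classes with the highest predicted scores. A minimal sketch, with the scores and labels below assumed for illustration:

```python
def top_k_accuracy(score_dicts, true_labels, k):
    """score_dicts: one dict per test image mapping class label -> score.
    Returns the fraction of images whose true label is in the top k."""
    hits = 0
    for scores, truth in zip(score_dicts, true_labels):
        top_k = sorted(scores, key=scores.get, reverse=True)[:k]
        hits += truth in top_k
    return hits / len(true_labels)

# Hypothetical two-image example: the second image is only correct at top-2.
scores = [{"a": 0.9, "b": 0.05, "c": 0.05},
          {"a": 0.3, "b": 0.6, "c": 0.1}]
labels = ["a", "a"]
```

Top-5 is always at least as high as top-1, which is why the two columns are reported together.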

Conclusions
In this work, we studied fine-grained vehicle classification technology based on the fusion of multi-CNNs. For vehicle images with complex backgrounds, the model first detects the vehicle area and then inputs the area into the fine-grained classification model for classification. This method filters the input of the fine-grained classification model, reduces the noise interference, and significantly improves the accuracy and speed of fine-grained classification.
We proved that the fusion of multi-CNNs can achieve fine-grained vehicle classification. The proposed method divides the vehicles into 13 components, trains one detector for each part, and then votes in accordance with the test results of the 13 components to classify the input image. The experimental results show that this method is effective, but the classification speed has yet to be improved. It is hoped that this study will provide a reference for the application of the fine-grained vehicle classification technology based on the fusion of multi-CNNs.