Application of Convolutional Neural Network (CNN)–AdaBoost Algorithm in Pedestrian Detection

Pedestrian detection based on vision sensors is a hot and difficult issue in the field of autonomous driving. The large amount of data processing leads to high requirements for the robustness and real-time performance of the employed algorithm. The aggregate channel feature (ACF) algorithm is one of the widely recognized fast pedestrian detection algorithms, but it misses many detections when the target is occluded or small. In response to this problem, we propose a pedestrian detection algorithm based on a combination of a five-layer convolutional neural network structure and an AdaBoost classifier (CNN–AdaBoost). The model was trained using the Caltech and INRIA datasets, and detection experiments were performed using collected videos. The results show that the error detection rate of the proposed algorithm is greatly reduced compared with that of the ACF algorithm, while the detection speed is basically unchanged. Compared with the locally decorrelated channel features (LDCF) algorithm, the proposed algorithm achieves similar detection accuracy with greatly improved detection efficiency.


Introduction
Vision sensors can provide high-resolution color information, which can more accurately reflect the details of complex changes in light. Therefore, pedestrian detection based on vision sensors has wide application in many fields such as the military, traffic, and security fields. Because pedestrians vary in scale, motion, and pose, and their appearance is easily influenced by factors such as clothes, sunlight, shielding, and viewing angle, pedestrian detection is a difficult and hot issue with major challenges.
The key factor restricting the application of pedestrian detection methods in intelligent driving is the large amount of data processing, leading to high requirements for the robustness and real-time performance of the employed algorithm. Currently, the basic pedestrian detection methods can be divided into two categories from the perspective of the feature acquisition method: (1,2) one is the traditional machine learning method based on artificial features, and the other is the deep learning method based on convolutional neural network (CNN) features. The basic framework of traditional machine learning methods includes feature extraction and classifiers. Features here mainly include the histogram of oriented gradients (HOG), (3) local binary pattern (LBP), (4) deformable part model (DPM), (5) and aggregate channel feature (ACF). (6) Classifiers include the support vector machine (SVM), decision tree (DT), random forest (RF), and AdaBoost. The basic framework of deep learning methods includes a deep CNN, which is used for feature extraction, and a classifier; in typical structures such as GoogLeNet, ZFNet, AlexNet, VGGNet, and ResNet, the classifier is generally an ordinary fully connected neural network. R-CNN, YOLO, and other deep learning detection frameworks have better pedestrian detection performance than traditional machine learning, (7,8) but the training of their models requires hardware with high computing power and massive datasets. The training is time-consuming, and it is difficult to perform training tasks using ordinary PCs. Moreover, large datasets are not easy to obtain. Owing to the lack of a theoretical foundation, the design of a network's hyperparameters is also a considerable challenge. For target detection with a small dataset, traditional machine learning methods are usually better than deep learning.
The ACF algorithm proposed by Dollar et al. is one of the widely recognized fast pedestrian detection algorithms. (6) It is based on integral channel features (ICF), (9) and uses an AdaBoost classifier composed of 2048 two-layer DTs. The locally decorrelated channel features (LDCF) algorithm is based on the ACF algorithm and uses linear discriminant analysis (LDA) to obtain the final LDCF features. (10) The weak classifier used is a DT with a depth of five layers, and the total number of cascaded weak classifiers is 4096. The missed detection rate of the LDCF algorithm tested on the Caltech dataset reached 29.8%, about 16.2% less than that of the ACF algorithm. However, its missed detection rate was still large, especially when there were small or occluded pedestrian targets in the test. Ma and Gao proposed a combination of the LDCF algorithm and a CNN, (11) with the LDCF algorithm used to obtain region proposals, then the CNN used to extract features, and an SVM used to classify the extracted features. Zhang et al. used a region proposal network (RPN) in a Faster R-CNN to extract region proposals, (12) and then used boosted forests to classify features. Mao et al. added a VGG-16 network on the front end of a Faster R-CNN to obtain additional channel features. (13) Ouyang and Wang proposed the joint deep method, (14) which uses an SVM as the first-level detector and a CNN to further determine its detection results. The above methods use a deep CNN to improve the detection accuracy, but also require large datasets and long-term training. The real-time performance of these algorithms also requires advanced hardware support.
On the basis of the above research, the combination of a deep CNN and traditional machine learning to improve the performance of pedestrian detection is currently the most popular technical route. However, a problem with this approach is how to effectively reduce the depth of the CNN while improving the detection accuracy in order to reduce the dependence of the algorithm on the dataset and hardware. In response to this problem, we propose a pedestrian detection method (CNN-AdaBoost) based on an AdaBoost classifier combined with a CNN feature extractor. First, we refer to the fast R-CNN framework to improve the detection efficiency. (15) In view of the high miss rate of the AdaBoost classifier in the ACF algorithm, we propose a negative sample retrieval strategy to improve it. Second, we design a five-layer CNN, which is used as a feature extractor to improve the detection rate of small pedestrian targets. The rest of the paper is organized as follows. In Sect. 2.1, we introduce the overall framework, detection, and training process, and then in Sect. 2.2, we introduce the negative sample retrieval strategy and the structure of the five-layer CNN. The experimental results are given in Sect. 3. Conclusions are given in Sect. 4.

Basic framework of the algorithm
The overall architecture of the CNN-AdaBoost algorithm proposed in this paper is shown in Fig. 1. It mainly includes four parts: a fast feature pyramid part, a region proposal selection part, a CNN feature extraction part, and a feature processing part.
In the detection phase, multiscale ACF features are first computed from a color image using a fast feature pyramid. The region proposal section in the upper branch of Fig. 1 uses a fixed-size sliding window (the red rectangular frame on the fast feature pyramids picture in Fig. 1) to extract the ACF features layer by layer from the bottom to the top of the feature pyramid. The ACF features are expanded into a feature vector, and the targets and non-targets are filtered step by step through an AdaBoost classifier. For the non-targets, we use a negative sample retrieval strategy to make the position of each non-target a candidate region again. This will not affect the overall detection efficiency of the AdaBoost classifier and, at the same time, it can effectively reduce its false detection rate. The CNN feature extraction part in the lower branch of Fig. 1 extracts the L color space features (the green rectangular frame in the fast feature pyramids picture in Fig. 1) of the LUV color space layer by layer from the bottom to the top of the pyramid, and the features extracted by the CNN have better expressive power for small targets than ACF features. The task of region of interest (ROI) feature extraction is to obtain the data of the feature map in the corresponding position of the proposal region, and finally, the feature is classified by the fully connected layer and Softmax. In the training phase, AdaBoost and the CNN are trained separately using conventional methods. AdaBoost is trained using ACF features, and the CNN is trained using L color space features.
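The layer-by-layer scan with a fixed-size sliding window can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the layer sizes, window size, and stride are hypothetical values chosen for the example, and the index 0 simply denotes the first layer scanned.

```python
def sliding_windows(layer_h, layer_w, win_h, win_w, stride):
    """Yield (row, col) top-left corners of every window that fits in a layer."""
    for r in range(0, layer_h - win_h + 1, stride):
        for c in range(0, layer_w - win_w + 1, stride):
            yield r, c


def pyramid_windows(layer_shapes, win_h, win_w, stride):
    """Scan the pyramid layer by layer; yield (layer_index, row, col)."""
    for idx, (h, w) in enumerate(layer_shapes):
        for r, c in sliding_windows(h, w, win_h, win_w, stride):
            yield idx, r, c


# Hypothetical pyramid with three layers of shrinking size (not from the paper).
layers = [(12, 16), (8, 10), (4, 6)]
windows = list(pyramid_windows(layers, win_h=4, win_w=2, stride=2))

# Number of window positions visited on each layer.
counts = [sum(1 for i, _, _ in windows if i == k) for k in range(len(layers))]
print(counts)
```

Because the window size is fixed, smaller layers contribute far fewer positions, which is why each window on a small layer covers a larger part of the original image.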

Negative sample retrieval strategy and CNN structure
Ohn-Bar and Trivedi showed that AdaBoost is different from the CNN, and it is difficult to further improve its detection performance by increasing the size of the trained pedestrian dataset and the depth of the DT. (16) To maintain the depth of the ACF algorithm's DT, a negative sample retrieval strategy is proposed in this paper. The basic idea of the strategy is to reselect, as proposal regions, the regions that the AdaBoost classifier has detected as negative samples. Specifically, a scale threshold and a rank threshold are set. The scale threshold is set according to the number of layers of the fast feature pyramid; a value smaller than the threshold indicates that the scale is close to the top of the pyramid. The rank threshold is set according to the number of cascaded AdaBoost strong classifiers; a value greater than the threshold indicates that the rank is near the end classifier. When the scale of the pyramid layer that the sliding window is on is less than the scale threshold and the rank of the activated strong classifier is greater than the rank threshold, if the result of the strong classifier is a non-pedestrian target, the sliding window position of the non-pedestrian is used as a proposal region.
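The two-threshold rule described above can be sketched as a small filter. This is a minimal sketch under stated assumptions: the dictionary keys, the threshold values, and the reading of "rank of the activated strong classifier" as the rank at which a window was rejected are all hypothetical choices made for illustration.

```python
def select_proposals(windows, scale_threshold, rank_threshold):
    """Return the windows kept as proposal regions.

    Each window is a dict with (hypothetical) keys:
      'scale'         : scale value of the pyramid layer the window lies on;
                        smaller values are closer to the top of the pyramid
      'rejected_rank' : rank of the strong classifier that labeled the window
                        non-pedestrian, or None if it was never rejected
    """
    proposals = []
    for w in windows:
        if w["rejected_rank"] is None:
            # Accepted by every strong classifier: an ordinary proposal.
            proposals.append(w)
        elif w["scale"] < scale_threshold and w["rejected_rank"] > rank_threshold:
            # Rejected late in the cascade on a layer near the top:
            # retrieve this negative sample as a proposal region.
            proposals.append(w)
    return proposals


# Hypothetical windows and thresholds (not values from the paper).
candidates = [
    {"id": "a", "scale": 1, "rejected_rank": None},  # accepted
    {"id": "b", "scale": 1, "rejected_rank": 9},     # retrieved
    {"id": "c", "scale": 1, "rejected_rank": 2},     # rejected early: dropped
    {"id": "d", "scale": 5, "rejected_rank": 9},     # scale too large: dropped
]
kept = [w["id"] for w in select_proposals(candidates, scale_threshold=3, rank_threshold=6)]
print(kept)
```

Only windows that survive most of the cascade are retrieved, so the early rejection that makes the cascade fast is left untouched.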
For pedestrian detection using on-board cameras, we hope to find a CNN with a simple structure, a scale small enough for training on ordinary PCs, and a detection speed that can meet real-time requirements. Here, we employ a five-layer CNN, where the size of the convolution kernel is 9 × 9 and the neighborhood of the maximum pooling method is 2 × 2. By adjusting the number of convolution kernels in the third layer, we obtain four different CNNs. After the preliminary training, a test experiment is carried out. The training mean square error curve of each CNN is shown in Fig. 2 and Table 1. It can be seen that CNN2 has the highest speed and accuracy, so CNN2 is subsequently used in this study.
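To make the effect of the 9 × 9 kernels and 2 × 2 pooling concrete, the feature-map sizes through the first four layers can be traced as below. This is a sketch under assumptions that the text does not state: unpadded convolution with stride 1, non-overlapping pooling, a hypothetical 64 × 32 input patch, and 6 and 12 kernels in the first and third layers. The fifth layer is not modeled.

```python
def conv_valid(h, w, k):
    """Output size of an unpadded, stride-1 convolution with a k x k kernel."""
    return h - k + 1, w - k + 1


def max_pool(h, w, p):
    """Output size of non-overlapping p x p pooling."""
    return h // p, w // p


def feature_shapes(h, w, kernels_layer1=6, kernels_layer3=12, k=9, p=2):
    """Return (channels, height, width) after each of the first four layers."""
    shapes = []
    h, w = conv_valid(h, w, k)
    shapes.append((kernels_layer1, h, w))   # layer 1: convolution
    h, w = max_pool(h, w, p)
    shapes.append((kernels_layer1, h, w))   # layer 2: max pooling
    h, w = conv_valid(h, w, k)
    shapes.append((kernels_layer3, h, w))   # layer 3: convolution
    h, w = max_pool(h, w, p)
    shapes.append((kernels_layer3, h, w))   # layer 4: max pooling
    return shapes


# Hypothetical 64 x 32 single-channel input patch (not a size from the paper).
shapes = feature_shapes(64, 32)
channels, out_h, out_w = shapes[-1]
feature_length = channels * out_h * out_w  # length of the flattened feature vector
print(shapes, feature_length)
```

Changing `kernels_layer3` scales the length of the final feature vector linearly, which is one way to see why the third-layer kernel count trades accuracy against speed.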

Results and Discussion
The experimental hardware platform is an Intel(R) Core(TM) i3-2370M CPU (2.4 GHz, 6 GB RAM) and the experimental software platform is the Windows 7 operating system with MATLAB R2015b. The color vehicular camera used has a frame rate of 24 fps and a resolution of 640 × 480. The experimental training and test data are shown in Table 2. The training of the AdaBoost classifier uses the Caltech training set, and the training of the CNN uses a combination of the Caltech training set and the INRIA training set (Table 2). The test set uses the INRIA test set, Caltech test set, and collected videos. The video test set we used was obtained on the campus through a color vehicular camera. The ACF and LDCF algorithms are used for comparison. Figure 3 shows the miss rate-false positives per image (FPPI) curves of the Caltech test set and INRIA test set. Figures 4-6 show the test results for different scenes in the video test set. The left column is the detection result of the ACF algorithm, the middle column is the detection result of the LDCF algorithm, and the right column is the detection result of the proposed algorithm. It can be seen from Fig. 3 that the CNN-AdaBoost algorithm performs better on the Caltech dataset than on the INRIA dataset, and its detection performance on the two datasets is generally better than that of the ACF and LDCF algorithms.
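For reference, the two quantities plotted in a miss rate-FPPI curve can be computed from detection counts as sketched below. The definitions are the standard ones (miss rate as the fraction of ground-truth pedestrians not detected, FPPI as false positives divided by the number of images); the counts in the example are hypothetical and are not results from this paper.

```python
def miss_rate(true_positives, false_negatives):
    """Fraction of ground-truth pedestrians that were not detected."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("no ground-truth pedestrians")
    return false_negatives / total


def false_positives_per_image(false_positives, num_images):
    """Average number of false detections per image (FPPI)."""
    if num_images <= 0:
        raise ValueError("num_images must be positive")
    return false_positives / num_images


# Hypothetical counts at one detection threshold (not from the paper).
mr = miss_rate(true_positives=70, false_negatives=30)
fppi = false_positives_per_image(false_positives=50, num_images=200)
print(mr, fppi)
```

Sweeping the detector's score threshold and recomputing both values at each setting gives one point of the curve per threshold.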
In Fig. 4, the pedestrians have different sizes. All three methods can correctly detect large pedestrian targets, but the ACF algorithm misses small targets. The detection of small targets by the LDCF algorithm is improved compared with the ACF algorithm, but when the small targets are close together, they are sometimes not detected. However, the proposed method can still distinguish different targets in this case. In Fig. 5, there are mutually occluded targets. The ACF and LDCF algorithms can generally detect large occluded targets, but in the case of small occluded targets, missed detection and false detection occur. However, the proposed method can still detect the occluded targets in these cases. There are deformed targets in Fig. 6. Although all three methods can correctly detect deformed targets at a short distance, the ACF and LDCF algorithms generally fail to detect small deformed targets, while the proposed method can still detect them.
The missed detection rate, the average number of false detections per frame, and the detection efficiency are evaluated for the video set, and the results are shown in Table 3. The ACF algorithm has the highest detection speed, but its false detection rate is the highest, with the average number of false detections per frame reaching 3.784. The LDCF algorithm has the fewest false detections, but its missed detection rate is higher than that of the proposed method and its detection efficiency is low: its average detection time per frame is 0.1255 s. The proposed CNN-AdaBoost algorithm has a lower missed detection rate than the ACF algorithm while maintaining a high detection speed. Its average detection time per frame is 0.0809 s, which is close to the speed of the ACF algorithm. From the experimental results, it is concluded that, in the proposed CNN-AdaBoost algorithm, the AdaBoost and CNN computations remain independent and parallel. The CNN increases the algorithm complexity compared with those of the ACF and LDCF algorithms, but we optimized the structure of the five-layer CNN by simplifying its input data (using the fast pyramid algorithm in the L color space) and performing feature extraction (using the sliding window method of AdaBoost to obtain candidate regions and obtaining feature vectors from the corresponding CNN output). These improvements enable the real-time performance of the algorithm.
Table 1. Layer 1: conv layer (convolution kernel), 3rd-order tensor C: 6 × 9 × 9. Layer 2: pooling layer (maximum pooling), matrix P: 2 × 2. Layer 3: C: 5 × 9 × 9, C: 12 × 9 × 9, C: 18 × 9 × 9, or C: 27 × 9 × 9 (one per CNN variant). Layer 4: pooling layer (P).
Because the probability of misclassification of AdaBoost's strong classifier is high when dealing with features near the top of the pyramid, the negative sample retrieval strategy is used to feed such false positives to the CNN to reidentify them, which overcomes the bottleneck due to the strategies used to improve the AdaBoost classifier performance (increasing the size of the trained pedestrian dataset and increasing the depth of the DT). This also makes the algorithm more robust to complex conditions such as occlusion and deformation.
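The per-frame detection times reported in this section can be converted into frame rates with a short calculation. The sketch below uses only the two times and the camera frame rate stated in the text; the function and variable names are illustrative.

```python
def frames_per_second(seconds_per_frame):
    """Convert an average per-frame processing time into a frame rate."""
    if seconds_per_frame <= 0:
        raise ValueError("seconds_per_frame must be positive")
    return 1.0 / seconds_per_frame


T_PROPOSED = 0.0809  # s per frame, CNN-AdaBoost (from the text)
T_LDCF = 0.1255      # s per frame, LDCF (from the text)
CAMERA_FPS = 24      # camera frame rate (from the text)

fps_proposed = frames_per_second(T_PROPOSED)
fps_ldcf = frames_per_second(T_LDCF)
speedup_over_ldcf = T_LDCF / T_PROPOSED

print(f"proposed: {fps_proposed:.1f} fps, LDCF: {fps_ldcf:.1f} fps")
print(f"speedup over LDCF: {speedup_over_ldcf:.2f}x, camera: {CAMERA_FPS} fps")
```

On the stated hardware this corresponds to roughly 12 frames per second for the proposed method and roughly 8 for LDCF, which can be read against the 24 fps frame rate of the camera.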

Conclusions
In this paper, an AdaBoost classifier is combined with a CNN to realize a novel pedestrian detection method (CNN-AdaBoost). This method uses the fast feature pyramid of the ACF algorithm to calculate the features of each channel. In each layer of pyramid features, the CNN only extracts features of the L color space. Each proposal region is obtained through a fixed-size sliding window in AdaBoost combined with a negative sample retrieval strategy. This avoids the low feature extraction efficiency of a CNN sliding window. At the same time, it retains the high efficiency of the AdaBoost classifier and the strong classification performance of the CNN. The experimental results show that the method is more efficient and accurate than the ACF and LDCF algorithms in detecting small pedestrian targets, which illustrates the effectiveness of the CNN-AdaBoost method. The proposed method uses a phased training method, which increases the burden of the model training phase, so further improvements to the method will be explored in future research.