Forward Collision Warning and Lane-mark Recognition Systems Based on Deep Learning

In this study, a driver assistance system that uses a network model based on deep learning technology was developed. It has forward collision warning and lane-mark recognition features. The application uses a webcam to capture forward images, which are transferred to a computer in which object recognition has been implemented. The system information is displayed on smart glasses through the network as an augmented reality image. You Only Look Once (YOLO) real-time object detection (tiny YOLOv2) was used as the main architecture to reduce the network complexity and enhance computing efficiency. During the training process, K-means was used to select the anchor box from each dataset. This enabled the size of the predicted box to be determined as a reference to enhance efficiency. This system makes it possible for the driver of a vehicle to learn about the movements and positions of vehicles ahead with respect to distance and lane marks. This reduces the chance of collisions as well as the violations of traffic regulations and improves driving safety.


Introduction
According to the 2018 World Health Organization (WHO) global road safety report, the number of deaths from road traffic accidents continues to rise. Around 1.35 million people die each year from traffic accidents, and the report emphasizes that road accidents are the main killer of children and adolescents. (1) One of the main causes of road accidents is that drivers pay too little attention to the vehicles on the road in front of them. This may happen because the driver has been distracted or visibility is poor. The problem may also be the fault of the driver. A vehicle, pedestrian, or obstruction may suddenly appear, and a collision results when the driver reacts too slowly or cannot take corrective action soon enough. The most common dangerous encounters are with another automobile, a pedestrian, or a motorcycle. Whether the fault lies with the driver, a pedestrian, or another vehicle, the sudden appearance of an object in front of a moving vehicle is a key cause of accidents. Accidents also often result from a violation of regulations such as noncompliance with a traffic sign or traffic lights. Although many of these violations are intentional, many others occur because the driver fails to notice a traffic sign or traffic lights. The problem may be environmental (bad weather) or the result of poor design and inadequate traffic control. In any case, paying too little attention to the road ahead, to traffic signs, and particularly to the vehicle directly in front is dangerous.
The development of technology for autonomous vehicle control has been rapid, and according to the Boston Consulting Group (BCG), (2) the size of the global autonomous vehicle market will reach 42 billion US dollars by 2025. The sale of automated vehicles will account for 12.4% of the overall market, and the market scale will double by 2035. At present, the definition of autonomous vehicles in the industry generally complies with the J3016 standard of the Society of Automotive Engineers (SAE). This standard has six levels (0 to 5) depending on the degree of vehicle automation. Autonomous vehicles rely mainly on technologies such as the advanced driver assistance system (ADAS) and the Internet of Vehicles (IoV). For ADAS, forward collision warning (FCW) is the key to solving the problem of a lack of attention to the road ahead and the vehicle in front. Also, road sign recognition (RSR) is a key to the resolution of violations of traffic regulations. However, most research is concerned with standing road signs and traffic lights, and signs on the ground are neglected. Often, too little attention is paid to zebra crossings and yellow box junctions. Failure, for any reason, to observe the instructions given by road-level signs can cause serious accidents that harm both the driver at fault and others on the road.
Early data processing algorithms had two drawbacks: features had to be extracted manually, and the computers available could not handle the large amounts of data involved. However, the graphics processing units (GPUs) now widely available, as well as modern fast multicore processors, have raised computing efficiency by orders of magnitude. Artificial intelligence (AI) is now booming, and deep learning (DL) has been rejuvenated to become the most popular AI technology and a serious market focus.
Visual technology is another focus of the entire technology circle and has been a leading trend for years. Major manufacturers have launched augmented reality (AR), virtual reality (VR), and mixed reality (MR) products, and these new technologies have been widely applied in various fields. Spectacular and attractive products, often with special hardware, are available to the public. The market is growing rapidly. According to a forecast by Digi-Capital, (3) the AR market will reach 70 to 75 billion US dollars in 2023, and the VR market will reach 10 to 15 billion US dollars. AR technology has huge business potential.

Related Work
This section covers four topics: DL, object detection, FCW, and lane-mark recognition.

DL
DL is a branch of machine learning (ML). Its algorithms use an artificial neural network (ANN) architecture that learns features from the input data. The concept of an ANN can be traced back to 1943, when a paper by neuroscientist Warren S. McCulloch and mathematician Walter Pitts appeared in the Bulletin of Mathematical Biology. (4) In this paper, the concept of an ANN was proposed, as well as a mathematical model for artificial neurons. This started the era of research into neural networks, and in 1958, Rosenblatt released the Perceptron, (5) the first ANN model, and laid a foundation for serious ANN research. In 1974, Werbos proposed backpropagation (BP) (6) to solve the exclusive-or (XOR) problem, which was beyond the capability of the basic perceptron. In 1986, Rumelhart et al. provided a more comprehensive description of BP (7) that caused a boom in ANN research. However, a few years later, another BP problem was revealed: the vanishing gradient. This caused stagnation in ANN research until 2006, when Hinton, the father of DL, proposed the restricted Boltzmann machine (RBM) (8) and the deep belief network (DBN). (9) This solved the vanishing gradient problem, and the deep neural network (DNN) became known as DL. However, the computing process was run on central processing units (CPUs) at that time. The huge DL calculations strained the capabilities of CPUs, and once again, there was a lull in progress. In 2012, at the ImageNet image recognition competition, two of Hinton's students used a GPU plus a deep convolutional neural network (DCNN) (10) to win the championship. At that time, GPU operation speed had reached more than 70% of that of CPUs. After this competition, DL became one of the hottest current technologies.

Object detection
There are two types of object detection algorithm: traditional and DL algorithms. The traditional algorithm story began in 2001, when Viola and Jones published a paper on object detection (11) that combined three techniques for face detection: the integral image, AdaBoost, and the cascade classifier. Together these achieved very good recognition. There are two approaches to DL algorithms. One approach is R-CNN, (12) proposed by Girshick et al. In this approach, algorithms first generate the candidate regions and then classify them. Although these algorithms have high accuracy, they are slow. The other approach includes You Only Look Once (YOLO), (13) proposed by Redmon et al., and SSD, (14) offered by Liu et al. These algorithms predict the location and category probability of the object directly, and although the accuracy is lower, they are much faster than the R-CNN series.

FCW
Sensing devices are an indispensable core technology for FCW. These include cameras and ultrasonic, radar, and optical sensors. Cameras provide rich image data and allow the category of a forward object to be identified. This study concentrates on image-based FCW, which has been the subject of many other studies in recent years. Among these, Song et al. (15) used stereo imaging to detect objects in forward vision and also employed UV disparity for image segmentation in the implementation of an FCW system. Mukhtar et al. (16) selected the sensor first, followed by the detection and tracking of vehicles, and finally provided the best option for a collision avoidance system.

Lane-mark recognition
Research on lane-mark recognition has been done by Gupta and Choudhary (17) and Mathibela et al. (18) Gupta and Choudhary (17) used a single camera to first select a region of interest (ROI) image, and then applied grayscale conversion and smoothing before foreground detection and segmentation. The connected components of the processed image were then found using principal component analysis (PCA), and classification was done using spatio-temporal incremental clustering (STIC) to check whether each component was a traffic lane line or a ground sign. This was followed by graph embedding Grassmann discriminant analysis (GGDA) to recognize ground-level signs. Mathibela et al. (18) focused on the recognition of line-type ground signs. They first divided the signs into seven categories, then used inverse perspective mapping (IPM) to convert the input images into aerial view images. The node position of each image was then determined and used to classify the images. Final classification of the extracted ground road signs was done by random undersampling boost (RUSBoost) and conditional random field (CRF) algorithms.

Methodology
This section describes anchor box selection, the Tiny YOLO network, and FCW and lane-mark recognition. The system flow chart is shown in Fig. 1.

Anchor box selection
In the original YOLO, the predicted box was calculated by using the output of the neural network directly as the box values. However, prediction using this method was inefficient, and in YOLOv2 the anchor box was introduced, mainly to provide a reference for the size of the predicted box, which improved calculation efficiency. In this study, the K-means algorithm was used to select the anchor boxes, that is, to cluster all the ground truth boxes into the required number of anchor boxes and determine their sizes. Since the only anchor box values needed are width and height, the center coordinates are not calculated. Instead, the center points of all ground truth boxes are set to the same position, so that the clustering depends only on box size. The steps are shown in the classification flow chart in Fig. 2.
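The clustering described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the text does not state the distance measure, so the sketch assumes the 1 − IOU distance commonly used with YOLOv2, and the names `iou_wh` and `kmeans_anchors` are hypothetical.

```python
import random


def iou_wh(box, centroid):
    """IOU of two (width, height) boxes that share the same center point."""
    w, h = box
    cw, ch = centroid
    intersection = min(w, cw) * min(h, ch)
    union = w * h + cw * ch - intersection
    return intersection / union


def kmeans_anchors(boxes, k, iterations=100, seed=0):
    """Cluster (width, height) pairs into k anchor boxes.

    Assumption: boxes are assigned to the centroid with the highest IOU,
    which is equivalent to minimizing the distance 1 - IOU.
    """
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda i: iou_wh(box, centroids[i]))
            clusters[best].append(box)
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                mean_w = sum(w for w, _ in members) / len(members)
                mean_h = sum(h for _, h in members) / len(members)
                new_centroids.append((mean_w, mean_h))
            else:
                # Keep the old centroid if no box was assigned to it.
                new_centroids.append(centroids[i])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sorted(centroids)


# Illustrative ground truth box sizes in pixels (made-up values).
example_boxes = [(10, 10), (11, 11), (12, 10), (100, 50), (110, 55), (105, 52)]
example_anchors = kmeans_anchors(example_boxes, k=2)
```

Because only width and height enter the IOU, the result is independent of where the ground truth boxes sit in the image, which matches the decision to ignore the center coordinates.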

Tiny YOLO network
(1) Network architecture: The network architecture used in this study was based on that of Tiny YOLO, which was selected to reduce network complexity, enhance computing efficiency, and allow real-time computation. The architecture has nine convolutional layers and six pooling layers. Batch normalization (BN) (19) is added to the first eight convolutional layers, and Leaky ReLU (LReLU) (20) is used as the activation function. Figure 3 shows the architecture of the Tiny YOLO network.
(2) YOLOv2 principle: The main idea of YOLOv2 is to divide a picture of equal width and height into an S × S grid, predict B boxes in each grid cell, and, for each box, predict the center coordinates, the size, the confidence, and the probabilities of C categories (see Fig. 4).
Equation (1) follows from the above and gives the final predicted value. The settings S = 13, B = 5, and C = 6 are used for the demonstration.
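As a worked example of these settings, the sketch below counts the values the network must output. It assumes the usual YOLOv2 layout, in which each predicted box carries four box values, one confidence, and C category probabilities; the function name `yolo_output_shape` is hypothetical.

```python
def yolo_output_shape(s, b, c):
    """Shape of the prediction tensor for an S x S grid, B boxes, C categories.

    Assumption: each box holds 4 box values (x, y, w, h), 1 confidence,
    and C category probabilities.
    """
    values_per_box = 4 + 1 + c
    return (s, s, b * values_per_box)


demo_shape = yolo_output_shape(13, 5, 6)
demo_total = demo_shape[0] * demo_shape[1] * demo_shape[2]
```

Under this assumption, the demonstration settings give 55 values per grid cell and 13 × 13 × 55 = 9295 values in total.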
The confidence is defined in Eq. (2), where Pr(Object) indicates whether the predicted box contains an object: if there is no object, Pr(Object) = 0; if there is an object, Pr(Object) = 1. Since the network used in this study was Tiny YOLO, the accuracy is relatively low compared with other networks; therefore, for the experimental results, the confidence threshold was set to 0.45. Figure 5 illustrates the intersection over union (IOU). The green frame is the ground truth box, the blue frame is the predicted box, the area enclosed by the red line is the union, and the area enclosed by the purple line is the intersection.
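The IOU and the confidence threshold can be illustrated with a short sketch. It assumes boxes given as (x_min, y_min, x_max, y_max) corner coordinates and detections stored as dictionaries with a `confidence` key; both conventions, and whether the 0.45 threshold is inclusive, are assumptions rather than details taken from the study.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


def keep_confident(detections, threshold=0.45):
    """Keep detections whose confidence reaches the threshold (assumed inclusive)."""
    return [d for d in detections if d["confidence"] >= threshold]


overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))  # intersection 1, union 7
kept = keep_confident([{"confidence": 0.9}, {"confidence": 0.3}])
```

Two boxes that do not overlap have an IOU of 0, and two identical boxes have an IOU of 1.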
(3) Loss function: The main purpose of the loss function is to evaluate the difference between the predicted value and the actual value. The loss should approach 0 by the end of training so that the neural network predicts well. The YOLO loss function uses the mean square error (MSE): squaring the difference between the predicted value and the actual value makes every term positive, so positive and negative errors cannot cancel each other.
The bounding box loss function was divided into four parts: the central coordinate loss, the size loss, the confidence loss, and the category probability loss. The confidence loss was further divided into the case with an object and the case with no object, as shown in Eq. (4).
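The four-part structure can be sketched for a single predicted box as follows. This is only an illustration of the structure described above, not the exact form of Eq. (4): the weights `lambda_coord` and `lambda_noobj` and their default values are assumptions borrowed from the original YOLO formulation, and the dictionary layout is hypothetical.

```python
def squared_error(predicted, actual):
    """Squared difference, so positive and negative errors cannot cancel."""
    return (predicted - actual) ** 2


def bounding_box_loss(pred, truth, has_object, lambda_coord=5.0, lambda_noobj=0.5):
    """Sum-squared-error loss of one predicted box.

    pred and truth are dicts with keys x, y, w, h, conf, and probs.
    Assumption: the weights lambda_coord and lambda_noobj are illustrative.
    """
    if not has_object:
        # Only the no-object confidence term applies to empty boxes.
        return lambda_noobj * squared_error(pred["conf"], 0.0)
    center = squared_error(pred["x"], truth["x"]) + squared_error(pred["y"], truth["y"])
    size = squared_error(pred["w"], truth["w"]) + squared_error(pred["h"], truth["h"])
    confidence = squared_error(pred["conf"], truth["conf"])
    category = sum(squared_error(p, t) for p, t in zip(pred["probs"], truth["probs"]))
    return lambda_coord * (center + size) + confidence + category


target = {"x": 0.5, "y": 0.5, "w": 2.0, "h": 1.0, "conf": 1.0, "probs": [1.0, 0.0]}
perfect_loss = bounding_box_loss(dict(target), target, has_object=True)
empty_loss = bounding_box_loss({"conf": 0.4}, {}, has_object=False)
```

A perfect prediction gives a loss of 0, and the total loss over an image would be the sum of this quantity over all grid cells and predicted boxes.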

FCW and lane-mark recognition
(1) Object recognition: The neural network obtains the predicted box by calculation, but its central coordinates, size, and confidence are not extracted directly from the results. If the results of the prediction are used directly as settings for the predicted box, effective predictions cannot be properly calculated. There are four parts (settings) to the predicted box: the central coordinates, the size, the confidence, and the category probability. A predicted box diagram is shown in Fig. 6.
The first part is the central coordinates of the predicted box.

$$\mathrm{box}^{p}_{x} = \sigma(t_x) + c_x \tag{5}$$

$$\mathrm{box}^{p}_{y} = \sigma(t_y) + c_y \tag{6}$$

In Eqs. (5) and (6), $\mathrm{box}^{p}_{x}$ and $\mathrm{box}^{p}_{y}$ are the X and Y coordinates of the center of the predicted box, respectively. $t_x$ and $t_y$ are the predicted values of the X and Y coordinates of the center of the predicted box, respectively. $c_x$ and $c_y$ are the X and Y coordinates of the upper left of the grid cell to which the current predicted box belongs, respectively. The sigmoid function $\sigma$ limits $t_x$ and $t_y$ to between 0 and 1, which prevents the center of the box from moving into another grid cell.
The second part is the size of the predicted box.

$$\mathrm{box}^{p}_{w} = \mathrm{box}^{a}_{w}\, e^{t_w} \tag{7}$$

$$\mathrm{box}^{p}_{h} = \mathrm{box}^{a}_{h}\, e^{t_h} \tag{8}$$

In Eqs. (7) and (8), $\mathrm{box}^{p}_{w}$ and $\mathrm{box}^{p}_{h}$ are the width and height of the predicted box, $\mathrm{box}^{a}_{w}$ and $\mathrm{box}^{a}_{h}$ are the width and height of the anchor box, and $t_w$ and $t_h$ are the predicted values of the width and height of the predicted box, respectively. An exponential function is used to implement scaling to avoid the value being too high. The predicted value is limited to between −0.2 and 0.2 via a Gaussian distribution. In Eq. (9), $\mathrm{box}^{p}_{o}$ is the confidence of the predicted box and $t_o$ is its predicted value.
The last part is the category probability of the predicted box. In Eq. (10), $\mathrm{box}^{p}_{c}$ is the category probability of the predicted box and $t_c$ is its predicted value.
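The four parts of the predicted box can be put together in a short decoding sketch. The sigmoid on the center and the exponential scaling of the anchor follow the description above. The use of a sigmoid for the confidence and a softmax for the category probabilities is an assumption based on the standard YOLOv2 formulation, and the function and argument names are hypothetical.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def softmax(values):
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def decode_box(t, cell_xy, anchor_wh):
    """Turn raw predictions t = [t_x, t_y, t_w, t_h, t_o, t_c...] into a box.

    cell_xy is the upper-left corner (c_x, c_y) of the grid cell and
    anchor_wh is the anchor box size, both in grid units.
    """
    t_x, t_y, t_w, t_h, t_o = t[:5]
    c_x, c_y = cell_xy
    a_w, a_h = anchor_wh
    return {
        "x": sigmoid(t_x) + c_x,        # center stays inside its grid cell
        "y": sigmoid(t_y) + c_y,
        "w": a_w * math.exp(t_w),       # anchor size scaled exponentially
        "h": a_h * math.exp(t_h),
        "confidence": sigmoid(t_o),     # assumption: sigmoid, as in YOLOv2
        "probs": softmax(t[5:]),        # assumption: softmax, as in YOLOv2
    }


decoded = decode_box([0.0] * 11, cell_xy=(3, 4), anchor_wh=(2.0, 1.5))
```

With all raw predictions equal to 0, the decoded box sits at the center of its grid cell and has exactly the size of the anchor box, which shows how the anchor acts as the reference size.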
(2) FCW decision: In addition to object recognition, this system includes FCW to alert the driver to the approach of an object on the road ahead. YOLOv2 can find an object and its position in a picture. The distance between the object's bounding box and the bottom edge of the picture is an indication of how close the object is to the camera. The smaller this distance, the closer the object. When the bounding box is close to the top of the picture, the object is far away. A decision model can be established using this feature as follows.
In Eq. (11), $\mathrm{object}^{btm}_{y}$ is the Y-axis coordinate of the bottom of the predicted box for the forward object and $\mathrm{image}_{h}$ is the height of the picture. When the bottom of the forward object passes the set value, the system warns the driver. Since this study was carried out in an urban area, the average speed was about 40 to 50 km/h, and a warning was expected when the distance between the object and the camera was 3 to 5 m. The set value used was 75% of the total image height. A schematic diagram of the object collision warning system is shown in Fig. 7. The yellow part is the warning area, and when the image of the object enters this area, the system sends a warning.
Figure 8 shows a hardware architecture diagram, which can be divided into three parts: an input device, a computing device, and a display device. The input device used in this study was a Logitech C525 HD webcam, the computer was a D830MT computing platform with a GTX 1080 Ti GPU for DL, and the AR display device was a pair of BT-350 smart glasses.
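The FCW decision rule can be sketched in a few lines. The sketch assumes image coordinates whose Y axis starts at the top of the picture, so a larger bottom coordinate means a closer object, and it assumes a strict comparison against the set value; the function name is hypothetical.

```python
def forward_collision_warning(object_btm_y, image_h, set_ratio=0.75):
    """Return True when the bottom of the predicted box enters the warning area.

    Assumption: Y is measured downward from the top edge of the picture,
    and the warning area is the part of the picture below set_ratio * image_h.
    """
    return object_btm_y > set_ratio * image_h


near_vehicle = forward_collision_warning(object_btm_y=900, image_h=1080)
far_vehicle = forward_collision_warning(object_btm_y=500, image_h=1080)
```

For a picture 1080 pixels high, the set value of 75% places the warning line at 810 pixels, so a box whose bottom edge is at 900 pixels triggers a warning and one at 500 pixels does not.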

Dataset
The dataset used for training in this study consisted of images recorded by a driving recorder. Each image was 1920 × 1080 pixels. Three sets of data were used: forward objects, nighttime forward objects, and ground road signs. The detailed information is shown in Table 1.

Anchor box setting
K-means was used to calculate the anchor boxes for each of the three datasets. The results were scaled to a 13 × 13 grid so that the size of the anchor boxes matched that of the last layer of the network. The experimental results are shown in Table 2, and the schematics of the prior anchors are shown in Figs. 9-11.
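One way to carry out this scaling is sketched below. It assumes that the anchor sizes from K-means are in pixels of the original image and that width and height are scaled independently to the 13 × 13 output grid; the study does not spell out the exact procedure, and the function name is hypothetical.

```python
def scale_anchor_to_grid(anchor_wh, image_wh, grid_size=13):
    """Convert an anchor from image pixels to units of output grid cells.

    Assumption: each axis is scaled independently by grid_size / image size.
    """
    width, height = anchor_wh
    image_w, image_h = image_wh
    return (width / image_w * grid_size, height / image_h * grid_size)


# An anchor covering half of a 1920 x 1080 image in each direction.
half_image_anchor = scale_anchor_to_grid((960, 540), (1920, 1080))
```

An anchor that covers the whole image maps to 13 × 13 grid cells, and the half-size anchor above maps to 6.5 × 6.5.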

Network model
Three network models based on the Tiny YOLO network were built: a forward object network, a night forward object network, and a ground road sign network. For the Tiny YOLO network, the input image must first be resized to 416 × 416 pixels. The network model parameters are shown in Table 3, where Conv9 (1) is the output layer of the forward object network, Conv9 (2) is the output layer of the night forward object network, and Conv9 (3) is the output layer of the ground road sign network.

Training process
The end of network training was determined mainly by the loss function. When the loss reached a certain value, or when the loss curve flattened, training stopped. The loss curves of the three networks used in this study are shown in Fig. 12.
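A stopping rule of this kind can be sketched as follows. The target loss, window length, and tolerance are hypothetical values chosen for illustration; the study does not report the thresholds that were used.

```python
def should_stop(losses, target_loss=0.1, window=5, tolerance=1e-3):
    """Decide whether training should stop, given the loss history so far.

    Stops when the latest loss reaches target_loss, or when the last
    `window` losses vary by less than `tolerance` (a flat loss curve).
    All three thresholds are illustrative assumptions.
    """
    if not losses:
        return False
    if losses[-1] <= target_loss:
        return True
    if len(losses) < window:
        return False
    recent = losses[-window:]
    return max(recent) - min(recent) < tolerance


reached_target = should_stop([1.0, 0.5, 0.05])
still_improving = should_stop([1.0, 0.9, 0.8, 0.7, 0.6])
flattened = should_stop([0.5, 0.5, 0.5, 0.5, 0.5])
```

In a training loop, the function would be called after each epoch with the loss values recorded so far.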

Recognition results and AR
The system speed in the FCW and lane-mark recognition experiments was 20 fps. The results of the daytime FCW, nighttime FCW, and lane-mark recognition experiments are shown in Figs. 13-15, respectively. The training dataset and the trained network models were used to carry out mean average precision (mAP) experiments, and the results are shown in Table 4.
The results of the computer calculations were shared over the network via Screenleap, and the smart glasses connected to the shared page to obtain the current driving image and the data from the two systems (Fig. 16).

Conclusion
The system proposed in this study has two parts with different functions: one is an FCW system and the other is for lane-mark recognition. In the experiments, the forward object network and the lane-mark network were each trained with its own dataset and with a set of 10 anchor boxes determined separately by K-means from that dataset. A third forward object network model was trained for application at night. For this model, the number of anchor boxes was set to 6, which improved the performance of nighttime recognition. A camera captured the image of the road ahead, and the three trained network models were used to predict the type and position of forward objects and ground road signs. The FCW system issues a driver alert if an object is determined to be too close to the camera. The results of the FCW and lane-mark recognition systems were transmitted over the Internet to BT-350 smart glasses and displayed using AR technology. This gave the driver a view of the road ahead with information about possible hazards and warnings of danger to enhance safety.