Object Detection of Road Facilities Using YOLOv3 for High-definition Map Updates

Autonomous driving technology relies heavily on the fusion of high-definition (HD) maps and sensors. Therefore, the construction and update of HD maps must be emphasized to achieve full driving automation. Herein, we propose a method that detects road facilities in images for HD map updates using the You Only Look Once version 3 (YOLOv3) algorithm. The proposed approach, a deep-learning-based object detection method, uses transfer learning to detect road facility objects and to record road sections that require maintenance. To test the effectiveness of the detection method, we analyze video footage captured in the Korean road environment. The experimental results show that this method achieves a mean average precision (mAP) of 56.56 and can be used to update HD maps within a crowdsourcing framework.


Introduction
Autonomous driving technology is currently being developed owing to industry demand for a more robust detection system to ensure safety during driving. An example of such a detection system is the high-definition (HD) map. (1) HD maps store a 3D layout of road environment information that is collected in advance by a survey vehicle equipped with a mobile mapping system (MMS), which includes sensors such as an inertial navigation system, radio detection and ranging (radar) sensors, light detection and ranging (LiDAR) sensors, cameras, and global navigation satellite systems. HD maps thus extend what the onboard detection system can perceive. In complex road environments, HD maps enable a vehicle to recognize road facilities (e.g., road signs or traffic lights) and be aware of its surroundings. However, most HD map construction processes are currently performed manually, which is both costly and time-consuming. Hence, methods to achieve automatic HD map construction are being investigated actively. In particular, a system that automatically detects changes on roads and corrects them synchronously on a map is highly desirable. Although a significant amount of time and money is required to complete HD maps, updates that reflect volatile road conditions in real time are necessary to ensure safety during autonomous driving. Research pertaining to HD map updating therefore focuses on real-time systems that use a crowdsourcing framework. (2)(3)(4)(5) This framework can be divided into two stages: recognizing objects to identify changes and updating HD map features via an Internet server. This study aims to improve the first stage, identifying changes through object recognition, and to realize an HD map updating system for autonomous driving in the Republic of Korea. In particular, the proposed method explicitly addresses Korean road facilities based on the road and traffic signs of the Republic of Korea.
Methodologies for developing an efficient change detection system in a road environment are tested in this study. Notably, the You Only Look Once version 3 (YOLOv3) algorithm was utilized via transfer learning to detect traffic signs, road signs, and traffic lights in a video featuring a road environment in real time.
In Sect. 2, the features of our approach are reviewed on the basis of previous relevant studies. Section 3 presents the experimental data and the proposed method. Section 4 presents and discusses the results of the study. Finally, Sect. 5 provides conclusions, including suggestions for future research.

Background
HD maps provide road environment information and point of interest information, such as road alignment, lane classification, and road signs required for autonomous driving. To update these maps promptly and periodically, studies pertaining to the automatic detection of object changes are currently being conducted using LiDAR data and camera image data from sensor information to perform partial corrections. Road facility object detection using LiDAR data has been investigated extensively. (6)(7)(8)(9)(10) Hata and Wolf detected lanes by categorizing LiDAR point data for lanes and asphalt, (6) and Jo et al. attempted to identify whether a traffic sign has disappeared or has been added. (7) Ma et al. increased the accuracy of detecting and classifying road markings by applying a deep learning framework. (8) Pannen et al. constructed a framework that recognizes changes by detecting lanes and immediately providing a crowdsourced updated HD map. (9) Kim et al. used a point unit to determine whether shape change has occurred as well as to apply the change immediately; however, they did not specify the changed object. (10) Although using LiDAR data in such a manner provides outstanding accuracy in identifying the shape of an object, to maintain the up-to-dateness of the map, many vehicles equipped with MMS equipment in addition to LiDAR sensors are required to observe all roads in a wide area. However, the use of LiDAR equipment is costly and, therefore, not optimal for updates.
By contrast, because modern cameras offer a relatively high resolution, the footage captured through mobile devices can be easily used in image detection systems. Higher resolution footage enables the easy identification of objects using a single device while minimizing costs. Therefore, object detection research using camera footage data is being actively pursued. (11)(12)(13)(14)(15) Cai et al. conducted an accurate vehicle localization study by detecting lanes through a camera and matching the results with existing HD maps and global navigation satellite system results. (11) Choi et al. identified and utilized lane, lane endpoint, and road signs for localization, and Elfring et al. detected traffic signs. (12,13) Alcantarilla et al. detected changes via masking and performing a pixel-by-pixel comparison of road facility objects in an image; however, the method was limited by insufficient object identification. (14) Heo et al. detected changes by comparing the vector-type object of the HD map and the road facilities of camera image data through adversarial learning for HD map updates. (15) However, the method is not applicable to road signs and traffic lights and can only detect changes in road markings. The aim of this study is to simultaneously detect specific road facilities such as road surface markings, traffic lights, and signs, among other components of the HD map, using a single camera for HD map updates.

Area of study and data
To conduct the research, we used the road environment panoramic image artificial intelligence (AI) data provided by the AI Hub site of the Korean Intelligence Information Society Agency. (16) These data comprise 2711280 images of 189 types of static road environment objects, obtained while driving a total of 3400 km on the major roads in Seoul, the Republic of Korea, and are intended as training data for automatic recognition models. Among them, 14184 images containing objects were used for transfer learning to complete the algorithm for detecting road facilities. After excluding object types with too few instances for deep learning, 12 object types managed in the HD map remained; they are listed in Table 1.
The Ministry of Land, Infrastructure and Transport in Korea defined 14 layers, including 189 road facilities for HD maps. This research focused on just 12 road facilities. Our experimental data did not include all training images of the 189 road facilities. We selected 12 road facilities among the experimental data because they provided enough training data. The 12 road facilities covered three layers: road surface markings, traffic lights, and safety signs.
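The class selection described above can be sketched as a simple count-and-threshold filter. This is only an illustration: the flat (image, class) label layout, the class names, and the minimum instance count are hypothetical, since the actual label format and cutoff are not stated here.

```python
from collections import Counter


def select_classes(annotations, min_instances):
    """Return the class names that have at least `min_instances` labeled objects.

    `annotations` is an iterable of (image_id, class_name) pairs. This flat
    layout is a hypothetical stand-in for the dataset's real label format.
    """
    counts = Counter(class_name for _, class_name in annotations)
    return sorted(name for name, n in counts.items() if n >= min_instances)


def filter_annotations(annotations, kept_classes):
    """Drop every label whose class was not selected."""
    kept = set(kept_classes)
    return [(img, cls) for img, cls in annotations if cls in kept]


# Toy labels; the class names and the threshold of 3 are illustrative only.
toy = [
    ("img1", "traffic_light"), ("img1", "speed_limit_sign"),
    ("img2", "traffic_light"), ("img3", "traffic_light"),
    ("img3", "straight_arrow"), ("img4", "speed_limit_sign"),
    ("img5", "speed_limit_sign"),
]
kept = select_classes(toy, min_instances=3)
filtered = filter_annotations(toy, kept)
print(kept, len(filtered))
```

In this toy case, the class with a single instance is dropped and its label is removed from the training set, mirroring how under-represented road facilities were left out of the 12 target classes.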
High-performance computation is required to complete an algorithm that automatically detects road facilities by deep learning using camera footage. The data were first uploaded to and stored in an S3 bucket, which is a storage space in Amazon Web Services (AWS). Subsequently, a Python environment was established on AWS Elastic Compute Cloud (EC2) for data preprocessing and deep learning implementation. This process is shown in Fig. 1.

YOLOv3
Unlike two-stage object detection techniques, YOLO is a one-stage detector that simultaneously performs localization (computing where an object lies in the image) and classification (identifying the object). Because the image is processed in a single pass, detection can be performed in real time. (17) It is also widely used as a deep-learning-based object detection algorithm because of its high accuracy. In the localization process, the position of an object is indicated by a bounding box that describes its location and boundaries. YOLO partitions an image into several grid cells and detects one object for each cell, as illustrated in Fig. 2.
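The grid partitioning can be illustrated with a short sketch that finds the cell containing an object's center. The 416 x 416 input size and the 13 x 13 grid are assumptions chosen for illustration and are not taken from this study; the function name is likewise hypothetical.

```python
def responsible_cell(cx, cy, image_w, image_h, grid_size):
    """Return (col, row) of the grid cell that contains a box center.

    `cx` and `cy` are the center coordinates in pixels. Centers that fall
    exactly on the right or bottom image border are clamped to the last cell.
    """
    col = min(int(cx / image_w * grid_size), grid_size - 1)
    row = min(int(cy / image_h * grid_size), grid_size - 1)
    return col, row


# Assumed 416 x 416 input divided into a 13 x 13 grid (32-pixel cells).
print(responsible_cell(200, 100, 416, 416, 13))
```

With these assumed sizes, a center at pixel (200, 100) falls in column 6 and row 3, and that cell is the one expected to predict the object.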
In the standard YOLOv3 formulation, the center coordinates $(b_x, b_y)$, width $b_w$, and height $b_h$ of the bounding box are determined as follows:

$$
\begin{aligned}
b_x &= \sigma(t_x) + c_x, \\
b_y &= \sigma(t_y) + c_y, \\
b_w &= p_w e^{t_w}, \\
b_h &= p_h e^{t_h},
\end{aligned}
$$

where $t_x$, $t_y$, $t_w$, and $t_h$ are the raw network outputs, $(c_x, c_y)$ is the offset of the grid cell from the top-left corner of the image, $p_w$ and $p_h$ are the width and height of the prior (anchor) box, and $\sigma$ is the logistic sigmoid function.
Figure 3 shows the classification stage, in which the background and objects are differentiated and the class of each object is determined. YOLOv3 detects an object under the assumption that an object exists in each grid cell and predicts the final class of the cell based on the binary cross-entropy loss. Independent logistic classifiers are used instead of a softmax classifier, so each class receives its own probability; this enables the classification of hierarchical classes, such as a person and their gender (man/woman). In addition, an objectness score between 0 and 1 conveys the confidence in the final prediction and is used to determine whether the prediction should be recognized as an object. This model was trained on the Microsoft Common Objects in Context (MS COCO) dataset, (18) a set of various daily life photographs created for computer vision learning, in which each object is segmented and labeled.
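The box decoding and the per-class logistic scoring described above can be sketched in a few lines. This is a minimal illustration that assumes the standard YOLOv3 parameterization (sigmoid offsets inside a grid cell and exponential scaling of a prior box); the function names and example values are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Convert raw outputs (t_*) into a box center and size in grid units.

    (c_x, c_y) is the top-left offset of the grid cell and (p_w, p_h) is the
    size of the prior (anchor) box, following the assumed parameterization.
    """
    b_x = sigmoid(t_x) + c_x      # center stays inside the cell
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * math.exp(t_w)     # prior width scaled by exp(t_w)
    b_h = p_h * math.exp(t_h)
    return b_x, b_y, b_w, b_h


def class_scores(logits):
    """Independent logistic scores: one probability per class, no coupling."""
    return [sigmoid(z) for z in logits]


def softmax(logits):
    """Softmax scores, shown only for contrast: they always sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def binary_cross_entropy(targets, probs, eps=1e-12):
    """Mean binary cross-entropy over classes, with clipping for stability."""
    total = 0.0
    for y, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(targets)


print(decode_box(0.0, 0.0, 0.0, 0.0, 6, 3, 2.0, 3.0))
print(class_scores([2.0, 2.0, -2.0]), softmax([2.0, 2.0, -2.0]))
```

Because each logistic score is computed independently, two classes can both receive a high probability for the same box, which is what allows overlapping labels; softmax would force the classes to compete.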

Transfer learning
The pretrained YOLOv3 can recognize objects in up to 80 classes, such as people and dogs. However, because its training data contain few usable images of road facilities, the model must be trained further on such images. Therefore, transfer learning was used to train the YOLOv3 algorithm (17) to recognize objects in the road environment, such as road surface markings, traffic lights, and road signs. Using this method, we maintained the framework of YOLOv3 and enabled the algorithm to detect new objects while retaining its advantages. Transfer learning does not train the entire convolutional neural network (CNN) that extracts features from scratch; instead, it reuses the weights obtained when YOLO was originally trained as the starting point for learning the new target data. The new target data are then learned only in the fully connected layer, so the algorithm is completed in a shorter time. Hence, transfer learning is well suited to establishing an object detection model for an individual dataset. Even in an environment where sufficient training data are difficult to obtain, transfer learning can build on the previously learned results and provide a relatively high precision. (19)(20)(21)(22)(23) Furthermore, according to Yosinski et al., if an entire CNN is trained only with individual datasets, then the model may be biased. (24) Hence, transfer learning was considered appropriate for the analysis. Fine-tuning was performed to find the optimal hyperparameters of transfer learning by adjusting the size of training images, the batch size, and the number of epochs. Among the various experiments, two sets of the initial and final values are presented for comparison in Fig. 4. Table 2 presents the optimal learning rates found through the Keras callback function in the learning process.
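The idea of reusing previously learned weights and training only the final layer can be illustrated with a toy example. The sketch below is not the YOLOv3 training code: the tiny "backbone", its weights, the data, and the hyperparameters are all made up, and a single logistic unit stands in for the newly trained layer.

```python
import math

# Stand-in for pretrained feature-extractor weights (values are made up).
PRETRAINED_W = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]]


def extract_features(x, backbone_w):
    """Frozen feature extractor, ReLU(W x); its weights are never updated."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in backbone_w]


def predict(features, head_w, head_b):
    """New trainable head: a single logistic unit on top of the features."""
    z = sum(w * f for w, f in zip(head_w, features)) + head_b
    return 1.0 / (1.0 + math.exp(-z))


def train_head(samples, backbone_w, epochs=200, lr=0.5):
    """Fit only the head by gradient descent on the binary cross-entropy."""
    head_w = [0.0] * len(backbone_w)
    head_b = 0.0
    losses = []
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * len(head_w)
        grad_b = 0.0
        loss = 0.0
        for x, y in samples:
            f = extract_features(x, backbone_w)
            p = predict(f, head_w, head_b)
            loss -= y * math.log(p + 1e-12) + (1 - y) * math.log(1 - p + 1e-12)
            for i, fi in enumerate(f):
                grad_w[i] += (p - y) * fi   # gradient flows to the head only
            grad_b += p - y
        head_w = [w - lr * g / n for w, g in zip(head_w, grad_w)]
        head_b -= lr * grad_b / n
        losses.append(loss / n)
    return head_w, head_b, losses


# Toy "new target data": two inputs per sample, binary label.
samples = [([1.0, 0.0], 1), ([0.9, 0.1], 1), ([0.0, 1.0], 0), ([0.1, 0.9], 0)]
backbone_before = [row[:] for row in PRETRAINED_W]
head_w, head_b, losses = train_head(samples, PRETRAINED_W)
print(losses[0], losses[-1])
```

Only `head_w` and `head_b` change during training, so far fewer parameters are fitted than when the whole network is trained, which is why the approach completes in a shorter time and copes better with small datasets.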

Results
YOLOv3 was retrained via transfer learning on the image data for 12 types of static objects to detect road facilities. The resulting model accurately detected the objects when applied to general road footage that was not used for training, as shown in Fig. 5. As shown in Fig. 5(a), the straight arrow marking on the road surface and the straight and right turn arrow markings were detected accurately. Moreover, Fig. 5(b) shows that the maximum speed limit sign and two traffic lights were detected. Figure 6(a) shows the result of detecting the crosswalk-ahead warning sign and traffic lights, whereas Fig. 6(b) shows that the speed limit road surface marks and traffic lights were detected accurately.
Figure 7 shows the performance of the object detection model, quantified by the average precision (AP), for the 12 classes of road surface markings, signs, and traffic lights listed in Table 1. AP summarizes both precision and recall; it is computed as the area under the precision-recall curve. In object detection, precision is the fraction of detections whose class matches the ground truth, and recall is the fraction of ground-truth objects in the image that are detected; both are equally important. Although a tradeoff exists between these two scores, considering them together indicates whether the model generates relevant results in proportion to the number of predictions. Figure 7(a) shows the 12 classes used as ground truth in 2836 test image files, as well as the number of objects per class. Figure 7(b) shows the AP for each class and the mean average precision (mAP). When evaluating the mAP, the evaluation criteria may differ depending on the number of objects present in the image and the difficulty in distinguishing between objects.
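As a rough illustration of how AP and mAP are obtained from ranked detections, the following sketch computes the area under a precision-recall curve. The all-point interpolation used here is one common convention and is an assumption, as is the premise that each detection has already been marked as a true or false positive (for example, by an overlap test against the ground truth); the evaluation protocol of this study may differ.

```python
def precision_recall_points(detections, num_ground_truth):
    """Build precision and recall lists from (confidence, is_true_positive) pairs.

    Detections are ranked by confidence; each prefix of the ranking gives one
    point on the precision-recall curve.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)
    return precisions, recalls


def average_precision(precisions, recalls):
    """Area under the precision-recall curve (all-point interpolation)."""
    mrec = [0.0] + list(recalls) + [1.0]
    mpre = [0.0] + list(precisions) + [0.0]
    # Replace each precision with the maximum precision at any higher recall.
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    ap = 0.0
    for i in range(1, len(mrec)):
        ap += (mrec[i] - mrec[i - 1]) * mpre[i]
    return ap


def mean_average_precision(ap_per_class):
    """mAP is the unweighted mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Toy example: three detections of one class, two ground-truth objects.
detections = [(0.9, True), (0.8, False), (0.7, True)]
precisions, recalls = precision_recall_points(detections, num_ground_truth=2)
ap = average_precision(precisions, recalls)
print(ap)
```

In the toy example, the false positive ranked second lowers the precision at full recall to 2/3, giving an AP of 5/6; averaging such per-class values over all classes yields the mAP.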
In this evaluation, the road facility object detection model of this study achieved an mAP of 56.56, which is similar to that of the original YOLOv3 (mAP of 57.9), while detecting road facilities in real time.
The AP values for each class in Fig. 7(b) show that, although most road facility objects were detected accurately, certain road surface markings and the traffic light class showed low performance. In most cases, the inferior detection performance was due to annotated objects that were small and difficult to observe even with the naked eye. In particular, the detection performance for road surface markings was worse than that for signs because the markings are distorted by perspective, which makes the shape of a distant marking difficult to discern.

Conclusions
Rapid and accurate updates of HD maps are required to realize safer autonomous driving. To support such updates, we studied a method for automatically recognizing objects in a road environment. A deep-learning-based object detection model was constructed to detect, from road driving footage, the objects managed in the HD map. Using transfer learning with images containing 12 types of road facilities, the model achieved an mAP of 56.56. Because the model detects objects in real time using a single camera, its detections can be compared with the existing HD map to determine whether a change has occurred. Applying this method to a crowdsourcing framework would allow updated road environment information to be delivered simultaneously to many vehicles on the road. Because state-of-the-art object detection algorithms are developing rapidly, we plan to conduct further studies using more advanced algorithms to improve the detection of road facilities.