YOLOv3 Object Detection Algorithm with Feature Pyramid Attention for Remote Sensing Images

In object detection in remote sensing images, owing to the complex background environment, there are problems of poor robustness to interference and low detection accuracy for small objects. The algorithm proposed in this paper combines the attention mechanism with the spatial pyramid structure to improve the You-only-look-once algorithm version 3 (YOLOv3), and it also includes the pyramid attention module to improve the performance of the detection model. The feature pyramid attention module is introduced into deep features, and the feature pyramid attention structure is combined with global context information to better learn object features. The global attention upsampling module is introduced into low-level features, and the global information provided by global pooling is used as a guide to select low-level features. The object detection model can more fully acquire the features of important information and selectively suppress irrelevant features, thereby improving the detection accuracy of the algorithm. To verify the performance of the proposed algorithm, it is used to detect airplanes, storage tanks, ships, baseball diamonds, and running tracks in remote sensing images, and its performance is compared with that of other algorithms. Experiments prove that the proposed algorithm has better detection performance and can improve the detection accuracy of each object in remote sensing images.


Introduction
Object detection is an important research problem in the fields of computer vision and image processing, and it has been a research hotspot in theory and application in recent years. It has important application value in both military and civilian fields. Remote sensing technology is widely used in crop monitoring, (1) environmental change and disaster monitoring, resource exploration, and military reconnaissance. Therefore, the application of object detection to the field of remote sensing has important research value. However, remote sensing images are more complex and changeable than natural image scenes, and their objects vary widely in scale, which creates many challenges in detecting objects of different scales in remote sensing images. The detection of remote sensing objects in complex scenes therefore has high research significance. (2) In recent years, with the rapid development of deep learning, deep-learning-based detectors have achieved excellent results. For example, YOLOv4 adopts a backbone network with a larger receptive field and more parameters, adds an SPP module to increase the receptive field, and uses a path aggregation network (PANet) (22) for multichannel feature fusion, adding a series of tuning techniques to achieve higher accuracy and speed in real-time object detection.
Compared with natural images, remote sensing images have more complex backgrounds and more interference, which place higher demands on the detection performance of algorithms. Migrating a detection network designed for natural images to detect objects in optical remote sensing images gives unsatisfactory results. Object detection in remote sensing images faces two problems. One is that the detection performance is poor when the object and the background are similar. The other is that there are many small objects in remote sensing images; since small objects contain less information, missed detections and false detections are more serious, and it is thus necessary to improve the detection of small objects. In response to these two problems, many researchers have proposed methods involving the design of fusion modules and the addition of attention mechanisms to optimize algorithms and improve model performance. Although these methods introduced feature fusion to improve the detection accuracy of small objects, the information between feature layers was not fully utilized in the fusion process, and the amount of calculation was increased. To improve the performance of the detector, the attention mechanism was introduced to address complex backgrounds, but the improvement in performance was not obvious. In response to the above problems, building on the advantages of the YOLOv3 algorithm, we add a pyramid attention module to improve the feature extraction capability of the network, fuse pyramid features of different scales, enhance the extraction of object features, and further improve the detection accuracy of the algorithm for small targets and its robustness to background interference. We use this algorithm to detect airplanes, oil tanks, ships, baseball diamonds, and running tracks in remote sensing images, and compare its performance with that of other algorithms.

Principle of YOLO algorithms
The YOLOv1 algorithm converts the object detection problem into a regression problem. A picture input into the detection network directly yields the position coordinates of the object bounding box and the object category, thereby achieving end-to-end detection and avoiding lengthy processing procedures. The YOLOv1 algorithm first resizes the picture to 448 × 448 and then divides it into S × S cells; the cell containing the center of an object is responsible for predicting the confidence, category, and location of that object. The YOLO algorithm uses GoogLeNet as the backbone network, comprising 24 convolutional layers and two fully connected layers. The convolutional layers extract features from the image, and the fully connected layers output object category probabilities and coordinates.
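The grid responsibility rule described above can be sketched in a few lines (an illustrative Python snippet; the function name and the normalized-coordinate convention are our own assumptions, not part of the original implementation):

```python
def responsible_cell(cx, cy, S=7):
    """Return the (row, col) of the S x S grid cell responsible for an
    object whose center is at (cx, cy), normalized to [0, 1)."""
    return int(cy * S), int(cx * S)
```

For example, an object centered at (0.5, 0.1) falls in cell (0, 3) of the 7 × 7 grid; only that cell's predictions are matched against this object during training.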
The YOLOv2 algorithm uses a new network structure, Darknet-19, and is based on YOLOv1. It introduces the anchor mechanism of Faster-RCNN and uses higher resolution images while adding fine-grained features and optimization strategies such as batch standardization and dimensional clustering to improve the speed and accuracy of detection by the algorithm.
The YOLOv3 algorithm, which is based on YOLOv2, has further improved performance. It adopts the Darknet-53 structure with a deeper network, and adds residual modules to the network to better extract object features. Owing to the overlap of some categories (such as woman and person), multilabel classification with logistic classifiers is used instead of Softmax. To improve the detection accuracy for small objects, it upsamples and fuses feature maps of multiple scales. The following is a detailed introduction to the network structure and multiscale detection.

1) Darknet-53
Darknet-53 adopts the idea of the ResNet (23) network and adds residual modules to the network, where 1, 2, 8, 8, and 4 are the numbers of repeated residual modules, and each residual module consists of two convolution layers and a residual layer. The entire network structure has no pooling layer; the downsampling operations are instead performed by convolution layers with a stride of 2, each of which halves the size of the feature map. The specific network structure is shown in Table 1.

2) Multiscale detection
The features learned at the bottom of the network are simple and intuitive, and their geometric contour and position information is rich, which is beneficial for object positioning and small-object detection. The higher the level, the less geometric detail and position information remain, the more abstract and global the learned features become, and the richer the semantic information is, which is suitable for large-object detection and complex-object classification. Therefore, YOLOv3 uses multiscale detection to detect objects on multiple levels of feature maps, as shown in Fig. 1.
As shown in the figure, after the 79th layer, a 32-fold downsampling prediction result is obtained after the convolution operation. The scale size is 13 × 13, the downsampling multiple is high, and the receptive field of the feature map is relatively large, which is suitable for detecting larger objects. The result of the 79th layer is combined with the result of the 61st layer through upsampling, and then the prediction result of 16-fold downsampling is obtained through the convolution operation. The scale size is 26 × 26, with a medium-scale receptive field, which is suitable for detecting medium-scale objects. The result of the 91st layer is upsampled and combined with the result of the 36th layer. After the convolution operation, an 8-fold downsampling result is obtained. The scale size is 52 × 52, and the receptive field is the smallest, which is suitable for detecting small objects.
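The three grid sizes quoted above follow directly from the downsampling strides of the detection heads; as a minimal sketch (the function name and the default input size of 416 are our assumptions):

```python
def yolov3_grid_sizes(input_size=416):
    """Grid sizes of the three YOLOv3 detection heads for a square input:
    strides 32, 16, and 8 give 13 x 13, 26 x 26, and 52 x 52 for a 416 input."""
    return [input_size // stride for stride in (32, 16, 8)]
```

The same relation holds for other input sizes divisible by 32, e.g. a 608 × 608 input yields 19 × 19, 38 × 38, and 76 × 76 grids.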

Attention mechanism
In essence, the attention mechanism is similar to the human selective visual attention mechanism and is a model that simulates the attention mechanism of the human brain. It can be seen as a combination function, by which the probability distribution of attention is calculated to highlight the impact of a key input on the output. The core goal of the attention mechanism is to select more critical information for the current task goal from a large amount of information and give it a higher weight.
Specifically, as shown in Fig. 2, the attention mechanism model maps an input X = (x1, x2, ..., xn) to an output Y = (y1, y2, ..., ym). In the model, the encoder transforms the input sequence X into an intermediate semantic representation C = f(x1, x2, ..., xn) through a nonlinear transformation, and the decoder then generates each output element from C and the previously generated outputs. The implementation process is shown in Fig. 3. Imagine the constituent elements in the source as a series of <Key, Value> data pairs. Given a certain element Query in the target, by calculating the similarity or correlation between the Query and each Key, one obtains the weight coefficient of the Value corresponding to each Key; each Value is then weighted and summed to obtain the final attention value. In essence, the attention mechanism performs a weighted summation of the Values of the elements in the source, where the Query and Keys are used to calculate the weight coefficients of the corresponding Values.
The specific realization of the attention mechanism can be expressed as

Attention(Query, Source) = Σ_{i=1..L_x} Similarity(Query, Key_i) · Value_i,

where L_x = ||Source|| represents the length of the source. Conceptually, attention is still understood as selectively extracting a small amount of important information from a large amount of information and focusing on it while ignoring the unimportant information. The focusing process is reflected in the calculation of the weight coefficients: the larger the weight, the more focus is placed on the corresponding Value. That is, the weight represents the importance of the information, and the Value is the corresponding information.
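The Query-Key-Value computation described above can be sketched in a few lines of Python (dot-product similarity and a softmax normalization of the scores are our illustrative choices; the text does not fix a particular similarity function):

```python
import math

def attention(query, keys, values):
    """Score each key against the query, softmax-normalize the scores into
    weights, and return the weighted sum of the values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]  # dot-product similarity
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]        # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
```

With keys [[1, 0], [0, 1]] and values [[10, 0], [0, 10]], a query [1, 0] scores the first key higher, so the output is pulled toward the first value, exactly the "focusing" behavior described above.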

Object Detection Algorithm Combined with Attention Mechanism
The YOLOv4 algorithm uses a variety of tuning strategies. Although it performs well in terms of accuracy and speed, its network structure is more complex, while the YOLOv3 algorithm structure is relatively simple and flexible and is suitable for remote sensing images with a large amount of data. However, YOLOv3 is more effective for natural image detection; the background environment of remote sensing images is more complicated, so the network hierarchy of the YOLOv3 algorithm is not directly applicable. The fusion method adopted by YOLOv3 only indirectly integrates the low-level and high-level semantic information and loses much semantic information. In addition, the background of remote sensing images is more complicated and has a greater interference effect; when YOLOv3 detects remote sensing images in which the object is similar to the background, the detection performance is poor. To resolve the above problems, we combine the attention mechanism with the spatial pyramid structure based on YOLOv3 to improve the model's robustness to background interference. The network structure has five main parts: input, backbone network, pyramid attention module, prediction, and output. The network structure is shown in Fig. 4, and each module is introduced in detail next.

1) Feature pyramid attention module
This module combines the attention mechanism with the pyramid convolution. The attention mechanism increases the weight of the part with the object information and obtains the output with attention. At the same time, the pyramid convolution structure uses convolution kernels with different sizes (3 × 3, 5 × 5, and 7 × 7) to represent different receptive fields, which can solve the problem of different objects and different scales. Compared with channel attention, this module has richer pixel-level information. The combination with the pyramid structure produces better pixel-level attention applied to deep-level features and improves the detector's robustness to background interference, thereby improving detection accuracy.
As shown in Fig. 5, after the high-level features are extracted, the pooling operation is no longer performed. Instead, the higher-level semantics are realized through three continuous convolutions. The higher-level semantics will be closer to the real coordinate situation and pay more attention to the object. Therefore, the higher-level semantics is used as a kind of attention guide.
To obtain the output result, the original feature map is subjected to a 1 × 1 convolution operation and linearly superimposed with the operation result of the pyramid feature fusion module. This method strengthens the characteristics of the desired target through the attention mechanism and improves the target's robustness to interference. At the same time, the pyramid convolution structure adopts convolution kernels of different sizes, which represent different receptive fields, realizes multiscale detection, and improves the detection accuracy of small target objects. The high-level feature resolution is small, and the use of a large convolution kernel will not significantly increase the computational burden.
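The data flow of the feature pyramid attention module can be sketched for a single channel as follows. This is an illustrative sketch only: learned convolution weights are replaced by simple averaging kernels and the 1 × 1 projection is taken as the identity, so it shows the structure (pyramid branches of 3 × 3, 5 × 5, and 7 × 7 receptive fields, pixel-level attention multiplication, and the shortcut addition), not the trained behavior.

```python
def conv2d_same(x, k):
    """Naive 'same'-padded single-channel 2D convolution; k is an odd-sized
    square kernel. Stands in for one pyramid branch."""
    n, r = len(x), len(k) // 2
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    if 0 <= i + di < n and 0 <= j + dj < n:
                        s += x[i + di][j + dj] * k[di + r][dj + r]
            out[i][j] = s
    return out

def pyramid_attention(x):
    """Average the 3x3 / 5x5 / 7x7 branches into a pixel-level attention
    map, multiply it with the input, and add the shortcut."""
    n = len(x)
    att = [[0.0] * n for _ in range(n)]
    for size in (3, 5, 7):  # averaging kernels stand in for learned weights
        k = [[1.0 / (size * size)] * size for _ in range(size)]
        branch = conv2d_same(x, k)
        for i in range(n):
            for j in range(n):
                att[i][j] += branch[i][j] / 3.0
    return [[x[i][j] * att[i][j] + x[i][j] for j in range(n)] for i in range(n)]
```

Because the attention map is computed per pixel, every location gets its own weight, which is the pixel-level attention contrasted with channel attention above.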

2) Global attention upsampling module
This module can not only more effectively adapt to feature mapping at different scales, but also provide guidance information for low-level feature mapping in a simple way, so as to select more accurate resolution information. In addition, this module uses the extraction of global context information of high-level features to guide the weighting of the information of low-level features. This process also does not significantly add to the computational burden.
As shown in Fig. 6, we use high-level features as a guide: global pooling of the high-level features produces weights, which are applied to the low-level features so that the bottom and high levels are consistent, and the weighted low-level features are then combined with the high-level features. In this way, a new fusion of levels is carried out while keeping the computational complexity low. Specifically, a 3 × 3 convolution is used for channel processing of the low-level features, the globally pooled information is then used for weighting to obtain the weighted low-level features, and these are added to the upsampled deep-level information. To improve the feature extraction capability of the network, the feature pyramid attention module extracts different levels of nonlinear information through the proposed attention mechanism, and the pyramid structure extracts feature information of different sizes and increases the pixel-level receptive field. The global attention upsampling module guides the underlying features and selects more accurate resolution information. The two modules fuse the information extracted from the high- and low-level features to improve the robustness to interference and the detection capability for small objects.
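The data flow of the global attention upsampling module can be sketched for a single channel as follows (nearest-neighbor 2x upsampling and a single scalar pooled weight are simplifying assumptions; the real module works per channel with learned convolutions):

```python
def global_attention_upsample(low, high):
    """Global-average-pool the high-level map into a weight, scale the
    low-level map with it, nearest-neighbor upsample the high-level map 2x,
    and add the two. low is n x n, high is (n//2) x (n//2)."""
    n, m = len(low), len(high)
    gap = sum(sum(row) for row in high) / (m * m)   # global average pooling
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            out[i][j] = low[i][j] * gap + high[i // 2][j // 2]  # weight + upsample + add
    return out
```

For example, a 2 × 2 low-level map [[1, 2], [3, 4]] guided by a 1 × 1 high-level map [[10]] yields [[20, 30], [40, 50]]: every low-level pixel is reweighted by the global context before the high-level information is added back.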

1) Design and matching of anchor boxes
For the design of the anchor boxes, K-means clustering is used to obtain the sizes of the anchor boxes. Three anchor boxes are set for each scale, and nine sizes of anchor boxes are obtained by clustering. Larger anchor boxes of 116 × 90, 156 × 198, and 373 × 326 are matched on the smallest 13 × 13 feature map, with which larger objects are detected. Medium-size anchor boxes of 30 × 61, 62 × 45, and 59 × 119 are matched on the medium-size 26 × 26 feature map, with which medium-size objects are detected. On the larger 52 × 52 feature map, smaller anchor boxes of 10 × 13, 16 × 30, and 33 × 23 are matched, with which smaller objects are detected. Each cell corresponds to three anchor boxes. The anchor box having the largest IOU with the ground truth box, together with its corresponding bounding box, is used to predict the object.
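The matching rule — assign each ground truth box to the anchor with the largest IOU, computed on widths and heights alone as in the clustering distance 1 - IOU — can be sketched as follows (the standard YOLOv3 anchor sizes are used; the helper names are ours):

```python
def wh_iou(box, anchor):
    """IOU between two boxes aligned at a common origin (width/height only),
    as used both in the k-means distance 1 - IOU and in anchor matching."""
    bw, bh = box
    aw, ah = anchor
    inter = min(bw, aw) * min(bh, ah)
    union = bw * bh + aw * ah - inter
    return inter / union

ANCHORS = [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
           (59, 119), (116, 90), (156, 198), (373, 326)]

def best_anchor(box):
    """Index of the anchor responsible for predicting the given (w, h) box."""
    return max(range(len(ANCHORS)), key=lambda i: wh_iou(box, ANCHORS[i]))
```

A 110 × 95 box is matched to the 116 × 90 anchor on the 13 × 13 map, while a 12 × 14 box is matched to the 10 × 13 anchor on the 52 × 52 map, reproducing the scale assignment described above.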

2) Prediction mechanism
The direct prediction method is adopted to predict the offset of the center point of the bounding box relative to the upper left corner of the corresponding cell. After learning the offsets, the anchor box coordinates originally given by the network can be fine-tuned by regression to gradually approach the ground truth and obtain the coordinates of the prediction box. The coordinates can be expressed as

b_x = σ(t_x) + c_x,
b_y = σ(t_y) + c_y,
b_w = p_w e^{t_w},
b_h = p_h e^{t_h},

where t_x, t_y, t_w, and t_h are the four offsets: t_x and t_y are the predicted coordinate offsets, and t_w and t_h are the scale offsets; c_x and c_y are the coordinates of the upper left corner of the corresponding cell; p_w and p_h are the width and height of the matched anchor box; σ(·) is the sigmoid function; and b_x, b_y, b_w, and b_h are the coordinates of the prediction box.
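The decoding step can be written directly from these formulas (an illustrative sketch; (pw, ph) is the matched anchor's size in the same grid units as the output box):

```python
import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode network offsets into box center and size: the sigmoid keeps
    the center inside the predicting cell, and the exponential scales the
    anchor prior (pw, ph)."""
    sigmoid = lambda t: 1.0 / (1.0 + math.exp(-t))
    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)
    bh = ph * math.exp(th)
    return bx, by, bw, bh
```

With zero offsets, the box center sits at the middle of cell (cx, cy) and the box keeps the anchor's size, which is the starting point that regression then fine-tunes toward the ground truth.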

3) Loss function
The loss function is used to measure the quality of a set of parameters by comparing the difference between the network output and the real output. It is mainly used to increase the accuracy of object positioning and classification.
The loss function includes three parts: the bounding box positioning error, confidence error, and classification error:

L = L_box + L_obj + L_cls.

Among them, the bounding box positioning error adopts the complete intersection over union (CIOU) loss, which considers not only the overlap area, but also the center point distance and the aspect ratio; the confidence error and classification error adopt the cross-entropy loss function. Here, L_box represents the positioning error of the bounding box, which is the difference between the coordinates obtained by the anchor box when predicting the bounding box and the real coordinates. L_obj represents the confidence error, which is calculated using the cross-entropy loss and represents the probability that the prediction box contains a target. L_cls represents the classification error; when the bounding box determines that there is a target in the current box, the classification loss is calculated. The positioning error of the bounding box is

L_box = 1 - IOU + d^2/c^2 + αv,

where d represents the Euclidean distance between the center points of the prediction box and the ground truth box, and c represents the diagonal length of the smallest rectangle enclosing the prediction box and the ground truth box. v is a parameter used to measure the consistency of the aspect ratio and α is a trade-off parameter, which are calculated as follows:

v = (4/π^2) [arctan(w_gt/h_gt) - arctan(w/h)]^2,
α = v / [(1 - IOU) + v].

Here, w_gt and h_gt represent the width and height of the ground truth box, w and h represent the width and height of the prediction box, and

IOU = I/U,

where I and U represent the areas of the intersection and union of the prediction box and the ground truth box, respectively. The cross-entropy loss is calculated as

H(p, q) = -Σ_x p(x) log q(x),

where p represents the true distribution and q represents the predicted distribution.
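The CIOU positioning loss can be sketched for axis-aligned boxes as follows (an illustrative implementation; the (x1, y1, x2, y2) corner format is our assumption):

```python
import math

def ciou_loss(pred, gt):
    """CIOU loss = 1 - IOU + d^2/c^2 + alpha * v for boxes (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    # IOU term
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    union = (px2 - px1) * (py2 - py1) + (gx2 - gx1) * (gy2 - gy1) - inter
    iou = inter / union
    # squared center distance d^2 over squared enclosing-box diagonal c^2
    d2 = ((px1 + px2 - gx1 - gx2) ** 2 + (py1 + py2 - gy1 - gy2) ** 2) / 4.0
    c2 = (max(px2, gx2) - min(px1, gx1)) ** 2 + (max(py2, gy2) - min(py1, gy1)) ** 2
    # aspect-ratio consistency term v and trade-off alpha
    v = (4 / math.pi ** 2) * (math.atan((gx2 - gx1) / (gy2 - gy1))
                              - math.atan((px2 - px1) / (py2 - py1))) ** 2
    alpha = v / ((1 - iou) + v) if v > 0 else 0.0
    return 1 - iou + d2 / c2 + alpha * v
```

Perfectly overlapping boxes give a loss of 0; shifting a box adds both an IOU penalty and a center-distance penalty, which is what distinguishes CIOU from plain IOU loss.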
Cross-entropy loss is used to evaluate the difference between the current training probability distribution and the true distribution, and reducing it improves the prediction accuracy of the model. From the cross-entropy loss function, the confidence error is

L_obj = -Σ_{i=0..S^2} Σ_{j=0..B} I_ij^obj [ĉ_i^j log(c_i^j) + (1 - ĉ_i^j) log(1 - c_i^j)],

where S^2 represents the number of grids, B represents the number of anchor boxes generated by each grid, and I_ij^obj indicates whether the jth anchor box of the ith grid is responsible for predicting the target: if it is responsible, then I_ij^obj = 1; otherwise, it is 0. c_i^j is the predicted probability that the target object is contained in the prediction box, and ĉ_i^j is the ground truth value, determined on the basis of whether the jth anchor box of the ith grid is responsible for predicting an object: if it is responsible, then ĉ_i^j = 1; otherwise, ĉ_i^j = 0. Similarly, from the cross-entropy loss function, the classification error is

L_cls = -Σ_{i=0..S^2} Σ_{j=0..B} I_ij^obj Σ_{c ∈ classes} [p̂_i^j(c) log(p_i^j(c)) + (1 - p̂_i^j(c)) log(1 - p_i^j(c))],

where p_i^j(c) and p̂_i^j(c) are the predicted and ground truth probabilities of class c.
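The confidence error reduces to a binary cross-entropy summed over the S^2 × B predictions; a minimal sketch (the list-of-lists layout, with one row per grid cell and one column per anchor, is our assumption):

```python
import math

def confidence_loss(pred_conf, resp):
    """Binary cross-entropy objectness loss. pred_conf[i][j] is the predicted
    objectness of anchor j in grid cell i; resp[i][j] is 1 when that anchor
    is responsible for an object, else 0."""
    loss = 0.0
    for row_pred, row_resp in zip(pred_conf, resp):
        for c, r in zip(row_pred, row_resp):
            loss -= r * math.log(c) + (1 - r) * math.log(1 - c)
    return loss
```

A responsible anchor predicting a high objectness incurs a small loss, while a low prediction is penalized heavily, driving the network toward confident detections.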

Experiments and Analysis
We validated the proposed algorithm through experiments. The experiments were run under the Ubuntu 16.04 operating system on a computer with an NVIDIA GeForce GTX 1070 Ti graphics card, with CUDA 10.0 and cuDNN 7.1 installed for GPU acceleration.
The TensorFlow deep learning framework was configured on the basis of Anaconda 3.6. We used Darknet-53 as the network framework, selected remote sensing data sets for the experiments, and compared the proposed algorithm with other object detection algorithms. To further verify the detection performance of the algorithm, a test set with more small targets was selected, and the algorithm was compared with the Faster-RCNN and YOLOv3 algorithms. Better detection results were obtained from the detected images: compared with other algorithms, the detection accuracy of the proposed algorithm is higher, especially for small target objects, with the detection accuracy above 90% and the highest accuracy being 99%.

Data sets
The data sets used were NWPU VHR-10, RSOD-Dataset, and DOTA, from which we selected a total of 1860 images containing airplanes, ships, storage tanks, baseball diamonds, and running tracks. We labeled the image data and converted the data into the format of the VOC data set. Finally, we randomly divided the samples into training, validation, and test sets at a ratio of 6:2:2. Targets with an area less than 32 × 32 pixels were considered small targets, those with an area between 32 × 32 and 96 × 96 pixels were considered medium-size targets, and those with an area greater than 96 × 96 pixels were considered large targets. The specific data set distribution is shown in Table 2. We adopted data enhancement, rotation, cropping, and other operations to increase the amount of data.
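The size buckets and the 6:2:2 split described above can be sketched as follows (the helper names and the fixed random seed are our own choices):

```python
import random

def size_category(w, h):
    """Bucket a target by area, following the thresholds used for Table 2."""
    area = w * h
    if area < 32 * 32:
        return "small"
    if area <= 96 * 96:
        return "medium"
    return "large"

def split_indices(n, seed=0):
    """Randomly split n sample indices into training/validation/test at 6:2:2."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    a, b = n * 6 // 10, n * 8 // 10
    return idx[:a], idx[a:b], idx[b:]
```

For the 1860 images used here, this yields 1116 training, 372 validation, and 372 test samples.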

Experimental results
We used the proposed algorithm to train and test the data set, and some of the detection results obtained are shown in Figs. 7-11. Figure 7 shows the detection of airplanes and storage tanks. The sizes of the airplanes in the picture are different, and the storage tanks are densely arranged. Figure 8 shows the detection of airplanes. In the picture, the distribution of the airplanes is scattered. Figure 9 shows the detection of ships and oil tanks. The storage tanks are arranged very densely and their scale is small. Figure 10 shows the detection of ships, which are small and relatively long and narrow, with some ships having a similar color to the background. Figure 11 shows the detection of running tracks and baseball diamonds, which are large and clear targets.

Accuracy evaluation
The evaluation indicators used are the average precision (AP) and mean average precision (mAP). mAP is used to measure the average detection accuracy over multiple types of target; the higher the mAP, the higher the comprehensive performance of the model over all categories. AP and mAP are given by

AP = ∫_0^1 P(R) dR,
mAP = (1/N) Σ_{i=1..N} AP_i,

where P(R) is the precision at recall R and N is the number of categories. The precision-recall (P-R) curves of each category and the mAP are respectively shown in Figs. 12 and 13.
The average precision measures the accuracy of the detection algorithm from the two perspectives of recall and precision. It is an intuitive standard for evaluating the accuracy of the detection model and can be used to analyze the effectiveness of detecting a single category. The calculation formulas for recall and precision are respectively

Recall = TP / (TP + FN),
Precision = TP / (TP + FP),

where TP, FP, and FN are the numbers of true positives, false positives, and false negatives, respectively. The detection speed is

FPS = N_test / T_time,

where N_test is the number of samples in the test set and T_time is the time taken to test the test set. The running speed, AP, and mAP are calculated for each category. As shown in Table 3, the proposed algorithm achieves an average accuracy of 94.06%, and the detection accuracy for each type of target is above 90%: the lowest detection accuracy is 92.21% and the highest is 96.83%. The lowest detection speed is 28 FPS and the highest is 33 FPS, showing the high detection performance of the algorithm.
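The recall, precision, and AP computations can be sketched as follows (the all-point interpolation of the P-R curve is one common convention; the paper does not specify which variant it uses):

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, and
    false-negative counts."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(points):
    """Area under the P-R curve from (recall, precision) pairs sorted by
    increasing recall, with precision interpolated to the maximum value at
    any higher recall."""
    ap, prev_r = 0.0, 0.0
    for i, (r, _) in enumerate(points):
        p_interp = max(p for _, p in points[i:])
        ap += (r - prev_r) * p_interp
        prev_r = r
    return ap
```

For instance, a detector reaching precision 1.0 at recall 0.5 and precision 0.5 at recall 1.0 has an AP of 0.75; averaging the per-category AP values then gives the mAP.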

Comparative experiment
To demonstrate the superior detection performance of the proposed algorithm, it is compared with other algorithms; the results are shown in Table 4. From the data in the table, it can be seen that the proposed algorithm has the highest detection performance, with an accuracy 8.76 percentage points higher than that of the Faster-RCNN algorithm and a speed 7 FPS higher. Compared with the YOLOv3 algorithm, the accuracy is improved by 1.56 percentage points and the speed is increased by 2 FPS.
To further verify that the proposed algorithm has high detection accuracy for small targets and is robust to interference, images with dense small targets and backgrounds similar to the targets were selected for testing; the results are shown in Fig. 14.
It can be seen that the proposed algorithm has the highest rate of correct detection and the lowest rates of false detection and missed detection. To illustrate the stable performance of the algorithm, we give the following examples.
The images in Fig. 15 show the detection results of airplanes. The pink boxes indicate detected airplanes and the blue boxes indicate missed detections. It is found that Faster-RCNN and YOLOv3 missed objects. When detecting a dense arrangement of airplanes, as in Fig. 15, Faster-RCNN and YOLOv3 failed to detect smaller airplanes. The proposed algorithm had no missed detections and showed higher accuracy. Figure 16 shows the detection results of storage tanks and ships, where the yellow boxes represent the detection results of ships, the green boxes represent the detection results of storage tanks, and the red boxes represent missed detections. For a dense arrangement of storage tanks, it was found that larger storage tanks were correctly detected by all three algorithms. However, Faster-RCNN and YOLOv3 failed to detect some of the smaller tanks. The proposed algorithm had no missed detections and showed higher accuracy. This experiment shows that the proposed algorithm has higher detection accuracy for small targets.

Selection of different levels of feature pyramid attention modules
The feature pyramid attention module integrates features of different scales through a U-shaped structure, and the pyramid convolution structure uses convolution kernels of different sizes (3 × 3, 5 × 5, 7 × 7, and 9 × 9). In the selection process, after many analyses and experiments, it was found that a three-layer convolution operation can more accurately merge the adjacent-scale features between the upper- and lower-layer features and improve the feature extraction capability of the network. In the experiment, convolutional structures with different numbers of layers were constructed, named build-1 (3 × 3), build-2 (3 × 3, 5 × 5), build-3 (3 × 3, 5 × 5, 7 × 7), and build-4 (3 × 3, 5 × 5, 7 × 7, 9 × 9), and we conducted experiments on these structures to find the most suitable model for remote sensing image detection.
The experimental results are shown in Table 5. It can be seen from the experimental data that the best experimental results were obtained when build-3 was added. At the same time, it was found that more convolution kernel layers are not necessarily better. When saturation is reached, the effect of feature extraction is not further improved.

Choice of loss function
When evaluating the performance of object detection, the IOU is used to evaluate the overlap rate of the prediction box and the ground truth box, which reflects the effectiveness of detection. However, the IOU only considers the change in the overlapping area, not the change in the nonoverlapping area or the change in size; a higher overlap ratio does not necessarily mean a more accurate prediction box. This evaluation method reduces the positioning accuracy of the prediction box. Therefore, there will be a large number of overlapping prediction boxes during the detection process, similar to the mutual occlusion of objects in natural images. When encountering densely distributed objects, the overlap phenomenon is more serious, which causes objects to be missed and reduces the detection recall rate. To improve the detection accuracy, the bounding box positioning error in the loss function is changed to the CIOU loss, which considers not only the overlap area, but also the center point distance and aspect ratio. The problem of largely overlapping prediction boxes is thus alleviated, and the detection accuracy is improved.

Limitations
The algorithm in this paper includes the feature pyramid attention module, so that the object detection model can more fully obtain the features of important information and selectively suppress irrelevant features. This improves the detection performance: it not only improves the accuracy of small-object detection, but also alleviates the problem of background interference. However, it is still necessary to improve the real-time performance of the algorithm and further improve the efficiency of processing remote sensing data.

Conclusion
Through the analysis of existing object detection algorithms, this paper addresses the high computational complexity and low efficiency of traditional pyramid models. Through information screening, we integrate the attention mechanism with the pyramid model and improve the feature extraction ability of the network with almost no increase in the amount of calculation, thereby improving the detection accuracy of the algorithm. Specifically, the algorithm combines the attention mechanism with the feature pyramid on the basis of the YOLOv3 algorithm. We add the pyramid attention module, which mainly includes the feature pyramid attention module and the global attention upsampling module. The feature pyramid attention module is introduced into deep-level features and combined with global context information to better learn object features. The global attention upsampling module is introduced into low-level features, and the global information provided by global pooling is used as a guide to select low-level features. Finally, the filtered low-level features and high-level features are combined to improve the detection accuracy of the model for small objects and its robustness to background interference. To verify the effectiveness of the algorithm, we compared it with other algorithms and demonstrated its superior performance. We also verified its detection accuracy for small objects through the analysis of false detections, missed detections, and the accuracy rate. The proposed algorithm improves the detection accuracy of each object in remote sensing images, thus improving the detection performance. At the same time, we found that combining an RPN network with a one-stage algorithm can also play an important role in object detection research; we will experiment with and analyze this in follow-up work.