Hyperspectral Image Classification Based on Visible–Infrared Sensors and Residual Generative Adversarial Networks

1China University of Geosciences (Beijing), 100083 China 2Guilin Tourism University, 26 Liang Feng Road, Yanshan District, Guilin, Guangxi 541006, China 3Department of Automatic Control Engineering, Feng Chia University, Taichung 40724, Taiwan 4China Center for Resources Satellite Data and Application, Beijing 100094, China 5International School of Technology Management, Feng Chia University, Taichung 40724, Taiwan


Introduction
The continuous development of imaging technology has enabled us to obtain hyperspectral images (HSIs) with more bands and more continuous spectral features than were previously attainable. Such images play an important role in remote sensing tasks such as feature extraction, multivariate data analysis, land cover classification, fine-grained vegetation classification, and animal monitoring.
Depending on the reference standards adopted, HSI classification methods can be divided into supervised, semi-supervised, and unsupervised methods. (1,2) Unsupervised classification requires no pre-existing knowledge of the hyperspectral data and exploits only the differences among the target hyperspectral data. Hyperspectral data are higher-dimensional and larger in size than ordinary optical image data, so they require special processing. (3)(4)(5) The lack of a priori information obscures the meaning of unsupervised classification results, as shown in Refs. 6-9. Supervised classification, in contrast, is based on prior knowledge of the target and reference criteria that determine the categories of the unsampled data.
Owing to its high precision, supervised classification is often preferred in HSI classification. Traditional supervised methods, such as polynomial logistic regression (PLR) and the support vector machine (SVM), are widely used in HSI classification because they can handle large input spaces. However, their performance usually degrades when training samples are limited, and a large number of samples is required to achieve high prediction accuracy. Classifying hyperspectral data by supervised methods is therefore challenging. Considering the limited availability of labeled training samples for HSI classification, researchers have proposed semi-supervised and active learning algorithms.
With the rapid development of pattern recognition, deep learning algorithms have been widely used in HSI classification. Powerful deep learning models can effectively combine spatial and spectral information and avoid complicated manual feature engineering by automatically extracting an effective feature representation of the problem domain, namely, HSI classification. (10) Graham extracted HSI features for image classification using an autoencoder. (10) Lin et al. proposed a new method using convolutional neural networks (CNNs) for HSI classification. (11) Similarly, distributed CNN technologies using the A3pviGrid architecture were proposed and run by Amaldas and coworkers. (12)(13)(14) Goodfellow et al. were the first to be inspired by zero-sum game theory to develop the original generative adversarial network (GAN). (15) The GAN framework consists of two antagonistic networks: a generative network (G) and a discriminative network (D). To improve the performance of the standard WGAN, Gulrajani added a penalty term on the gradient norm of the WGAN input. Although GANs have accurately classified HSIs, most GAN-based studies have focused on the effects of the spatial and spectral domains on the GAN model without considering the training stability or the ability of the network structure to learn complex hyperspectral features. (16) Zhan proposed a novel semi-supervised algorithm for the classification of hyperspectral data by training a customized GAN for hyperspectral data. The GAN constructs an adversarial game between a discriminator and a generator. (17) A large number of novel GAN-based models have been proposed for HSI classification. (18) A hyperspectral GAN (HSGAN) framework automatically extracts the spectral features in HSI classification tasks. When an HSGAN is trained with unlabeled hyperspectral data, the generator produces hyperspectral samples similar to authentic ones.
The features of the discriminator are then available for classifying hyperspectral data from a small number of labeled samples using an HSGAN. A GAN is structurally similar to a CNN and has enabled rapid progress in computer vision, in which the development of distributed processing methods for neural networks has been a recent trend. Arjovsky introduced a new GAN called the Wasserstein GAN (WGAN), which efficiently minimizes an approximated Wasserstein distance. (18) Radford presented a novel network architecture called the deep convolutional GAN that enhances the training stability and the quality of the generated outcomes. (19) In addition, the training time of GAN-based hyperspectral classification models is difficult to estimate, and in the HSGAN, the gradient often vanishes during training. To alleviate these issues of HSGAN classification, we propose a dense residual GAN (ResGAN) for HSI classification tasks. Our generative network (GN) includes a memory mechanism (MM) that boosts the GN performance (and hence the ResGAN performance) using a dense residual unit (DRU). The experimental results confirmed the improved test accuracy and visualization results of ResGAN. (15)(16)(17) The contributions of this paper are as follows. First, we briefly introduce previous GAN-based methods and residual learning. Second, we describe the proposed ResGAN. Third, we provide details of experiments in which we compare the performance of ResGAN with that of two HSI classification methods. Finally, we draw conclusions through a discussion.

Residual network
Classification is a popular method for mining the rich information in HSIs. Deep learning methods have thus far achieved good classification results in image processing, but they are prone to overfitting. Residual networks (ResNets) can alleviate overfitting: an identity map added between the input and the output enables easy parameter optimization and the extraction of feature information.
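As a minimal NumPy sketch of the identity map described above (an illustration, not the paper's actual architecture), a residual block adds its input unchanged to a transformed branch, so the block reduces to the identity when the branch weights are zero, which is what eases parameter optimization:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, w2):
    """Two linear transforms with an identity skip: y = x + F(x)."""
    f = relu(x @ w1) @ w2   # residual branch F(x)
    return x + f            # identity map added between input and output

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w1 = rng.normal(size=(8, 8)) * 0.1
w2 = rng.normal(size=(8, 8)) * 0.1
y = residual_block(x, w1, w2)

# With zero weights the residual branch vanishes and the block is the identity.
assert np.allclose(residual_block(x, np.zeros((8, 8)), np.zeros((8, 8))), x)
```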
In this research, we introduce a ResNet model for HSI classification, optimized by batch normalization, which reduces the dependence of the network on the initial parameters and improves the generalization of the model. To reduce the effect of limited training samples on classification accuracy, the model generates dummy samples. When tested on two different HSIs, the method demonstrated its potentially broad applicability to HSI classification. Figure 1 shows the architecture of the HSI GAN network. Figure 2 shows the structure of the residual network for noisy spectral image classification.

GAN model
Goodfellow and coworkers (15)(16)(17) were the first to be inspired by zero-sum game theory to develop the original GAN. GAN-based HSI classification takes a 1D noise vector as the input of network G and generates a vector approximating the real spectral data through two fully connected and two convolution operations. The purpose is to deceive the discriminative network D, which is trained to distinguish reconstructed images from real ones.
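A hedged NumPy sketch of this generator pipeline (the layer sizes are illustrative assumptions, not those of the paper): a 1D noise vector passes through two fully connected layers and two 1D convolutions to produce a vector of the same length as a real spectrum:

```python
import numpy as np

rng = np.random.default_rng(1)

def dense(x, w, b):
    return np.maximum(0.0, x @ w + b)          # ReLU-activated dense layer

def conv1d_same(x, k):
    return np.convolve(x, k, mode="same")      # 1D 'same' convolution

# Hypothetical sizes: a 30-dim noise vector mapped to a 200-band spectrum,
# mirroring the 200 retained bands of the Indian Pines data.
z = rng.normal(size=30)
w1, b1 = rng.normal(size=(30, 100)) * 0.1, np.zeros(100)
w2, b2 = rng.normal(size=(100, 200)) * 0.1, np.zeros(200)
k1, k2 = rng.normal(size=5) * 0.2, rng.normal(size=5) * 0.2

h = dense(dense(z, w1, b1), w2, b2)                  # two fully connected layers
fake_spectrum = conv1d_same(conv1d_same(h, k1), k2)  # two convolution operations
assert fake_spectrum.shape == (200,)                 # same length as a real spectrum
```

The discriminator D would then score `fake_spectrum` against real spectra; only the data flow is shown here, not the adversarial training loop.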

Our proposed architecture
Referring to Fig. 3, we first describe the devised GN in ResGAN and then the discriminative network (DN). Finally, we introduce the modified loss function of ResGAN based on the Wasserstein GAN with gradient penalty (WGAN-GP), a GAN that uses the Wasserstein loss formulation plus a gradient norm penalty to achieve Lipschitz continuity.

Feature extraction
The features are extracted by two convolutional layers that operate as

F_{FE,1} = g(W_{FE,1} * I_L + B_{FE,1}),
F_{FE} = g(W_{FE,2} * F_{FE,1} + B_{FE,2}),

where W_{FE,1} and W_{FE,2} represent n_{FE,1} convolution kernels of size c × k_{FE,1} × k_{FE,1} and n_{FE,2} convolution kernels of size n_{FE,1} × k_{FE,2} × k_{FE,2}, respectively. c denotes the number of channels of the input image I_L, k_{FE,1} and k_{FE,2} are the spatial sizes of the convolution filters, and B_{FE,1} and B_{FE,2} represent the biases. The '*' operator performs a convolution operation, g(.) is the activation function, and F_{FE} is the output of the feature extraction, which is input to the DRUs. In this paper, the activation function g(.) is the parametric rectified linear unit (PReLU), expressed as

g(x) = max(0, x) + a_t min(0, x),

where a_t represents a learnable parameter and t denotes the iteration time. When the parameters of the network are updated in reverse, a_t is updated as

Δa_{t+1} = μ Δa_t − ε ∂L/∂a_t,  a_{t+1} = a_t + Δa_{t+1},

where μ, ε, and L represent the momentum, learning rate, and loss function, respectively.
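The PReLU activation defined above can be sketched in NumPy as follows (the learnable slope a_t is shown as a plain argument rather than a trained parameter; its momentum update is indicated only in the comment):

```python
import numpy as np

def prelu(x, a):
    """Parametric ReLU: identity for positive inputs, learnable slope a for negative."""
    # Equivalent to max(0, x) + a * min(0, x)
    return np.where(x > 0, x, a * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
out = prelu(x, 0.25)
assert np.allclose(out, [-0.5, -0.125, 0.0, 1.5])

# During training, a would be updated with momentum, as in the text:
#   delta_a <- mu * delta_a - eps * dL/da;  a <- a + delta_a
```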

Dense residual units
We employ d DRUs, each with the architecture shown in Fig. 3. The operation of the kth DRU (DRU_k) is as follows.
The kernels and biases of the three successive convolutional layers (indexed by 1, 2, and 3) are represented by W_{k,1} to W_{k,3} and B_{k,1} to B_{k,3}, respectively. S_{k,1} to S_{k,3} denote their corresponding weighted-sum layers, and D_{k,1} to D_{k,3} denote the corresponding outputs of their forward convolutional layers. D_k denotes the output of DRU_k.
The outputs of the preceding convolutional layers in each DRU (blue lines in Fig. 3) are admitted into the posterior convolutional layers, forming the short-term memory. The preceding outputs of the DRUs (red and purple lines in Fig. 3) are admitted into the later layers, similarly forming the long-term memory. The outputs of the former DRUs and convolutional layers are directly connected to the later layers. This configuration not only reduces the number of feed-forward features but also extracts the local dense features. Together, these connections realize the MM. When the previous DRU and all the convolutional layers are admitted into the later layers, the number of features must be reduced to ease the burden on the network. For this purpose, we apply weighted-sum layers S_{k,1} to S_{k,3}, which adaptively learn the specific weight of each memory and decide the amounts of long-term and short-term memories to be saved. S_{k,1} to S_{k,3} in DRU_k are operated by a local decision function.
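The weighted-sum memory decision can be sketched as follows (a NumPy illustration with a softmax weighting, which is one plausible choice of local decision function; the paper does not specify its exact form):

```python
import numpy as np

def weighted_sum(features, logits):
    """Adaptively weight and merge memorized feature maps (the MM decision)."""
    w = np.exp(logits) / np.exp(logits).sum()     # normalized memory weights
    return sum(wi * f for wi, f in zip(w, features))

rng = np.random.default_rng(2)
f_short = rng.normal(size=(8, 8))   # short-term memory: previous conv layer output
f_long = rng.normal(size=(8, 8))    # long-term memory: output of an earlier DRU
merged = weighted_sum([f_short, f_long], np.array([0.0, 0.0]))

# Equal logits give each memory a 0.5 share of the merged features.
assert np.allclose(merged, 0.5 * f_short + 0.5 * f_long)
```

In the actual network the logits would be learned, letting each DRU decide how much long-term versus short-term memory to save.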

Residual learning
Recently, residual networks have achieved great success in computer vision tasks ranging from low level to high level. In this study, the full potential of residual networks is realized by employing local residual learning (LRL) and global residual learning (GRL). The whole operation of the residual learning part is formulated as

R_{ws} = S_{RL}(D_1, ..., D_d),
R_{ws,1} = g(W_{RL,1} * R_{ws} + B_{RL,1}),
R = I_L + R_{ws,1},

where S_{RL} denotes the weighted-sum layer, W_{RL,1} and B_{RL,1} represent the kernel and bias of the convolutional layer, respectively, and D_1 to D_d represent the successive outputs of the d DRUs. R_{ws}, R_{ws,1}, and R denote the outputs of the weighted-sum, convolutional, and element-wise sum layers, respectively, in the residual learning part. LRL is performed between a DRU and a weighted-sum layer, whereas GRL is implemented between the input image I_L and the element-wise sum layer (Fig. 3). The weighted-sum layer S_{RL} extracts the hierarchical features obtained from the previous DRUs through LRL and decides their proportions in the subsequent features. S_{RL} is operated by a global decision function that compares S_{k,1} with S_{k,3} in DRU_k. The features are further exploited by the convolutional layer W_{RL,1}, and the combined LRL and GRL improve the GN performance and reduce the overfitting risk.
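The combined LRL and GRL data flow can be sketched in NumPy as follows (the convolutional layer W_{RL,1} is omitted for brevity, and the softmax weighting for S_{RL} is an illustrative assumption):

```python
import numpy as np

def global_residual(i_l, dru_outputs, mix_logits):
    """GRL: element-wise sum of the input image and the fused DRU features."""
    w = np.exp(mix_logits) / np.exp(mix_logits).sum()  # S_RL weighted sum (LRL)
    r_ws = sum(wi * d for wi, d in zip(w, dru_outputs))
    return i_l + r_ws                                   # global skip from I_L (GRL)

rng = np.random.default_rng(3)
i_l = rng.normal(size=(16, 16))                 # input feature map
d_outs = [rng.normal(size=(16, 16)) for _ in range(3)]  # outputs of d = 3 DRUs
r = global_residual(i_l, d_outs, np.zeros(3))
assert r.shape == (16, 16)
```

If every DRU output were zero, the part would pass I_L through unchanged, which is the property that reduces the overfitting risk of the deep GN.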

Discriminative network
Our model differs from the DN of the HSGAN in two respects. First, the last sigmoid layer is replaced with a leaky ReLU layer.
The discriminative model in the HSGAN performs a true/fake binary classification, whereas the DN in ResGAN approximates the Wasserstein distance between the classified objects. Second, we remove the batch normalization (BN) layers from the DN and impose the gradient penalty on each sample individually. The overall architecture of the DN is shown in Fig. 4. To improve the stability of GAN training, WGAN-GP enforces a soft version of the Lipschitz constraint by penalizing the gradient norm of random samples x̂ ∼ P_x̂.
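The WGAN-GP penalty can be sketched as follows (NumPy, with a toy linear critic whose gradient is known in closed form; λ = 10 is the value commonly used in the WGAN-GP literature, not necessarily this paper's setting):

```python
import numpy as np

def gradient_penalty(critic_grad_fn, real, fake, rng, lam=10.0):
    """WGAN-GP term: penalize (||grad D(x_hat)|| - 1)^2 at random interpolates."""
    eps = rng.uniform(size=(real.shape[0], 1))
    x_hat = eps * real + (1.0 - eps) * fake   # samples along real-fake lines
    grads = critic_grad_fn(x_hat)             # dD/dx at each interpolate
    norms = np.linalg.norm(grads, axis=1)
    return lam * np.mean((norms - 1.0) ** 2)

# Toy linear critic D(x) = x . w, whose gradient is the constant vector w.
w = np.array([3.0, 4.0])                      # ||w|| = 5
grad_fn = lambda x: np.tile(w, (x.shape[0], 1))
rng = np.random.default_rng(4)
real = rng.normal(size=(8, 2))
fake = rng.normal(size=(8, 2))
gp = gradient_penalty(grad_fn, real, fake, rng)
assert np.isclose(gp, 10.0 * (5.0 - 1.0) ** 2)   # = 160 for this critic
```

Because the penalty is computed per interpolated sample, batch normalization (which couples samples within a batch) is removed from the DN, as noted above.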

Experimental setting
The classification performance of the proposed ResGAN was compared with that of two established HSI classification methods (HSGAN and a residual neural network). The performance of the methods was measured by three popular indexes: the overall accuracy (OA, the probability that an individual sample is correctly classified), the average accuracy (AA, obtained by summing the accuracies of all classes and dividing the result by the number of classes), and the Kappa coefficient (Kappa, a reliability index of the agreement between ratings). All experiments were executed in TensorFlow and accelerated by operating six NVIDIA GTX 1080 Ti GPUs, each with 11 GB of memory, in parallel; owing to the size of the dataset and the computational complexity of the network, the setup was made scalable within the limited GPU memory available. The entire training was completed in approximately four days. All experiments used a spatial window of 5 × 5, a batch size of 128, and a learning rate of 0.001 for all epochs.
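The three indexes can be computed from a confusion matrix as follows (a NumPy sketch with a toy two-class matrix, not the paper's results):

```python
import numpy as np

def classification_scores(cm):
    """OA, AA, and Kappa from a confusion matrix (rows = true, cols = predicted)."""
    n = cm.sum()
    oa = np.trace(cm) / n                          # overall accuracy
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))     # mean of per-class accuracies
    pe = (cm.sum(axis=0) @ cm.sum(axis=1)) / n**2  # expected chance agreement
    kappa = (oa - pe) / (1.0 - pe)                 # agreement beyond chance
    return oa, aa, kappa

cm = np.array([[45, 5],
               [10, 40]])
oa, aa, kappa = classification_scores(cm)
assert np.isclose(oa, 0.85)
assert np.isclose(aa, 0.85)
assert np.isclose(kappa, 0.70)
```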

Description of datasets
The experiments to evaluate the performance were conducted on two hyperspectral datasets: the classical hyperspectral dataset of Indian Pines and a dataset of HSIs of Pavia University (Italy).
The first dataset was gathered over the Indian Pines test site in northwestern Indiana by the airborne visible/infrared imaging spectrometer (AVIRIS) sensor (Fig. 5). After removing the water absorption bands, the image (145 × 145 pixels) consisted of 200 spectral bands. The spectral coverage was 0.4 to 2.5 μm and the spatial resolution was 20 m.
The second dataset comprised a Pavia University image acquired by the reflective optics system imaging spectrometer (ROSIS) sensor (Fig. 6). After removing the junk bands, a sub-image of 610 × 340 pixels with 103 spectral bands was retained for analysis. The spatial resolution of the image was approximately 1.3 m. The ground truth of the Pavia University image was differentiated into nine land use classes.

Experimental results for Indian Pines dataset
The training set was obtained by randomly selecting 10% of the samples in each class of the Indian Pines dataset, and the remaining samples were reserved for testing. Table 1 shows the numbers of training and test samples and presents the results for each category; the OA, AA, and Kappa values over all classes are the last three values. The results confirm that the feature extraction ability of ResGAN is higher than those of the residual network and HSGAN methods. Figure 7 shows that the ResGAN model improves the classification accuracy on the classical Indian Pines dataset compared with the other methods, particularly in dense boundary regions. This shows that the model significantly improves the classification accuracy of small numbers of high-dimensional samples.

Experimental results for Pavia University dataset
The training set was obtained by randomly selecting 10% of the samples in each class of the Pavia University dataset, and the remaining samples were reserved for testing. Table 2 shows the numbers of training and test samples, as well as the classification accuracies and the related standard deviations of the three algorithms, namely, HSGAN, Residual, and ResGAN. Figure 8 presents the visual classifications of the three methods on the Pavia University dataset. It shows that ResGAN has better applicability than the other methods for this urban landscape hyperspectral dataset with relatively large spatial coverage and rich feature types.

Conclusions
We presented a novel classification method based on ResGAN, which handles the classification of HSI data lacking prior knowledge. Using visualized maps of the generative model, we confirmed that ResGAN obtains features from unlabeled data. Further fine adjustment would yield a high-performance classifier requiring only a few labeled samples. We experimentally compared the performance of ResGAN with that of other HSI classification methods, and the proposed method outperformed the other two.

About the Authors
Hui-Wei Su is the dean of the School of Tourism Data, Guilin Tourism University. His research interests are in artificial intelligence, surveying and mapping science and technology, and computer image processing. He has more than 30 projects entrusted by enterprises and institutions, and has published 20 high-level academic papers and 3 academic monographs.
Ri-hui Tan graduated from Guangxi University with bachelor's and master's degrees in automation from the School of Electrical Engineering. After graduation, she worked as a lecturer in the Tourism Data College of Guilin Tourism University. She is mainly engaged in big data and artificial intelligence involving the Guangxi tourism economy and in work related to tourism data.

IFaS, Trier University, Germany. He has been actively involved in publishing several research papers and periodicals in top international conferences and journals. His research and teaching interests include big data in biomedical imaging, machine/deep learning, AI, high-performance grid computing, bioinformatics, applied social and cognitive psychology in education, game dynamics, material flow management, renewable energy systems, and policy and decision making. Avinash has studied a diverse range of disciplines in engineering and social sciences and is fluent in a wide range of scholarly domains specializing in higher education and engineering. He serves as an editor and steering committee member on numerous research entities and regularly contributes reviews to the Journal of Supercomputing. He recently received a grant for supercomputing research from the Ministry of Science and Technology, Taiwan, for his work on the X-ray detection of COVID-19 using supercomputing.