Genetic-algorithm-based Convolutional Neural Network for Robust Time Series Classification with Unreliable Data

Finding robust solutions to time series classification problems using deep neural networks has received wide attention. However, unreliable data make classification very difficult, and traditional deep neural networks cannot effectively handle problems with strong noise. In this paper, we propose a hybrid convolutional neural network (CNN) model combined with a genetic algorithm (GA) for time series classification (TSC) with unreliable data. Although network structure optimization is an NP-hard problem, we design a GA for network structure optimization to obtain a robust CNN structure. Several benchmark and actual datasets are adopted to test the effectiveness of the proposed GA-based CNN. The numerical results show that our approach performs better than other state-of-the-art models.


Introduction
Time series classification (TSC) is one of the most important problems in machine learning and data mining. The target of TSC is to discover a classification model that identifies the data characteristics, where the data are a series of observations indexed in time order. TSC problems arise in a wide range of fields including natural language processing, image processing, scheduling, logistics, medicine, and health. For more extensive explanations of the various TSC problems, the reader is referred to previous reviews. (1,2) Research on TSC has been ongoing for decades. There are three categories of approaches for TSC, i.e., distance-based classification, feature-based classification, and support vector machine (SVM) and model-based classification. (3) For distance-based classification, Faloutsos et al. proposed a Euclidean-distance-based approach, which uses a predefined similarity measure for TSC. (4) Yi and Faloutsos discussed various distance-based measurement strategies and improved the Euclidean distance measurement strategy with common Lp-norms to form extensions of the Euclidean distance. (5) Frentzos et al. proposed a dissimilarity metric (DISSIM) to measure the spatiotemporal dissimilarity between two similar time series. (6) The above approaches are lock-step measures. Although they are widely adopted, Berndt and Clifford (7) and Keogh and Ratanamahatana argued that the Euclidean-distance-based measurement strategy and its extensions were insufficiently robust as similarity measures. Thus, they proposed dynamic time warping (DTW), a classic speech recognition tool, which provides a better match with another time series through the compression of the comparable time series. Chen et al. presented the edit distance with real sequence (EDR), in which the distance is quantified according to a threshold parameter given in advance. (9) Chen and Ng proposed the edit distance with real penalty (ERP), in which DTW and EDR are combined and a constant reference point is set to compute the distance. (10) Vlachos et al. studied the longest common subsequence (LCSS), which provides a strategy for constraining the matching of two points. (11) Similarly, Morse and Patel developed a sequence-weighted alignment model (Swale) for TSC. (12) DTW, EDR, ERP, LCSS, and Swale are examples of elastic measures. In addition to lock-step measures and elastic measures, there are also threshold-based measures, i.e., threshold queries (TQuEST), (13) and pattern-based measures, such as the spatial assembling distance (SpADe). (14) A summary of distance-based measures is listed in Table 1. Although various variants have been studied, they remain distance-based or edit-distance-based measures. Such measures may work well for simple TSC with low-dimensional data; however, they still have difficulty with complex TSC.
For feature-based classification, basic approaches such as decision trees and neural networks, together with various feature selection methods, have been studied and adopted to classify feature vectors. The sequence classification of TSC is then solved by transforming the sequence according to the results of feature selection. (15) Chuzhanova et al. proposed a gamma-test-based feature selection method for the sequence classification problem. (16) Ji et al. studied an approach that mines distinguishing subsequences satisfying gap constraints. (17) Nanopoulos et al. presented a feature selection approach for TSC with the help of a multilayer perceptron (MLP) neural network. (18) Yoon et al. studied a novel unsupervised method based on common principal component analysis to select suitable features. (19) The key factor in feature selection is the criteria used to select the features. Eads et al. stressed that the most difficult and important part of feature selection is choosing appropriate features, which typically involves a trade-off between manual selection and the help of domain experts. (20) Moreover, it is difficult to design good features that capture the intrinsic properties embedded in various time series data. Therefore, the accuracy of feature-based methods is usually lower than that of sequence-distance-based ones, particularly 1-nearest neighbor (1-NN) with DTW. On the other hand, although 1-NN and DTW have been used in many studies, both require too much computation for many real-world applications. (7) In fact, the techniques mentioned above usually depend on handcrafted features that require researchers to have sufficient professional knowledge and practical experience. Furthermore, even with a large investment of time and labor, there may still be unavoidable bias during classification. Thus, in recent years researchers have focused on effective approaches based on data mining tools.
As a supervised learning model, SVM is a kind of generalized linear classifier for binary classification. In recent years, many variants of the basic SVM have been applied to TSC. Kampouraki et al. investigated the potential benefit of Gaussian-kernel SVM for heartbeat TSC. (21) Eads et al. proposed an algorithm called Zeus for TSC, which employs evolutionary computation for feature extraction and SVM for classification. (22) Alalshekmubarak and Smith replaced the linear readout function with a radial basis function kernel and proposed a novel algorithm that combines SVM and an echo state network for TSC. (23) Rodríguez and Alonso combined SVM with a boosting algorithm to analyze interval features for time series classification. (24) On the basis of a temporal extension of discrete SVMs, Orsenigo and Vercellis proposed a new algorithm that benefits from a warping distance and a softened variable margin. (25) Although SVM has strong ability and flexibility in data mining for various applications, its results are difficult to interpret, and users gain little knowledge beyond the classification result itself, especially for kernel-based methods.
With the increased availability of time series data, the effectiveness of TSC faces enormous challenges. Recently, deep learning (DL) has successfully been applied in various classification tasks. This is because DL can learn a hierarchical feature representation from data automatically instead of preparing the features manually. The following are some typical DL approaches for TSC.
The MLP is a common feedforward artificial neural network model. It maps multiple input datasets to a single output dataset and adjusts its parameters through the error back-propagation algorithm. Fawaz et al. studied several state-of-the-art DL algorithms for TSC and proposed an open-source DL framework for the TSC community. (26) Zheng et al. employed a DL framework with improved feature learning techniques to solve multivariate time series classification. (27) Nanopoulos et al. constructed an improved MLP-based approach for multivariate time series. (28) Batres-Estrada applied a DL framework to multivariate financial time series and demonstrated the effectiveness of the MLP in TSC. (29) A recurrent neural network (RNN) can describe dynamic temporal behavior: unlike feedforward neural networks, which only accept inputs of fixed structure, an RNN passes states within its own network and can therefore accept a wider range of time series inputs. The main purpose of an RNN is to process and predict sequence data; it models sequences such that the current output depends not only on the current input but also on previous outputs. Because of its network structure, an RNN remembers previous information and uses it to influence the output of subsequent nodes.
Although RNNs have many advantages, there is an obvious problem: long-term dependence. As a kind of RNN, long short-term memory (LSTM) can learn long-term dependence information and has a gate mechanism to control the flow and loss of features to avoid the long-term dependence problem. Since LSTM was proposed by Hochreiter and Schmidhuber in 1997, (30) many researchers, such as Felix Gers and Fred Cummins, have contributed to the modern LSTM, and a complete system for LSTM has been formed. Owing to its unique design structure, LSTM is suitable for processing and predicting important events with very long intervals and delays in time series. Lipton et al. gave a critical review of RNNs, including LSTM, for sequence learning. (34)

XGBoost is an open-source software library that provides a gradient-boosting framework. From the project description, it aims to provide a "scalable, portable and distributed gradient boosting (GBM, GBRT, GBDT) library." In addition to running on a single machine, it also supports distributed processing frameworks such as Apache Hadoop, Apache Spark, and Apache Flink. It has recently gained much popularity and attention, as it was the algorithm of choice for many winning teams in a number of machine learning competitions. Zheng et al. proposed a short-term load forecasting method using EMD-LSTM neural networks with an XGBoost algorithm for feature importance evaluation. (35) Chen et al. proposed a radar emitter classification for large datasets based on weighted XGBoost. (36)

The convolutional neural network (CNN) is a variant of the MLP, which grew out of the early research of the biologists Hubel and Wiesel on the cat visual cortex. The first CNN, LeNet-5, was proposed by LeCun and Bottou in 1998. (37) As a kind of feedforward neural network with a deep structure, a CNN contains convolution calculations and is one of the representative algorithms of DL. Morabito et al. generated suitable sets of features with the help of the representational power of a CNN. (38) Zheng et al. proposed a novel DL framework for multivariate TSC. (39) Yang et al. proposed a systematic feature learning method for the human activity recognition (HAR) problem that adopts a deep CNN to automate feature learning from the raw inputs in a systematic way. (40) Researchers have found that the cooperation and combination of a CNN and LSTM can give good performance. Zhou et al. proposed a combination of a CNN and LSTM for text classification. (41) Wang et al. proposed a beyond-frame-level CNN, a saliency-aware 3D CNN with LSTM, for recognition. (42) Wu and Prasad proposed a novel algorithm based on a CNN with the help of LSTM for hyperspectral data classification. (43)

However, time series datasets are often mixed with strong noise. For example, noisy environments reduce the effectiveness of natural language processing, and water droplets falling on a camera on rainy days can reduce the accuracy of video detection. At present, network structures are mostly designed for a class of problems rather than for a particular dataset. As a result, the performance of a network may vary greatly across different datasets for the same kind of problem. To address this, we adjust the network structure according to the characteristics of the dataset during the training process to compensate for this limitation of traditional learning models. In this paper, we combine a genetic algorithm (GA) with a CNN and propose a hybrid model (GACNN), in which the CNN is trained for a certain number of epochs, and then its structure is adjusted by the GA.
The rest of the paper is organized as follows. In Sect. 2, we present the problem description and the model of TSC. Our proposed GACNN is introduced in Sect. 3. Section 4 presents numerical experiments, and Sect. 5 presents the final conclusion.

Description of Problem
TSC can be defined as a classification problem over a series of data indexed in time order. Three kinds of time series are commonly defined: a single (univariate) time series, a multivariate time series, and a dataset of time series. (2) In the training phase, the objective of TSC is to find the correspondence between windows {S_i:w} and K feature classes using a learning model. Here, we set a probability distribution over the K classes, with each label value y_j ∈ [1, K], j = 1, 2, ..., K. An illustration of TSC is shown in Fig. 1. Given a sequence of values in a time series dataset D, values at multiple time steps can be grouped to form an input vector (the grouping is generally provided by an expert). Algorithm 1 shows the general implementation process of a TSC learning algorithm. As a motivating application, we consider sleep apnea syndrome (SAS) classification from polysomnogram (PSG) signals; an example of PSG signals is shown in Fig. 2.
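The grouping of values at multiple time steps into input vectors can be sketched as follows. This is a minimal illustration; the window width and step values used here are assumptions (in practice they are provided by an expert):

```python
def sliding_windows(series, width, step=1):
    """Group values at consecutive time steps into fixed-width input vectors
    (the windows S_i:w used as inputs to the learning model)."""
    return [series[i:i + width] for i in range(0, len(series) - width + 1, step)]

# A toy univariate series split into windows of width 3:
windows = sliding_windows([1, 2, 3, 4, 5], width=3)
# windows == [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
```

Each window then receives a class label y_j, and the classifier is trained on the (window, label) pairs.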
In general, a doctor can judge a patient's sleep staging and apnea condition on the basis of the PSG signals. There are six types of apnea conditions, as follows:

H: Hypopnea
HA: Hypopnea with arousal
OA: Obstructive apnea
X: Obstructive apnea with arousal
CA: Central apnea
CAA: Central apnea with arousal

Solving TSC is of great significance for applications such as the above SAS classification by PSG. However, in actual situations, the sampled data are mixed with strong noise, which makes learning the data characteristics and mining features from the samples very difficult. As shown in Fig. 3, there are significant differences in the data under different situations. Therefore, the learning model must be robust to unreliable data. In this paper, we propose a hybrid CNN model combined with a GA for TSC with unreliable data. To obtain a robust CNN structure, we design a GA for network structure optimization.

GACNN
A GACNN aims to find a robust DL model to fit unreliable data. In this paper, the GACNN focuses on TSC problems, where the TSC data are mixed with strong noise. The GACNN considers both the CNN and GA as its basic algorithms and is suitable for TSC and network structure optimization. First, we design and train a full CNN using TSC sampling data. After a certain number of training steps, we adopt the GA to adjust the CNN structure by cutting some connections between neurons. By repeating the above processes alternately, the GACNN can obtain a more efficient DL model. A flowchart for GACNN is shown in Fig. 4, and its overall  algorithm is shown in Algorithm 2. All details regarding the GACNN are introduced separately in the following.
Algorithm 2: GACNN.
1. Begin
2. initialize the CNN;
3. train the CNN for TSC;
4. for each CNN structure optimization step do
5.    adjust the CNN structure by the GA (cut connections between neurons);
6.    train the adjusted CNN;
7. end for
8. fine-tune and return the remaining CNN;
9. End

CNN for TSC
A deep neural network is a composition of L layers, where consecutive layers form a bipartite graph with weighted, directed arcs. Each layer l_i, i ∈ [1, L], contains neurons, takes the output of the previous layer l_{i−1} as its input x, and applies an activation function to compute its output

    a_i = f_i(θ_i, x),

where f_i corresponds to the activation function applied at layer l_i using weighted parameters θ_i and input values x.
The CNN is one of the classical deep neural networks and is most commonly applied in computer vision. Recently, CNNs have also been successfully applied in various other fields, including TSC. (1) To solve TSC problems, the convolution is defined as applying and sliding a filter over the time series; the filter of a CNN treats the time series as one- or multi-dimensional input. A general function applying the convolution at a time stamp t is given as

    C_t = f(X_t^F),

where C_t denotes the convolution applied to the time series dataset D = {(X_t, y_t)} at time stamp t, X_t^F is calculated from X_t using a filter F, and f is an activation function. An example of a CNN structure for TSC with four convolutional layers is illustrated in Fig. 5.
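Sliding a filter over a univariate series can be sketched in a few lines. The kernel values and the ReLU activation below are illustrative assumptions, not the paper's trained parameters:

```python
def conv1d(series, kernel, bias=0.0):
    """Apply and slide a filter over a univariate time series (valid mode,
    no padding): one output value per time stamp the kernel fully covers."""
    k = len(kernel)
    return [
        sum(kernel[j] * series[t + j] for j in range(k)) + bias
        for t in range(len(series) - k + 1)
    ]

def relu(values):
    """A common choice for the activation f applied to the filtered values."""
    return [max(0.0, v) for v in values]

# A moving-difference filter highlights upward changes in the series:
out = relu(conv1d([1.0, 1.0, 4.0, 4.0, 1.0], kernel=[-1.0, 1.0]))
# out == [0.0, 3.0, 0.0, 0.0]
```

A trained CNN learns the kernel values by back-propagation instead of fixing them by hand.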

GA-based CNN structure optimization
GAs have attracted wide attention because of their intelligence, parallelism, robustness, good adaptability, and the capability of global searching. A GA is a generic population-based metaheuristic optimization algorithm that uses some mechanisms inspired by biological evolution: mutation, crossover, and selection. Candidate solutions to the optimization problem play the role of individuals in a population, and the fitness function determines the environment within which the solutions 'live'.
An optimized CNN structure G' can be obtained as G' = (N, W·U), where U = {u_k}, u_k ∈ {0, 1}, is a transition matrix. The structure optimization can be formulated as

    min_U ||G − G'|| + λΩ(U),

where ||·|| represents the loss between the original structure G and the optimized structure G', and λΩ(U) is the penalty function of the transition matrix U. Formally, the objective function used in CNN structure optimization can be written as

    min_U ||G − (N, W·U)|| + λΩ(U).

With the above description, this structure optimization has a computational complexity of O(2^n) and is an NP-hard problem. The procedure of CNN structure optimization is shown in Algorithm 3.
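The effect of the transition matrix U can be illustrated with a toy sketch. Here the loss term is interpreted as the difference between the losses of the original and masked networks, and Ω(U) is taken as the number of kept connections; both choices and the toy numbers are assumptions, since the paper leaves these terms abstract:

```python
def apply_mask(weights, mask):
    """Element-wise product W * U: mask[k] == 0 removes connection k."""
    return [w * u for w, u in zip(weights, mask)]

def objective(loss_original, loss_masked, mask, lam=0.1):
    """|| G - G' || + lambda * Omega(U), with the loss term interpreted as
    the gap between the original and masked losses, and Omega(U) as the
    number of kept connections (both are assumptions for illustration)."""
    return abs(loss_original - loss_masked) + lam * sum(mask)

w = [0.5, -1.2, 0.8, 0.3]
u = [1, 0, 1, 1]           # drop the second connection
masked = apply_mask(w, u)  # [0.5, 0.0, 0.8, 0.3]
score = objective(loss_original=1.0, loss_masked=1.2, mask=u, lam=0.1)
```

The GA searches over the 2^n possible masks U, which is why the problem is NP-hard.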

Representation
How to encode a CNN structure into a chromosome is a key issue for GAs. In this paper, the main objective is to determine the optimal combination of {u_k ∈ {0, 1}}, and the representation is carried out using binary strings. A gene v_k(w_ij) can be seen as a switch that controls the existence of the connection from node i to node j. If the state of v_k(w_ij) is 1, then neuron i is connected with neuron j in the CNN; if the state of v_k(w_ij) is 0, then the connection between neuron i and neuron j is deleted from the CNN. An example of a representation is shown in Fig. 6, and a decoded result is shown in Fig. 7.
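Decoding a binary chromosome back into connection switches can be sketched as below; the connection names w_ij are hypothetical placeholders for the arcs of an actual network:

```python
def decode(chromosome, connections):
    """Map a binary string onto named connections:
    gene '1' keeps connection w_ij, gene '0' deletes it."""
    return {c: int(g) for c, g in zip(connections, chromosome)}

conns = ["w_11", "w_12", "w_21", "w_22"]
kept = decode("1011", conns)
# kept == {"w_11": 1, "w_12": 0, "w_21": 1, "w_22": 1}
```

The decoded dictionary plays the role of the transition matrix U applied to the weights.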

Evolution
To search for an effective network structure, we evaluate a representation with the following evolution function:

    F(U) = α||G − G'|| + λΩ(U),

where α and λ are adaptive weight parameters with adjustable weights, used to adjust the search direction between the loss function and the network structure.
The weights are defined as follows, where r is a nonnegative random number in the interval [0, 1].
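A hedged sketch of such an evolution function is given below. Since the weight formula is not fully specified here, the adaptive scheme α = r, λ = 1 − r with r drawn from [0, 1] is an assumption used only for illustration:

```python
import random

def evolution_fitness(loss, structure_penalty, r=None):
    """Evolution function F = alpha * loss + lambda * structure_penalty.
    The adaptive weights alpha = r, lambda = 1 - r are an assumption;
    r is a nonnegative random number in [0, 1]."""
    if r is None:
        r = random.random()
    alpha, lam = r, 1.0 - r
    return alpha * loss + lam * structure_penalty

# With r fixed at 0.7 the trade-off weighs the loss term more heavily:
f = evolution_fitness(loss=0.4, structure_penalty=0.9, r=0.7)
# 0.7 * 0.4 + 0.3 * 0.9 ≈ 0.55
```

Randomizing r over generations lets the search alternate between favoring accuracy and favoring a smaller structure.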

Mutation
Mutation is a genetic operator that produces spontaneous random changes in various chromosomes. In the GA, mutation serves the crucial role of either (a) restoring genes lost from the population during the selection process so that they can be tried in a new context or (b) providing genes that were not present in the initial population. Mutation selects a gene at random; if the gene is 1 (or 0), it is flipped to 0 (or 1). An example of mutation is shown in Fig. 8.
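The single-gene flip described above can be sketched as:

```python
import random

def mutate(chromosome, rng):
    """Select one gene at random and flip it: '1' -> '0' or '0' -> '1'."""
    i = rng.randrange(len(chromosome))
    flipped = "0" if chromosome[i] == "1" else "1"
    return chromosome[:i] + flipped + chromosome[i + 1:]

rng = random.Random(7)
parent = "110010"
child = mutate(parent, rng)  # differs from the parent in exactly one gene
```

In the GACNN setting, flipping a gene toggles one connection between two neurons on or off.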

Overall procedure of GA
The implementation procedure of the GA is described in Algorithm 4, where P(t) and C(t) are the parents and offspring, respectively, in the current generation t.

Algorithm 4: GA.
1. initialize P(t) with randomly generated structures U;
2. fitness eval(P) by evolution function;
3. while not termination condition do
4.    randomly pair the individuals in P(t);
5.    for each pair do
6.        compare fitness;
7.        select the better individual as offspring C(t);
8.    end for
9.    apply mutation to C(t);
10.   fitness eval(C) by evolution function;
11. end while
12. output the best individual;
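The overall loop of pairwise comparison, selection of the better individual, and one-gene mutation can be sketched end to end as follows. The toy fitness function, population settings, and the way each pair contributes a winner plus a mutant are illustrative assumptions, not the paper's exact configuration:

```python
import random

def run_ga(fitness, n_genes, pop_size=8, generations=30, seed=1):
    """Pairwise-tournament GA over binary strings (minimizing `fitness`):
    parents are compared in pairs, the better individual survives as
    offspring, and a one-gene mutant of the winner is added."""
    rng = random.Random(seed)
    pop = ["".join(rng.choice("01") for _ in range(n_genes))
           for _ in range(pop_size)]
    for _ in range(generations):
        rng.shuffle(pop)                       # randomly pair the individuals
        offspring = []
        for a, b in zip(pop[0::2], pop[1::2]):
            winner = a if fitness(a) <= fitness(b) else b
            i = rng.randrange(n_genes)         # one-gene mutation of the winner
            mutant = winner[:i] + ("0" if winner[i] == "1" else "1") + winner[i + 1:]
            offspring.extend([winner, mutant])
        pop = offspring
    return min(pop, key=fitness)

# Toy objective: prefer chromosomes with fewer 1s (fewer kept connections).
best = run_ga(lambda c: c.count("1"), n_genes=10)
```

Because the better individual of each pair always survives, the best fitness in the population never worsens between generations.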

Experiments
We apply our GACNN to the MIT-BIH polysomnographic database (44,45) and measured photoplethysmogram (PPG) signals, and compare its performance with six state-of-the-art models: SVM, MLP, LSTM, CNN, CNN+LSTM, and XGBoost. All algorithms are implemented on TensorFlow 1.10 with cuDNN 7.0.

Datasets
The MIT-BIH polysomnographic database is provided by PhysioNet (www.physionet.org). It is a dataset of multiple physiologic signals recorded during sleep. The dataset was recorded in Boston's Beth Israel Hospital Sleep Laboratory, where subjects were monitored to evaluate SAS and to test the effectiveness of continuous positive airway pressure. The dataset contains 4/6/7-channel polysomnographic recordings, with ECG signals annotated beat-by-beat, and EEG and respiration signals annotated with respect to sleep stages and apnea. The dataset consists of 18 records from 16 male subjects, aged 32-56 years (avg. 43 years) and weighing 89-152 kg (avg. 119 kg). In this paper, the sampling frequency of the selected pulse wave signal is 250 Hz. The apnea signal is annotated every 30 s; there are a total of 9602 labels, and the data dimensions are 9602 × 7500.
MIT-BIH with noise synthesis is considered to simulate uncertainty in signal acquisition. We synthesize the dataset slp67x with noise at a sampling frequency of 250 Hz. A set of reference signals is shown in Fig. 9(a), recorded as a 10 s data segment starting from 1 h 17 min. First, we superimpose low-frequency mixed sine and cosine signals on the reference signals to simulate baseline drift noise, called MIT-BIH-dn [Fig. 9(b)]; then we superimpose white Gaussian noise with a 10 dB signal-to-noise ratio and power line interference on the reference signals.

Actual PPG data are recorded from seven males and three females with age 26.20 ± 5.14 years (mean ± std) and BMI 21.79 ± 3.40 (mean ± std). We record multiple sets of PPG signals, including a raw signal with the subject sitting calmly (PPG), a raw signal with the subject moving slightly (PPG-ms), a raw signal acquired under strong noise (PPG-sn), and a raw signal with multiple noises (PPG-mn).
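The noise synthesis can be sketched as follows. The drift frequency, drift amplitude, and the toy reference signal are assumptions made for illustration; only the 10 dB signal-to-noise ratio target follows the text:

```python
import math
import random

def add_baseline_drift(signal, fs, drift_hz=0.3, amp=0.5):
    """Superimpose a low-frequency mixed sine/cosine signal to simulate
    baseline drift (drift frequency and amplitude are assumptions)."""
    return [
        x + amp * (math.sin(2 * math.pi * drift_hz * t / fs)
                   + math.cos(2 * math.pi * 0.5 * drift_hz * t / fs))
        for t, x in enumerate(signal)
    ]

def add_white_noise(signal, snr_db, rng):
    """Add white Gaussian noise scaled to a target signal-to-noise ratio."""
    power = sum(x * x for x in signal) / len(signal)
    noise_power = power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]

rng = random.Random(42)
clean = [math.sin(2 * math.pi * 5 * t / 250) for t in range(2500)]  # 10 s at 250 Hz
noisy = add_white_noise(add_baseline_drift(clean, fs=250), snr_db=10, rng=rng)
```

Power line interference can be superimposed analogously as a 50/60 Hz sinusoid.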

Experimental settings
For CNN and MLP, training is done using mini-batch SGD with a momentum of 0.9, and the batch size is set to 64 for all the networks. Networks are trained for a total of 100 epochs; we start from a learning rate of 0.01 and divide by two every 10 epochs. For MLP, a dropout is applied after every hidden layer, and the dropout rate is set to 0.2.
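The learning-rate schedule stated above can be written as a small helper (a sketch of the stated settings, not the actual training code):

```python
def learning_rate(epoch, base=0.01, drop_every=10):
    """Step schedule from the experimental settings: start at `base`
    and divide the rate by two every `drop_every` epochs."""
    return base / (2 ** (epoch // drop_every))

# epochs 0-9 -> 0.01, epochs 10-19 -> 0.005, epoch 30 -> 0.00125
```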
For LSTM and CNN-LSTM, we train the networks using Adam with a batch size of 64. Both networks are trained for a total of 50 epochs. For SVM and XGBoost, we directly use the raw time series samples as the input. SVM with a linear kernel is used as the classifier, and the penalty C is set to 0.4. For XGBoost, the maximum depth is set to 20 and the maximum number of iterations is set to 30. For GACNN, the training scheme is similar to that of CNN and MLP, using mini-batch SGD with a momentum of 0.9. We apply a GA search at the 30th epoch to adapt the structure of the network once. The population size is set to 50, with evolution for 100 generations.

Results on 2-type classifiers
We perform experiments focusing on 2-type classifiers with the MIT-BIH benchmarks and actual PPG datasets. First, we execute the GACNN and six other algorithms on two conventional datasets: the MIT-BIH 2-classifier (results shown in Table 2) and the PPG 2-classifier (results shown in Table 3). The CNN-based algorithms (CNN, CNN+LSTM, and GACNN) achieve better performance than SVM, MLP, LSTM, and XGBoost, except for CNN+LSTM, which performs worse than MLP on the MIT-BIH 2-classifier. Our GACNN achieves the best performance. Furthermore, after optimizing the CNN network structure, about 66% and 44% of the connections are pruned by the GA on the two datasets, respectively, thus increasing the efficiency of the algorithm. Second, we execute the algorithms on the noisy datasets: the MIT-BIH 2-classifier with noise (results shown in Table 4) and the PPG 2-classifier with noise (results shown in Table 5). Six noise experiments are performed. Again, our GACNN achieves the best performance in all experiments. The results of the experiments are summarized in Fig. 10.

Results on 7-type classifications
Next, we perform experiments focusing on 7-type classification with the actual PPG datasets. First, we execute the GACNN and five other algorithms on the conventional dataset PPG 7-classifier (results shown in Table 6). Then, three noise experiments are performed to compare the effectiveness of the six algorithms under unreliable data conditions (results shown in Table 7). Our GACNN achieves the best performance in all experiments. The results of the experiments are illustrated in Fig. 11.

Conclusion
In this paper, we proposed a hybrid CNN model combined with a GA to find robust solutions to time series classification problems with unreliable data. In the CNN training process, the GA-based network structure adjustment achieved effective and robust time series classification. Although the network structure optimization is an NP-hard problem, our GA-based structure optimization approach showed outstanding performance in solving it. Benchmark and actual datasets were adopted to verify the effectiveness of the proposed GACNN. The numerical results showed that our approach outperformed six state-of-the-art models.