Hybrid Algorithm Based on Simulated Annealing and Bacterial Foraging Optimization for Mining Imbalanced Data

The bacterial foraging optimization (BFO) algorithm simulates the mechanism of natural selection. However, because the direction of reversal is uncertain in the chemotaxis process, it easily falls into a local optimum. We propose a hybrid algorithm based on simulated annealing (SA) and BFO for mining imbalanced data. The key idea is to exploit the advantages of both SA and the BFO algorithm. In the proposed algorithm, SA finds the optimal solution by employing a jump process, so as to resolve the uncertainty of the reversal direction in the chemotaxis process of BFO and avoid falling into a local optimum. SA is used to improve the chemotaxis process of BFO, and then the swarming, reproduction, and elimination-dispersal processes of BFO are implemented. Four imbalanced datasets are used to test the performance of the proposed hybrid algorithm. In each imbalanced dataset used for testing, there is a certain correlation between the variables, making the dataset multivariate. Through the proposed algorithm, these four multivariate imbalanced datasets are effectively classified, and its performance is compared with that of other algorithms. Experimental results show that for the different multivariate imbalanced datasets, the proposed algorithm is better than the original BFO algorithm in terms of various performance indicators. By combining the proposed algorithm with sensor-related technology, medical multivariate data and security monitoring system data obtained by sensors can in the future be analyzed to improve the classification accuracy of multivariate data.


Introduction
Classification is important in data mining. Classification refers to the establishment of a data classification model based on known data and their class attributes. The classification model classifies the data according to given classes in a database to predict new data. Most of the data used for disease diagnosis, face recognition, text classification, and financial risk prediction are imbalanced. When a traditional algorithm is used to solve such problems, the classification result tends toward the majority class, so the minority class is not correctly recognized. However, in many practical applications, the few minority samples are more valuable than the many majority samples. (1,2) Thus, the classification of imbalanced data is an important research topic in machine learning and data mining, as the accuracy of algorithms depends on how correctly the data are classified. Data mining for imbalanced data can be performed using a decision tree (DT), an artificial neural network (ANN), a genetic algorithm (GA), and a support vector machine (SVM). (3-6) Some methods such as the cost-sensitive classifier and snowball methods have been proposed to process imbalanced data. (7) The cost-sensitive classifier method focuses on minimizing misclassification costs as well as other types of cost associated with overspecific rules. The snowball method uses an ANN to learn rules from the instances of minority classes, and instances of majority classes are then added gradually as the ANN works dynamically. Unfortunately, the snowball method is effective only on some particular ANNs. (8) The bacterial foraging optimization (BFO) algorithm is a heuristic swarm intelligence optimization algorithm proposed by Professor Passino of Ohio State University in 2002. (9) Its theory is based on the foraging behavior of E. coli. Based on the chemotaxis, swarming, reproduction, and elimination-dispersal processes, BFO has satisfactory performance in solving optimization problems. 
(10,11) However, during the chemotaxis process, BFO depends on random search directions, which may delay reaching the global solution. Recently, the combination of BFO with other algorithms to solve optimization problems has been proposed. Compared with a single method, a hybrid method involving BFO and other algorithms yields better results in various systems, resulting in improved optimization performance. The simulated annealing (SA) algorithm adopts the Metropolis acceptance criterion. (12) The basic idea of SA comes from physical annealing. Starting from a high temperature, a constant decrease in the temperature leads to a random search for the global optimal solution of an objective function. That is, even after a local optimal solution is obtained, the global optimal solution can still be reached.
In this paper, a hybrid algorithm involving SA and BFO for mining imbalanced data is proposed to solve the problem of the local optimum in the chemotaxis process of the BFO algorithm. In the proposed algorithm, SA finds the optimal solution of the location by employing a jump process to resolve the uncertainty of the reversal direction in the chemotaxis process of BFO and avoid falling into a local optimum. After SA finds the optimal location, the BFO processes of swarming, reproduction, and elimination-dispersal are performed to improve the classification accuracy. The purpose of this study is to improve the classification accuracy of imbalanced data through the proposed hybrid algorithm of SA and BFO and to solve the problem that the original BFO easily falls into a local optimum. Four imbalanced datasets are used to test the performance of the proposed hybrid algorithm. In each imbalanced dataset used for testing, there is a certain correlation between the variables, making the dataset multivariate. Through the proposed algorithm, these four multivariate imbalanced datasets are effectively classified, and its performance is compared with that of other algorithms. By combining the proposed algorithm with sensor-related technology, medical multivariate data and security monitoring system data obtained by sensors can in the future be analyzed to improve the classification accuracy of multivariate data.
In Sect. 2, we first briefly review BFO and SA. The proposed algorithm is presented in Sect. 3. Simulation results are analyzed and discussed in Sect. 4. Finally, Sect. 5 concludes the paper.

BFO
The BFO algorithm performs optimization by random search. Its mathematical model has four main basic steps: chemotaxis, swarming, reproduction, and elimination-dispersal. (13) The foraging behavior of bacteria is mainly based on these four operations. In the chemotaxis process, E. coli has two basic movements in foraging: swimming and tumbling. In general, bacteria tumble more often in areas with poor environmental conditions and swim more often in better environments. After tumbling in a random direction Δ(i), bacterium i moves with step size α(i), and its position is updated as

θ^i(j + 1, k, l) = θ^i(j, k, l) + α(i) · Δ(i)/√(Δ^T(i)Δ(i)),  (1)

where θ^i(j, k, l) is the location of bacterium i at the jth chemotaxis, kth reproduction, and lth elimination-dispersal step. When bacteria swarm, they release attractants and repellents to signal one another. This cell-to-cell effect is modeled as

H_kk(θ, Q(j, k, l)) = Σ_{i=1}^{S} [−d_attract exp(−w_attract Σ_{m=1}^{p} (θ_m − θ_m^i)²)] + Σ_{i=1}^{S} [h_repellent exp(−w_repellent Σ_{m=1}^{p} (θ_m − θ_m^i)²)],  (2)

where H_kk(θ, Q(j, k, l)) is the penalty added to the actual cost function, S is the number of bacteria, p is the dimension of the search space, θ_m is the location of the fittest bacterium, d_attract is the depth of attraction, w_attract is the width of attraction, h_repellent is the height of repulsion, and w_repellent is the width of repulsion.
In the chemotaxis process, the maximum number of consecutive swimming steps is expressed as N_s. The formula used to calculate the fitness value of the swarming operation can be expressed as

H_swarm(i, j, k, l) = H(i, j, k, l) + H_kk(θ, Q(j, k, l)).  (3)

The rule of biological evolution in nature is survival of the fittest. In the reproduction process of BFO, bacteria after foraging are sorted by their energy values using the cost function H. The S/2 bacteria with the smallest energy values are eliminated, and the remaining S/2 bacteria are reproduced by replication. The newly replicated bacteria have the same foraging ability as the original bacteria, so the reproduction operation keeps the population size constant. After N_re reproduction steps, elimination-dispersal occurs with a certain probability P_ed, where N_ed is the number of elimination-dispersal steps. An individual undergoing the elimination-dispersal process dies, and a new individual is randomly generated at any location in the solution space. These new bacteria have random characteristics and may have a foraging ability different from that of the original bacteria. This randomness allows the population to jump out of a local optimal value, bringing it closer to the global optimal solution. The main steps of the BFO algorithm are as follows.
Step 1: Initialize the parameters: population size S, number of chemotaxis operations N_c, maximum number of forward steps in the chemotaxis operation N_s, number of reproduction operations N_re, number of elimination-dispersal operations N_ed, elimination-dispersal probability P_ed, and chemotaxis step size α(i).
Step 2: Elimination-dispersal loop: l = l + 1.
Step 3: Reproduction loop: k = k + 1.
Step 4: Chemotaxis loop: j = j + 1. For each bacterium i = 1, 2, ..., S: (1) Tumble: generate a random direction and move bacterium i one step of size α(i). (2) Calculate H(i, j, k, l) and store the optimal value H_best. (3) Swim: move bacterium i a further step in the same direction. (4) Calculate the fitness value H(i, j + 1, k, l) according to the information of bacterium θ^i(j + 1, k, l). (5) Rotation judgment condition: let m be the counter of consecutive swimming steps. If m < N_s, then m = m + 1; if H(i, j + 1, k, l) > H_best, then H_best = H(i, j + 1, k, l). A new H(i, j + 1, k, l) is calculated according to the information of bacterium θ^i(j + 1, k, l) until m reaches N_s. (6) Repeat the process for the next bacterium.
Step 5: If j < N c , return to step 4 for the bacterial chemotaxis operation.
Step 6: Reproduction: remove the S/2 bacteria with the lowest energy values and replicate the remaining S/2 bacteria.
Step 7: If k < N_re, return to step 3.
Step 8: Elimination-dispersal: when certain conditions are met, bacteria find food again with probability P ed . If l < N ed , return to step 2; otherwise, end the optimization.
Step 9: Has the maximum number of BFO iterations been reached? If so, the result H(i, j, k, l) is output.
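The main steps above can be sketched in Python as a minimization loop (the text maximizes a fitness H, which is equivalent to minimizing −H). This is an illustrative sketch, not the paper's implementation: the search bounds, the swim-improvement test, and the example cost function are assumptions, and the swarming penalty H_kk is omitted for brevity.

```python
import numpy as np

def bfo(cost, dim, S=20, Nc=20, Ns=4, Nre=4, Ned=2, Ped=0.25, alpha=0.1, seed=0):
    """Minimal BFO sketch (minimization). Parameters follow the text: S bacteria,
    Nc chemotaxis steps, Ns swim length, Nre reproduction steps, Ned
    elimination-dispersal steps, Ped dispersal probability, alpha step size."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(-5.0, 5.0, (S, dim))      # bacterial positions
    best_pos, best_cost = theta[0].copy(), cost(theta[0])
    for _ in range(Ned):                          # elimination-dispersal loop
        for _ in range(Nre):                      # reproduction loop
            health = np.zeros(S)                  # accumulated cost ("energy")
            for _ in range(Nc):                   # chemotaxis loop
                for i in range(S):
                    J = cost(theta[i])
                    d = rng.standard_normal(dim)  # tumble: random unit direction
                    d /= np.linalg.norm(d)
                    for _ in range(Ns):           # swim while the cost improves
                        cand = theta[i] + alpha * d
                        if cost(cand) < J:
                            theta[i], J = cand, cost(cand)
                        else:
                            break
                    health[i] += J
                    if J < best_cost:
                        best_cost, best_pos = J, theta[i].copy()
            order = np.argsort(health)            # reproduction: healthier half splits
            theta = np.vstack([theta[order[:S // 2]]] * 2)
        disperse = rng.random(S) < Ped            # elimination-dispersal
        theta[disperse] = rng.uniform(-5.0, 5.0, (int(disperse.sum()), dim))
    return best_pos, best_cost
```

On a simple sphere function, `bfo(lambda x: float(np.sum(x ** 2)), dim=2)` drives the cost close to zero, illustrating how the four nested operations cooperate.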

SA
The SA algorithm is heuristic and simulates the physical cooling process of a classical particle system in thermodynamics. Kirkpatrick et al. first proposed the SA algorithm in 1983. (14) When the temperature T of an isolated particle system decreases very slowly, the system can be considered to be in thermodynamic equilibrium, and its energy is the lowest. In the SA algorithm, the Metropolis acceptance criterion is applied at each value of the control temperature parameter T. That is, the iterative process of "generating a new solution, judging it, and accepting or discarding it" is repeated, and the system finally reaches equilibrium at temperature T, at which point the optimal solution is obtained. The SA algorithm flow is as follows.
(1) Determine the initial temperature T_0, the final temperature T_f, and the starting point x_0; obtain the function value f(x_0), the number of Metropolis iterations M_iter, and the temperature cooling rate λ, 0 < λ < 1. (2) Generate a new point x' in the neighborhood of the current point x and calculate f(x'). (3) Calculate the difference Δf = f(x') − f(x). (4) If Δf ≤ 0, accept x' as the starting point for the next operation; if Δf > 0, accept x' with probability e^(−Δf/T). Otherwise, the original point is still used as the starting point for the next operation. (5) The temperature is gradually reduced as T ← λT, where 0 < λ < 1, and the above process is repeated until the specified end condition is reached.
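The flow above can be sketched as follows. This is a minimal illustration, not the paper's code: the Gaussian neighborhood move and the test function are assumptions.

```python
import math
import random

def simulated_annealing(f, x0, T0=100.0, Tf=0.01, lam=0.95, M_iter=50, seed=0):
    """Minimal SA sketch (minimization): Metropolis acceptance at each
    temperature level, geometric cooling T <- lam * T until T <= Tf."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best, fbest = x, fx
    T = T0
    while T > Tf:
        for _ in range(M_iter):                  # Metropolis iterations at this T
            xn = x + rng.gauss(0, 1)             # new point in the neighborhood
            fn = f(xn)
            delta = fn - fx
            # accept downhill moves always, uphill with probability e^(-delta/T)
            if delta <= 0 or rng.random() < math.exp(-delta / T):
                x, fx = xn, fn
                if fx < fbest:
                    best, fbest = x, fx
        T *= lam                                 # cooling
    return best, fbest
```

For example, `simulated_annealing(lambda v: (v - 3.0) ** 2, 0.0)` converges near the minimum at x = 3, showing how the probabilistic uphill acceptance lets the search escape local regions before cooling freezes it.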

Proposed Algorithm
The purpose of this study is to improve the classification accuracy of mining imbalanced data by using an effective algorithm, namely, a hybrid algorithm based on SA and BFO. The algorithm solves the problem of the original BFO falling into a local optimum. We propose to insert SA into the chemotaxis process of BFO, use the probabilistic jumping characteristic of SA to escape the local optimum, and thereby improve the classification accuracy of the original BFO. This is the main innovation of this paper. Four datasets from the University of California Irvine (UCI) repository were used for testing the performance of the proposed hybrid algorithm: an E. coli dataset, a zoo dataset, a spam email dataset, and a Pima Indian diabetes dataset. (15) The E. coli dataset had a total of 334 instances with eight features and an imbalance ratio of about 1:15.8 (Table 1). The zoo dataset had 101 instances with 17 features (Table 2) and an imbalance ratio of 1:25. The spam email dataset had 4601 e-mails with 58 features and an imbalance ratio of 1:1.54 (Table 3). The Pima Indian diabetes dataset had 768 instances with nine features and an imbalance ratio of 1:2.34 (Table 4). The flow chart of the hybrid algorithm is shown in Fig. 1. With the set parameters, the borderline synthetic minority oversampling technique (borderline-SMOTE) and the Tomek link were used to preprocess the data. Then, the SA algorithm was used to improve the chemotaxis operation in BFO to classify the imbalanced data and overcome the shortcoming of the BFO algorithm of falling into a local optimum.
The basic idea of the hybrid algorithm with borderline-SMOTE is to determine borderline minority instances, apply a SMOTE algorithm to generate synthetic instances to oversample the minority class, and finally balance the datasets. (18) To create the hybrid algorithm, we used the Euclidean distance to find the k nearest neighbors of the instance x_i ∈ S_min, where S_min is the set of minority class instances, i ∈ {1, ..., n}, and n is the number of minority instances. Here, x̂ is one of the k nearest neighbors. Next, we randomly selected x̂ from the k nearest neighbors of x_i and then generated a random number P_rand ∈ [0, 1]. Finally, we used Eq. (4),

x_new = x_i + P_rand × (x̂ − x_i),  (4)

to generate a new instance and repeated the previous step until the numbers of instances were balanced.
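The borderline-SMOTE step described above can be sketched as follows. This is a simplified illustration rather than the reference implementation: the "danger"-point rule (at least one but not all of the k nearest neighbors belong to the majority class) and the fallback when no borderline point is found are assumptions made here for a self-contained example.

```python
import numpy as np

def borderline_smote(X, y, minority=1, k=3, seed=0):
    """Simplified borderline-SMOTE sketch: pick borderline minority points and
    interpolate new instances x_new = x_i + P_rand * (x_hat - x_i) until the
    classes balance."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority)
    X_min = X[min_idx]
    n_new = int(np.sum(y != minority) - len(min_idx))
    if n_new <= 0:
        return X, y
    danger = []                           # borderline ("danger") minority points
    for xi in X_min:
        nn = np.argsort(np.linalg.norm(X - xi, axis=1))[1:k + 1]
        n_maj = int(np.sum(y[nn] != minority))
        if 0 < n_maj < k:                 # mixed neighborhood -> borderline
            danger.append(xi)
    if not danger:                        # fallback: use all minority points
        danger = list(X_min)
    new_rows = []
    for _ in range(n_new):
        xi = danger[rng.integers(len(danger))]
        # x_hat: a random one of xi's k nearest minority neighbors
        nn = np.argsort(np.linalg.norm(X_min - xi, axis=1))[1:k + 1]
        x_hat = X_min[nn[rng.integers(len(nn))]]
        p = rng.random()                  # P_rand in [0, 1]
        new_rows.append(xi + p * (x_hat - xi))   # interpolation of Eq. (4)
    X_bal = np.vstack([X, np.array(new_rows)])
    y_bal = np.concatenate([y, np.full(n_new, minority)])
    return X_bal, y_bal
```

Each synthetic point lies on the segment between a borderline minority instance and one of its minority neighbors, so oversampling concentrates where the classes meet.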
The Tomek link reduces the impact of class overlap on classification performance. (19) The basic idea is as follows. Given a pair of instances (x_i, x_j), where x_i belongs to the majority class and x_j to the minority class, with the distance between the two points defined as d(x_i, x_j), if there is no instance x_k such that d(x_i, x_k) < d(x_i, x_j) or d(x_j, x_k) < d(x_i, x_j), then x_i and x_j form a Tomek link pair. In this study, the parameter k used for SMOTE was set to k = 3. After preprocessing the data, θ^i was generated. Thereafter, the BFO processes of chemotaxis, swarming, reproduction, and elimination-dispersal are iterated. To solve the uncertainty of the reversal direction in the chemotaxis process of BFO and avoid the local optimum, SA was incorporated into BFO. As SA was added to the chemotaxis process of each individual bacterium, the cost of the bacterium was decided according to SA. The process of the hybrid algorithm was as follows.
Step 1: In the chemotaxis process, the SA algorithm begins with four parameters: M_iter, T_0, T_f, and λ. M_iter denotes the maximum number of iterations, T_0 represents the initial temperature, T_f is the final temperature at which the proposed algorithm stops as the temperature decreases, and λ is the coefficient controlling the cooling rate. The current temperature T is set equal to T_0. The solution is represented as the features in the dataset followed by θ^i, as shown in Fig. 2. An initial solution τ is generated according to the representation of the solution in Fig. 2. For each generation, the next solution η is generated from τ by randomly swapping features and randomly generating θ^i in the current solution. Let obj(τ) denote the testing classification accuracy of τ and Δ denote the difference between obj(τ) and obj(η), that is, Δ = obj(τ) − obj(η). If Δ ≤ 0, the probability of replacing τ with η is 1, where τ is the current solution and η is the next solution. Meanwhile, if Δ > 0, the probability of replacing τ with η is e^(−Δ/T). This is achieved by generating a random number r_rand ∈ [0, 1] and replacing the solution τ with η when e^(−Δ/T) > r_rand. The process is repeated until T is lower than T_f. Thereafter, SA obtains the best solution in the chemotaxis process.
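The acceptance rule of this step can be written compactly as follows. A sketch only: the obj values are assumed to be classification accuracies in [0, 1], as in the text.

```python
import math
import random

def accept(obj_tau, obj_eta, T, rng=None):
    """Metropolis rule of the chemotaxis step: delta = obj(tau) - obj(eta).
    delta <= 0 means eta is at least as accurate, so it is always accepted;
    otherwise eta is accepted when e^(-delta/T) > r_rand."""
    delta = obj_tau - obj_eta
    if delta <= 0:
        return True
    r_rand = (rng or random).random()
    return math.exp(-delta / T) > r_rand
```

At high T even a worse η is often accepted, which is exactly the jump that lets the bacterium leave a local optimum; as T falls toward T_f, only improving swaps survive.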
Step 2: In the swarming process, Eq. (3) is applied to evaluate the cost H kk .
Step 3: In the reproduction process, the S_r (= S/2) bacteria with the lowest costs H die, and the remaining bacteria with the highest costs each split into two bacteria at the same location.
Step 4: In the elimination-dispersal process, a new θ^i obtained by SA is generated with probability P_ed. If the maximum number of BFO iterations has not been reached, the process returns to Step 1.
Step 5: When the maximum number of BFO iterations is reached, the BFO stops. Finally, the classification accuracy result is reported.

In multivariate imbalanced data, there is a certain correlation between the variables. Classification accuracy is commonly used in multivariate data analysis to indicate that the classification effect is good, but for imbalanced data, a model that simply returns the majority class for every instance can still achieve a high classification accuracy. Therefore, as evaluation indicators for the classification of imbalanced data, this paper also utilizes precision, recall, the f1 score, the receiver operating characteristic (ROC) curve, and the area under the curve (AUC). These performance indicators are calculated on the basis of the confusion matrix shown in Table 5, where TP is the number of instances that are predicted to be positive and are actually positive, TN is the number of instances that are predicted to be negative and are actually negative, FP is the number of instances that are actually negative but are predicted to be positive, and FN is the number of instances that are actually positive but are predicted to be negative.

The classification accuracy reflects the classifier's ability to judge the entire set of instances as positive or negative and is given by

Accuracy = (TP + TN)/(TP + TN + FP + FN).

The precision rate represents the proportion of correctly predicted positive instances out of all positive predictions made by the classifier:

Precision = TP/(TP + FP).

The recall rate represents the proportion of actual positive instances that are predicted to be positive:

Recall = TP/(TP + FN).

The f1 score is a measure used in classification problems. It uses the harmonic average to comprehensively consider the precision rate and the recall rate; its maximum is 1 and its minimum is 0. The f1 score calculation formula is

f1 score = 2 × Precision × Recall/(Precision + Recall).

To verify the performance of the model, the evaluation criterion used in this experiment is the AUC, expressed as the area under the ROC curve. The larger the AUC value, the more effective the model. The equation for the AUC is

AUC = (Σ_{i ∈ positive class} Rank_i − M(M + 1)/2)/(M × N),

where Σ_{i ∈ positive class} Rank_i represents the sum of the sequence numbers (ranks) of the positive instances, Rank_i is the sequence number of the ith instance, M denotes the number of positive instances, and N denotes the number of negative instances.
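The indicators above can be computed directly from the confusion-matrix counts, and the AUC from the rank formula. A small sketch, assuming integer counts and distinct scores (ties in scores are ignored for simplicity):

```python
def metrics(tp, tn, fp, fn):
    """Confusion-matrix indicators: accuracy, precision, recall, f1 score."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

def rank_auc(scores, labels):
    """AUC via the rank formula: (sum of positive ranks - M(M+1)/2) / (M*N).
    labels are 0/1; ranks start at 1 for the lowest score; ties are ignored."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank = {i: r + 1 for r, i in enumerate(order)}
    M = sum(labels)                    # number of positive instances
    N = len(labels) - M                # number of negative instances
    pos_rank_sum = sum(rank[i] for i in range(len(labels)) if labels[i] == 1)
    return (pos_rank_sum - M * (M + 1) / 2) / (M * N)
```

For instance, with TP = 8, TN = 90, FP = 2, FN = 0 the accuracy is 0.98 while the precision is only 0.8, illustrating why accuracy alone is insufficient for imbalanced data.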

Setting of experimental parameters
The simulations required the parameters for the algorithm. The parameters of BFO used in this study were S = 50, N c = 100, N s = 4, N re = 4, N ed = 2, P ed = 0.25, d attract = 0.05, h repellent = 0.05, w attract = 0.05, w repellent = 0.05, and α(i) = 0.1, i = 1, 2,…, S. The number of BFO iterations was 800 (N c × N re × N ed = 100 × 4 × 2 = 800). SA was used to solve the uncertainty of the reversal direction during the chemotaxis of BFO and avoid falling into a local optimum. As the parameters of SA, the maximum number of iterations was M iter = 5000, the initial temperature was T 0 = 100, the final temperature was T f = 0.01, and the cooling rate was λ = 0.95. The evaluation model of the test dataset was verified by 10-fold cross-validation, and 90% of the data was used as training data and the rest used as testing data.
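The 10-fold cross-validation scheme described above, in which each fold holds out about 10% of the data for testing and trains on the remaining 90%, can be sketched as follows; the shuffling seed is an assumption.

```python
import numpy as np

def ten_fold_splits(n, seed=0):
    """10-fold CV sketch: shuffle indices, split into 10 folds, and yield
    (train, test) index pairs where each fold serves as the test set once."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, 10)
    for i in range(10):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(10) if j != i])
        yield train, test
```

Every instance appears in exactly one test fold, so the reported accuracy is an average over ten disjoint held-out sets.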
Parameter settings are key to the performance and efficiency of the algorithm. As BFO has many parameters, suitable values are required to optimize its performance. The analysis of the parameters, convergence, and computational complexity is as follows.
(1) The population size S affects the performance of BFO. When the population is small, the calculation speed of BFO is high, but the diversity of the population is reduced, which degrades the optimization performance of the algorithm. The larger the value of S, the better the algorithm avoids falling into a local optimal value. However, when the population is too large, the number of calculations increases and the convergence speed of the algorithm decreases. (2) The larger the value of N_c, the number of chemotaxis operations, the more detailed the search of the algorithm; however, the complexity of the algorithm also increases. Conversely, the smaller the value of N_c, the more easily the algorithm falls into a local optimal value, and the performance becomes more dependent on the other operations. (3) The larger the value of N_re, the number of reproduction operations, the higher the convergence speed of the algorithm. Of course, a very large N_re increases the complexity of the algorithm, while a very small N_re makes the algorithm converge prematurely. (4) N_ed is the number of elimination-dispersal operations. If N_ed is too small, the algorithm does not randomly search new areas in the elimination-dispersal operation. Conversely, the larger the value of N_ed, the larger the area that the algorithm searches and the greater the diversity of the solutions. Premature convergence is then prevented, but the complexity of the algorithm also increases. Choosing an appropriate value of the elimination-dispersal probability P_ed helps the algorithm jump out of a local optimal value, but if the value of P_ed is too large, BFO degenerates into a random search algorithm. The advantage of a heuristic search algorithm is that one set of solutions is obtained in one run, reducing the time and computational cost required to find a good solution.
In this study, the SA algorithm solved the problem of the local optimization of the original BFO by combining it with the BFO chemotaxis operation. This improved the original BFO chemotaxis operation.

Comparison of experimental results with other algorithms
We tested the following algorithms for comparison with the hybrid algorithm: SVM, DT, k-nearest neighbor (KNN), back-propagation network (BPN), and BFO. SVM easily performs nonlinear classification by replacing the kernel. With the radial basis kernel function, the parameters of SVM used in this study were a penalty of 1 and a gamma of 0.1. DT is a decision support tool that uses a tree-like model of decisions. The parameters of DT used in this study were a minimum of two cases and a pruning confidence factor of 0.1. KNN is a simple machine learning method that classifies according to the distance between different feature values. The KNN parameter used in this study was k = 3, and the Euclidean distance was used. BPN is a learning method used in many neural networks, and its network behavior is based on training data of input/output patterns. It is suitable for applications in diagnosis, prediction, classification, and other problems. In this paper, BPN used the sigmoid function in its hidden layer, namely, f(x) = 1/(1 + e^(−x)); the number of hidden layer nodes was set to 15, and the output layer used the linear function f(x) = x. The learning rate was 0.05, and the maximum number of iterations was 25000. The BFO algorithm is described in Sect. 2.1.
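As an illustration of the KNN settings above (k = 3, Euclidean distance, majority vote), a minimal sketch is:

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """KNN sketch: classify each test point by majority vote among its
    k nearest training instances under the Euclidean distance."""
    preds = []
    for x in X_test:
        d = np.linalg.norm(X_train - x, axis=1)  # Euclidean distances
        nn = np.argsort(d)[:k]                   # k nearest training instances
        vals, counts = np.unique(y_train[nn], return_counts=True)
        preds.append(vals[np.argmax(counts)])    # majority vote
    return np.array(preds)
```

This baseline has no training phase; its behavior is determined entirely by k and the distance metric, which is why only k = 3 needed to be reported for it.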
(1) From Table 6, the average classification accuracy of the proposed algorithm for the E. coli dataset is 97.61%. The average classification accuracies for the zoo, spam email, and Pima Indian diabetes datasets are 99.55, 96.32, and 97.66%, respectively. It can be seen from Table 6 that the proposed algorithm has a higher classification accuracy than the other algorithms for all four datasets. This is because the classification for these tested datasets is performed on the basis of heuristic information, and the proposed approach shares this property, so it performs well in terms of classification accuracy. (2) The classification accuracies of the proposed algorithm for the E. coli, zoo, spam email, and Pima Indian diabetes datasets are better than those for the original BFO method of 90.12, 94.36, 95.25, and 93.64%, respectively. This is because the proposed algorithm adds SA to improve the chemotaxis process, that is, SA is added to the chemotaxis process of each individual bacterium. Owing to its probabilistic jumping ability, the proposed algorithm overcomes the problem that the original BFO easily falls into a local optimum during the chemotaxis process, and then performs the swarming, reproduction, and elimination-dispersal processes of BFO; thus it achieves better classification accuracy. (3) A previous study combined BFO with a robust fuzzy algorithm (RFA) to analyze asthma data. (20) In the RFA-BFO algorithm, the classification accuracy for the UCI zoo multivariate imbalanced dataset used for testing was 99.5%. In this paper, the proposed algorithm utilizes the same zoo dataset and achieves an accuracy of 99.55% (to two decimal places). Both RFA-BFO and the proposed algorithm have good classification results for imbalanced data. In the RFA-BFO algorithm, the RFA, with its property of robustness, can reduce the influence of noise or outliers; it can establish a fuzzy model and analyze multivariate imbalanced data. 
However, the disadvantage of RFA based on experience is that the simple fuzzy processing of information may reduce the accuracy of data classification. On the other hand, in this study, SA is embedded in the chemotaxis of BFO to solve the problem of BFO easily falling into a local optimum. That is, SA is added to the chemotaxis process of each individual bacterium. In the chemotaxis process, SA is performed to obtain the updated location of the solution, and then the swarming, reproduction, and elimination-dispersal processes of BFO are performed. In the elimination-dispersal process, a new location of the solution is generated by SA according to the probability P ed . Finally, when the criterion is satisfied, the classification accuracy results are output. Our purpose is to obtain an efficient search algorithm for multivariate imbalanced data, and we do not intend to discuss which of the two methods, RFA-BFO or the proposed algorithm, is better. (4) To verify that the model does not return the classification results to the majority class to increase the classification accuracy, we calculate the precision, recall, and f1 score to obtain the results in Table 7. It can be seen from Table 7 that for the four datasets, although the imbalance rate is different, the proposed algorithm shows balance between the majority class and the minority class in the classifier training. The f1 score of the proposed algorithm is also better than that of the other algorithms.

AUC evaluation
In signal detection theory, the value of AUC is between 0 and 1, and the larger the value, the better the model. The experimental results in Table 8 are the AUC values of the hybrid method with the four datasets. The AUC of the E. coli dataset was 0.974 as shown in Fig. 3. The AUC values of the other three datasets were 0.993 (zoo, Fig. 4), 0.997 (spam email, Fig. 5), and 0.963 (Pima Indian diabetes, Fig. 6). The AUC for each dataset exceeded 0.96, which proved the effectiveness of the hybrid algorithm in this study.

Conclusions
A hybrid algorithm based on SA and BFO for mining imbalanced data was proposed in this study. For the preprocessing of imbalanced data, we utilized borderline-SMOTE and the Tomek link. Because the SA algorithm has the characteristic of a probability-based jumping process, it can effectively avoid falling into a local optimum during the search process. The proposed hybrid algorithm adopted the BFO algorithm with the characteristics of SA to effectively resolve the uncertainty of the chemotaxis process. Four multivariate imbalanced datasets (E. coli, zoo, spam email, and Pima Indian diabetes) and other algorithms (SVM, DT, KNN, BPN, and BFO) were used for testing and comparison with the performance of the hybrid algorithm. The average classification accuracies of the proposed algorithm for the E. coli, zoo, spam email, and Pima Indian diabetes datasets were 97.61, 99.55, 96.32, and 97.66%, respectively. According to the experimental results, the hybrid algorithm achieved a significant improvement in various performance indicators compared with the other methods. The proposed algorithm can be applied to sensor-related experimental data to classify multivariate data obtained by sensors and improve prediction results.
To build on the hybrid algorithm based on SA and BFO for mining imbalanced data proposed in this study, we make the following suggestions: (1) Improve the operation of the BFO algorithm by improving the chemotaxis, reproduction, and elimination-dispersal processes to increase the classification accuracy of imbalanced data.