Uniform Experimental Design for Optimizing the Parameters of Multi-input Convolutional Neural Networks

1Department of Computer Science and Information Engineering, National Chin-Yi University of Technology, Taichung 411, Taiwan 2Department of Electrical Engineering, National Formosa University, Yunlin 632, Taiwan 3Smart Machinery and Intelligent Manufacturing Research Center, National Formosa University, Yunlin 632, Taiwan 4Department of Electrical Engineering, National Chung Hsing University, Taichung City 402, Taiwan


Introduction
In traditional machine learning methods, image features must be defined and captured by the user in advance. (1,2) Recently, convolutional neural networks (CNNs) have been used to capture features automatically, thereby overcoming this problem. CNNs have therefore been widely and successfully applied in image recognition, (3)(4)(5) speech recognition, (6) colorimetric models, (7) and face recognition. (8,9) CNNs are the most commonly used architecture for deep learning, and they exhibit superior performance in image recognition. In 1998, LeCun et al. proposed the first CNN architecture, LeNet-5, (10) and applied it to handwriting recognition. However, owing to problems such as an excessive number of parameters, vanishing gradients, and a lack of suitable hardware, the costs of the architecture exceeded its benefits, and deep learning did not become popular at the time. Krizhevsky et al. later proposed the AlexNet (11) architecture and introduced the dropout method (12) to prevent the network from overfitting. Many researchers have since proposed deep CNN architectures; popular examples include GoogLeNet, proposed by Szegedy et al., (13) and VGG, proposed by Simonyan and Zisserman. (14) Although CNNs have been successfully used in various fields, most of them use only a single input. Therefore, some researchers have explored dual-input CNNs. In 2015, Su et al. (15) proposed a multiview CNN for classifying 3D models: 2D images of a 3D model captured from different perspectives were used as network inputs, and the image features of the multiple perspectives were then combined. In 2017, Sun et al. (16) used a dual-input CNN for flower grading; they used three flower images taken at different positions as the input and combined the image features after a single convolution and pooling operation. In 2019, Li et al. (17) developed a dual-input neural network architecture for detecting coronary artery disease.
Two types of signals, namely, electrocardiogram and phonocardiogram signals, were used as the network input. The features of the two signal types were combined to improve classification accuracy. In the aforementioned studies, the architecture parameters were selected by the user through trial and error.
The basic CNN architecture is shown in Fig. 1. The architecture comprises an input layer, an output layer, and multiple hidden layers. The kernel size, stride, and padding of the filters in the convolution layer and of the pooling window in the pooling layer are set by users according to experience. However, major design problems occur as the depth of a CNN increases, and the parameters selected by the user are rarely optimal. Determining the optimal parameters of a CNN architecture through repeated training experiments is time-consuming. In the engineering field, two methods are commonly used for optimizing parameters: the Taguchi method (18)(19)(20)(21) and the uniform experimental design (UED) method. (22)(23)(24)(25) The Taguchi method is simpler in design than the UED method; however, it is only suitable for experiments with few levels and factors because the minimum number of runs it requires is equal to the square of the number of levels. Compared with the Taguchi method, the UED method requires fewer runs; it uses multiple regression to find the optimal parameters in the shortest possible time.
To overcome the drawbacks of a single-input architecture, we propose in this paper a multi-input CNN based on UED for gender classification applications. The proposed CNN uses multiple CNNs that are trained individually, and their outputs are combined through concatenation. To avoid determining the architecture parameters of the CNN through trial and error, UED was used in this study; under UED, multiple regression analysis is used to obtain the optimal parameters. Different numbers of inputs and different CNN architectures were used in the experiments of this study to verify the suitability of the proposed method for the CIA and MORPH datasets.
The remainder of this paper is organized as follows. Section 2 introduces the UED method. The architecture of the multi-input CNN, which comprises a convolutional layer, pooling layer, and fully connected layer, is described in Sect. 3. Section 4 presents the experimental results of the dual-input CNN for the CIA and MORPH datasets. Section 5 describes the effects of different numbers of inputs and different CNN architectures. Section 6 presents the conclusions and future research directions.

UED Method
In the UED method, multiple regression analysis is used to determine the optimal parameters. The number of runs required in the UED method is considerably lower than that required in the Taguchi method; thus, the optimal parameters can be found in a short time. For an experiment with three factors and three levels, the Taguchi method requires at least nine runs, whereas the UED method requires only five runs. A uniform layout (UL) is denoted by U_a(a^b), where U is the UL symbol, a is the number of levels (and of experiments), and b is the number of parameters. The overall design process is displayed in Fig. 2. The steps in the design process are as follows. The first step involves selecting the factors to be improved. Consider the basic CNN displayed in Fig. 1. This CNN has six affecting factors, including the kernel size, stride, and padding. The values of these parameters are preset (Table 1).
The number of experiments is determined according to the number of affecting factors as follows:

n = 2S + 1,

where n is the number of experiments and S is the number of affecting factors. If the number of experiments is less than 12, the uniformity is poor; therefore, the number of experiments must be greater than 12. Consequently, with six affecting factors (S = 6), the number of experiments is set to 13. The second step involves calculating the total number of columns from the number of experiments.
After the numbers of experiments and columns are determined, the table entries x_{i,j} can be obtained as follows:

x_{i,j} = (i × j) mod n (with a result of 0 replaced by n),

where i = 1, 2, 3, ..., m and j = 1, 2, 3, ..., n. For example, if m is 12 and n is 13, the UL is represented as U_13(13^12). The initial UED table is presented in Table 2. The third step involves selecting the use table according to U_13(13^12). As presented in Table 3, when the number of affecting factors is 6, the row comprising the column numbers 1, 2, 6, 8, 9, and 10 is selected. The new UED table is presented in Table 4.
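The table construction above can be sketched in a few lines. This is a minimal sketch that assumes the standard good-lattice-point rule x_{i,j} = (i × j) mod n stated above; the function name and column selection are ours, not from the paper.

```python
def uniform_layout(n, m):
    """Build the initial U_n(n^m) table: rows are runs, columns are factors.

    Entry (i, j) is (i * j) mod n, with a result of 0 replaced by n.
    """
    table = []
    for i in range(1, n + 1):
        row = []
        for j in range(1, m + 1):
            v = (i * j) % n
            row.append(n if v == 0 else v)
        table.append(row)
    return table

# Initial table for U_13(13^12), as in Table 2.
ul = uniform_layout(13, 12)

# For the 6-factor design, keep the columns named by the use table (1, 2, 6, 8, 9, 10).
design = [[row[c - 1] for c in (1, 2, 6, 8, 9, 10)] for row in ul]
```

Each column of the resulting table is a permutation of 1 to 13, which is what gives the design its uniform coverage of the level space.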

Multi-input CNN
This section describes the proposed multi-input CNN. The term "multi-input" refers to the training of CNNs by using multiple inputs. This section uses the dual-input AlexNet network architecture as an example. Figure 3 displays a dual-input AlexNet network architecture. Two different inputs are fed into two identical CNN architectures. After the AlexNet calculation is completed, the data are combined through concatenation and the characteristic information is then passed to the fully connected layer for classification.
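The branch-concatenate-classify flow described above can be illustrated with a toy sketch. Here each "branch" is reduced to a single linear map with ReLU rather than a full AlexNet; all names, sizes, and weights are hypothetical placeholders for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def branch_features(x, w):
    """Stand-in for one AlexNet branch: a linear map followed by ReLU."""
    return np.maximum(w @ x, 0.0)

# Two inputs (e.g., two face images flattened to vectors) pass through two
# identical branches; their feature vectors are concatenated and the combined
# vector is passed to a shared fully connected classification layer.
x1, x2 = rng.standard_normal(16), rng.standard_normal(16)
w = rng.standard_normal((8, 16))          # shared branch weights (toy size)
f = np.concatenate([branch_features(x1, w), branch_features(x2, w)])

w_fc = rng.standard_normal((2, 16))       # fully connected layer: 2 classes
logits = w_fc @ f
probs = np.exp(logits) / np.exp(logits).sum()   # softmax output
```

Extending to three or four inputs only changes the number of branch outputs concatenated before the fully connected layer.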
The CNN architecture in the proposed network can be freely selected. Three well-known CNN architectures, namely, LeNet, AlexNet, and GoogLeNet, are commonly used by researchers. In this study, we focused on the AlexNet network architecture. AlexNet is more popular in applications than LeNet and GoogLeNet because its size lies between those of the two architectures.
AlexNet has two main characteristics. First, it uses a nonlinear activation function [i.e., the rectified linear unit (ReLU)] with a high convergence speed. Prior to the development of AlexNet, most neural networks used the sigmoid function, which suffers from the vanishing gradient problem, as the activation function. The ReLU function is simpler to compute than the sigmoid function; only a threshold comparison is required to obtain the activation value. Second, the use of the dropout method in the first and second fully connected layers of the AlexNet architecture can effectively reduce the occurrence of overfitting.
To determine the optimal parameters of the multi-input CNN, the UED was used in this study. The entire experimental process is displayed in Fig. 4. In the first step, the parameters to be optimized are selected in the CNN architecture. In the second step, a UED is used to find the optimal weight through multiple regression analysis. The third step involves determining the optimal parameters by using the optimal weight. The fourth step involves confirming whether the UED provides the highest possible accuracy rate. If yes, the process is completed; otherwise, the process returns to the second step.
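The second and third steps above, fitting a regression model to the UED runs and reading off the best predicted setting, can be sketched as follows. The run settings and accuracies below are hypothetical (and use only two factors instead of six) purely to illustrate the mechanics.

```python
import numpy as np
from itertools import product

# Hypothetical illustration: 13 runs over 2 factors (kernel size, stride)
# with made-up accuracies; the real design in this paper uses 6 factors.
X = np.array([[3, 1], [5, 1], [7, 1], [3, 2], [5, 2], [7, 2],
              [3, 3], [5, 3], [7, 3], [9, 1], [9, 2], [9, 3], [11, 1]], float)
y = np.array([98.1, 98.7, 98.5, 98.0, 98.9, 98.6,
              97.8, 98.4, 98.2, 98.3, 98.5, 98.0, 98.2])

# Fit accuracy ~ b0 + b1*kernel + b2*stride by least squares (the "optimal
# weights"), then evaluate the model over the full level grid to pick the
# predicted-best parameter combination.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

grid = np.array(list(product([3, 5, 7, 9, 11], [1, 2, 3])), float)
pred = np.column_stack([np.ones(len(grid)), grid]) @ coef
best = grid[int(np.argmax(pred))]
```

The chosen combination is then verified with an actual training run; if its accuracy does not exceed the best UED run, the design loop is repeated, as in Fig. 4.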

Convolution layer
In the convolution layer, the mask of the convolution kernel is used to perform a convolution operation on the input matrix through the sliding window method. Figure 5 illustrates the convolution process. In Fig. 5, the length and width of the input image are both 5, the length and width of the convolution kernel are both 3, and the stride is 1. The output matrix size is obtained using the following equations:

W_o = (W_i − K_w + 2p)/s + 1,
H_o = (H_i − K_h + 2p)/s + 1,

where W_o and H_o represent the width and height of the output matrix, respectively; W_i and H_i represent the width and height of the input matrix, respectively; p is the padding size; and s is the stride size. The output matrix element O_{RC} of the convolution operation is expressed as follows:

O_{RC} = Σ_{i=1}^{K_h} Σ_{j=1}^{K_w} k_{ij} x_{(R+i−1)(C+j−1)},

where K_h and K_w represent the height and width of the convolution kernel, respectively. In general, the convolution kernel is square (i.e., K_h = K_w). The term k_{ij} represents a weight of the convolution kernel, and x_{ij} denotes an element of the input image matrix.
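The output-size equations and the sliding-window sum above can be checked with a short NumPy sketch (the function names are ours; this is an unpadded, single-channel illustration, not the paper's implementation):

```python
import numpy as np

def conv_output_size(w_i, h_i, k, p, s):
    """W_o = (W_i - K + 2p)/s + 1, and likewise for the height."""
    return (w_i - k + 2 * p) // s + 1, (h_i - k + 2 * p) // s + 1

def conv2d(x, k, stride=1):
    """Naive convolution: slide the kernel over x and sum the products."""
    kh, kw = k.shape
    h_o = (x.shape[0] - kh) // stride + 1
    w_o = (x.shape[1] - kw) // stride + 1
    out = np.zeros((h_o, w_o))
    for r in range(h_o):
        for c in range(w_o):
            window = x[r * stride:r * stride + kh, c * stride:c * stride + kw]
            out[r, c] = np.sum(k * window)   # O_RC for this position
    return out
```

For the Fig. 5 setting (5 × 5 input, 3 × 3 kernel, stride 1, no padding), `conv_output_size(5, 5, 3, 0, 1)` gives a 3 × 3 output, matching the equations.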

Pooling layer
In the pooling process, a mask is used to perform operations on the input matrix with a sliding window. This process is similar to the convolution operation. The only difference is that the mask does not overlap elements in the pooling process. In other words, each element in the input matrix is only covered once by the mask. Therefore, the dimensionality of the matrix can be reduced through the pooling process.
Two types of pooling operations exist, namely, maximum and average pooling. In the maximum pooling operation, the largest value in the mask is used as an output, as displayed in Fig. 6. In the average pooling operation, the average of all the values in the mask is used as the output, as depicted in Fig. 7.
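Both pooling operations can be expressed in one small routine. This is a sketch of the non-overlapping windows described above (the window moves by its own size); the function name is ours.

```python
import numpy as np

def pool2d(x, size, mode="max"):
    """Non-overlapping pooling: each element is covered by the mask once."""
    h, w = x.shape[0] // size, x.shape[1] // size
    out = np.zeros((h, w))
    for r in range(h):
        for c in range(w):
            win = x[r * size:(r + 1) * size, c * size:(c + 1) * size]
            out[r, c] = win.max() if mode == "max" else win.mean()
    return out
```

A 4 × 4 input with a 2 × 2 mask thus reduces to a 2 × 2 output, taking either the largest value (Fig. 6) or the average (Fig. 7) of each window.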

Activation function
The ReLU function is used as the activation function in the convolutional and fully connected layers of the proposed CNN architecture, and the final outputs are obtained through the softmax layer. The ReLU function is a nonlinear function: if the input a is greater than 0, the output is equal to a; conversely, if the input a is less than or equal to 0, the output is 0. The formula for the ReLU function is given as follows:

f(a) = max(0, a).
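The two output-side functions above are straightforward to sketch (the numerically stabilized softmax form is a standard implementation choice, not something the paper specifies):

```python
import numpy as np

def relu(a):
    """f(a) = max(0, a), applied element-wise."""
    return np.maximum(a, 0.0)

def softmax(z):
    """Normalize the final layer outputs into class probabilities."""
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()
```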

Experimental Results
To evaluate the proposed multi-input CNN, the AlexNet network and two face datasets, namely, the CIA and MORPH datasets, were used. The image data were augmented by performing brightness reduction, brightness increase, left rotation, and right rotation operations on the two datasets, as displayed in Fig. 8. Therefore, the amount of data after augmentation was five times that of the original data. In the experiments, three sets of training and testing data were randomly generated from the data for cross-validation. The average values obtained for the optimized parameters over the three experiments were used to ensure overall fairness.
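The five-fold augmentation described above can be sketched as follows. The brightness offsets and image sizes here are assumptions for illustration; the paper does not specify the exact augmentation parameters.

```python
import numpy as np

def augment(img):
    """Five-fold augmentation: original, darker, brighter, rotate left,
    rotate right (the +/-30 brightness offsets are hypothetical)."""
    return [
        img,
        np.clip(img - 30, 0, 255),   # brightness reduction
        np.clip(img + 30, 0, 255),   # brightness increase
        np.rot90(img, 1),            # rotate left
        np.rot90(img, -1),           # rotate right
    ]

# Toy stand-in for a batch of grayscale face images.
faces = [np.full((8, 8), 128, dtype=np.int16) for _ in range(10)]
augmented = [v for img in faces for v in augment(img)]
```

Because the original image is kept alongside the four transformed copies, the augmented set is exactly five times the size of the original set.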

Parameter definition in the UED method
To obtain the optimized parameter structure of the multi-input AlexNet, the UED method and multiple regression analysis were used. In this subsection, a dual-input CNN is used as an example. Table 5 shows that the improvement factors selected in AlexNet were the kernel size, stride, and padding of the convolution kernels in the first and fifth convolutional layers. The UED table presented in Table 6 was obtained using the steps described in Sect. 2.

CIA dataset
The CIA dataset is a small facial image database that was collected by our laboratory. The database mainly comprises the facial images of Chinese individuals aged between 6 and 80 years, as displayed in Fig. 9. The amount of data obtained after augmentation was five times the amount of original data, as presented in Table 7. Under the UED method, experiments were performed with 13 sets of parameter values. The optimized parameters and classification results for the CIA dataset are presented in Table 8. The highest experimental accuracy rate (99.60%) was obtained for the second set of experimental parameters. Multiple regression analysis was then performed on the CIA dataset, and an accuracy rate of 99.68% was achieved with the optimized parameters. Thus, the UED method improved the accuracy by 0.08%.

MORPH dataset
The MORPH dataset is a collection of the facial images of people aged between 16 and 77 years, as displayed in Fig. 10. The numbers of images before and after augmentation for the MORPH database are presented in Table 9. In the MORPH database, the average interval between successive image captures for each person is 164 days; the database does not include any continuously shot images. The 13 experimental results obtained with the UED method are presented in Table 10. The accuracy of the seventh experiment was 98.68%, which was the highest among all the experiments. The accuracy of the optimized parameter combination was 99.06%, which was 0.38% higher than that of the seventh experiment.

Multi-input CNN
In this subsection, we discuss the accuracy of the proposed multi-input CNN architecture. Figure 11 shows a three-input AlexNet architecture. The definition of the parameters and the UED table for this architecture are the same as those for the dual-input architecture in Sect. 4. The training and testing data comprise images from the MORPH dataset. Table 11 presents the UED results for the three-input AlexNet. The average accuracy of the optimal parameter combination for this architecture was 99.16%, which is higher than the highest accuracy in the UED table by approximately 0.47%. Figure 12 shows the architecture of a four-input AlexNet. Table 12 presents the UED results for the four-input CNN. The average accuracy of the optimal parameter combination for this CNN was 99.19%, which is higher than the highest accuracy in the UED table by approximately 0.57%. These results indicate that the proposed architecture can arbitrarily increase the number of inputs to form a multi-input AlexNet. The experimental results for the optimized parameters in the UED method are displayed in Fig. 13. The accuracy rates of the dual-input, three-input, and four-input AlexNet were 99.06, 99.17, and 99.20%, respectively. The accuracy increased with the number of inputs; however, the overall network speed decreased and the hardware costs increased.

Effect of various networks
In the aforementioned experiments, we adopted AlexNet. Because the AlexNet network is deeper than the LeNet network, it achieves a higher accuracy rate than LeNet; in addition, AlexNet is shallower than GoogLeNet. We replaced AlexNet with LeNet and GoogLeNet in the dual-input CNN. Table 13 indicates that the average accuracies of the LeNet, GoogLeNet, and AlexNet networks were 99.18, 99.07, and 99.30%, respectively. As displayed in Fig. 14, AlexNet had a higher average accuracy than the other two architectures. In theory, the accuracy rate of GoogLeNet should be higher than that of AlexNet; however, because GoogLeNet has many more layers than AlexNet, image features gradually vanish in its deeper architecture, which lowers its accuracy on these datasets.
Different features can be obtained using multi-input CNNs. Determining which network features improve the accuracy of the entire system is a crucial task, and suitably integrating these features is also critical. Because the individual features obtained by multiple networks provide different interpretations of the same image, some features may allow the network to determine the correct result, whereas others may cause serious misjudgment. To solve this problem, a multilayer network fusion mechanism is added to the output of a feature network. This mechanism partially enhances or suppresses the original output features before performing the fusion operation; thus, multiple features can be combined to improve the overall recognition rate. Many researchers (26)(27)(28) have presented feature fusion techniques for multi-input CNNs. In this study, we compared the proposed method with these methods. (26)(27)(28) Table 14 presents the accuracy results of the proposed method and other methods for the MORPH and CIA datasets. The experimental results indicate that the proposed method has a higher average accuracy than the other methods (26)(27)(28) for both datasets.
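One common way to realize the enhance-or-suppress idea described above is a learned sigmoid gate per branch; the sketch below is our hypothetical illustration of that idea, not the fusion mechanism of the cited works.

```python
import numpy as np

def gated_fusion(feats, w_gate):
    """Scale each branch's feature vector by a gate in (0, 1) before fusion.

    A gate near 1 enhances a branch's features; a gate near 0 suppresses
    them, so misleading branches contribute less to the fused vector.
    """
    fused = []
    for f, w in zip(feats, w_gate):
        g = 1.0 / (1.0 + np.exp(-(w @ f)))   # sigmoid gate for this branch
        fused.append(g * f)
    return np.concatenate(fused)

rng = np.random.default_rng(1)
feats = [rng.standard_normal(8) for _ in range(3)]     # three branch outputs
gates = [rng.standard_normal((1, 8)) for _ in range(3)]  # hypothetical gate weights
out = gated_fusion(feats, gates)
```

In a trained network, the gate weights would be learned jointly with the rest of the model so that unhelpful branches are suppressed automatically.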

Results of single-and dual-input networks
This subsection presents the results of single-and dual-input networks for the CIA dataset. Table 15 lists the experimental results of single-and dual-input AlexNet architectures. The average accuracy of the dual-input AlexNet was 1.14% higher than that of the single-input AlexNet.

Experimental results for other datasets when using the proposed method
The CIFAR-10, (29) CIFAR-100, (29) Birdsnap, (30) Stanford cars, (31) Flowers, (32) FGVC aircraft, (33) Oxford-IIIT pets, (34) and Food-101 (35) datasets were also used to verify the proposed method. Table 16 presents the experimental results for these datasets. The proposed method exhibited a suitable average accuracy for these well-known datasets.

Conclusions and Future Work
To overcome the drawbacks of single-input network architectures, a multi-input CNN based on the UED method is proposed in this paper for gender classification applications. The proposed multi-input CNN uses multiple CNNs to obtain output results through individual training and concatenation. To avoid using trial and error for determining the architecture parameters of the proposed network, a UED was used. In a UED, multiple regression analysis is used to determine the optimal parameters. The accuracy rates of the dual-input, three-input, and four-input AlexNet were 99.06, 99.17, and 99.20%, respectively. The average accuracies obtained when using the LeNet, GoogLeNet, and AlexNet networks with a dual input were 99.18, 99.07, and 99.30%, respectively.