Dual-Input Control Interface Based on Deep Neural Network Image/Speech Recognition

The objective of this study was to design a dual-input video/audio control interface consisting of two input systems, hand posture recognition and speech recognition, so that specific hand postures or voice commands can be used for control without the need for wearable devices. For hand posture recognition, the original video camera images were preprocessed, the face in the image was identified as the reference point using an Adaboost classifier, and an image of a specific size was selected as the recognition input to increase the recognition speed. A neural network comprising convolutional, activation, max-pooling, and fully connected layers was used to classify and recognize the hand posture images. Long short-term memory (LSTM), a variant of the recurrent neural network (RNN), was used to achieve speech recognition. Speech features were extracted by preprocessing, and a fast Fourier transform (FFT) was used to convert the signals from the time domain to the frequency domain. The frequency-domain signals were then passed through triangular bandpass filters and a discrete cosine transform to derive Mel-frequency cepstral coefficients (MFCCs) as the speech eigenvalue input. The speech feature parameters were then input to the LSTM neural network to make predictions and achieve speech recognition. Experimental results showed that the image/speech dual-input control interface had good recognition capability for both hand postures and voice commands.


Introduction
The convolutional neural network (CNN) (1) was first proposed by LeCun in 1989. A CNN is a type of deep learning model that learns features in layers. However, owing to the limited performance of computer hardware at the time, the CNN concept could not be effectively realized. Modern graphics processing units (GPUs), by contrast, are very powerful, and their substantial computing power has been effectively applied to deep-learning computation. As a result, CNN development has flourished, and CNNs are widely used in many fields, such as object detection (2) and face recognition. Current applications focus on the development of artificial intelligence, and natural language processing and speech recognition have been established. (3) The successful application of LeNet-5 (4) in handwritten character recognition also drew the attention of the academic community to CNNs. Features learned by CNNs display stronger discriminative and learning ability than hand-designed features. Long short-term memory (LSTM) (5) was proposed by Hochreiter and Schmidhuber in 1997, and its recurrent structure has been widely applied in speech recognition and sentiment analysis. A CNN and an LSTM were used as the main structures in this study in view of their excellent image and speech feature recognition capabilities.

Hardware and System Environment
The aim of this study was to enrich the control interface of a quadrotor system with gesture and voice control signals; the system structure is shown in Fig. 1. It consists of two parts, the ground computing terminal and the quadrotor flight controller, between which signals are transmitted wirelessly over Wi-Fi.

Hand Posture Recognition
The images captured using a video camera were processed and then classified using a CNN, which consisted of convolutional, activation, max-pooling, and fully connected layers, to establish hand posture recognition from images.

Image preprocessing
The preprocessing of the images included conversion from color to grayscale. The Adaboost classifier (6,7) was used to identify the face in the image as a reference point, and an image of a specific size was selected as the recognition input. The 45° Haar rectangle features (8) proposed by Lienhart and Maydt were used to extract face sample features, which were fed to the classifier for training. After training, a robust cascade classifier (9) formed from multiple weak classifiers was obtained, as shown in Fig. 2.
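As a rough illustration of this preprocessing step, the following Python sketch uses OpenCV's pretrained Haar-cascade (Adaboost) face detector in place of the classifier trained in this study; the 40 × 40 crop size follows the setting reported later in this paper, while the cascade file and image names are assumptions.

```python
# Sketch: grayscale conversion + Adaboost (Haar cascade) face detection,
# using OpenCV's pretrained cascade as a stand-in for the trained classifier.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

frame = cv2.imread("frame.png")                         # 640 x 480 camera image
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)          # color -> grayscale

faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
if len(faces) > 0:
    x, y, w, h = faces[0]                               # face as reference point
    roi = cv2.resize(gray[y:y + h, x:x + w], (40, 40))  # fixed-size input image
```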

CNN network structure
The CNN architecture used in this paper is shown in Table 1.
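Table 1 is not reproduced here; the following Keras sketch shows a minimal network built from the layer types named above (convolutional, activation, max-pooling, and fully connected). The filter counts, kernel sizes, and the 40 × 40 grayscale input are illustrative assumptions; the six output classes match the six instructional postures used in the experiments.

```python
# Minimal CNN sketch with the layer types named in the text; the exact
# values in Table 1 are not reproduced, so all sizes here are assumptions.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(40, 40, 1)),           # grayscale recognition input
    layers.Conv2D(32, (3, 3)),                # convolutional layer
    layers.Activation("relu"),                # activation layer
    layers.MaxPooling2D((2, 2)),              # max-pooling layer
    layers.Conv2D(64, (3, 3)),
    layers.Activation("relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),     # fully connected layer
    layers.Dense(6, activation="softmax"),    # six posture classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```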

Speech Recognition
Speech commands were input through a microphone and converted into digital signals by an analog-to-digital converter (ADC). After the features had been extracted, the LSTM (10) variant of the recurrent neural network (RNN) (11) was used to perform speech recognition. Speech recognition was accomplished through voice signal preprocessing, speech eigenvalue extraction, and the RNN.

Voice signal preprocessing
Voice signals were preprocessed to extract the required eigenvectors. Preprocessing included digital sampling, pre-emphasis, framing, and windowing steps, as shown in Fig. 3.
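A minimal NumPy sketch of these preprocessing steps is given below; the pre-emphasis coefficient (0.97) and the 25 ms frame / 10 ms hop at a 16 kHz sampling rate are common defaults assumed for illustration, not values taken from the paper.

```python
# Sketch of the preprocessing chain: pre-emphasis -> framing -> windowing.
import numpy as np

def preprocess(signal, frame_len=400, hop=160, alpha=0.97):
    # Pre-emphasis boosts high frequencies: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Framing: split the signal into overlapping frames
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    # Windowing: apply a Hamming window to each frame
    return frames * np.hamming(frame_len)
```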

Speech feature extraction
The signals were converted to the frequency domain using a discrete Fourier transform so that their energy distribution could be observed, and Mel-frequency cepstral coefficients (MFCCs) (12-14) were used to extract features. Log energy and the delta cepstrum were added to increase the diversity of the eigenvalues. The MFCC process is shown in Fig. 4. (15)
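The following librosa-based sketch illustrates this feature pipeline (FFT, Mel-scale triangular filterbank, and DCT) together with the log energy and delta cepstrum; the choice of 13 coefficients and the frame settings are assumptions rather than the paper's reported values.

```python
# Sketch: MFCCs plus log energy and delta cepstrum as the feature input.
import numpy as np
import librosa

y, sr = librosa.load("command.wav", sr=16000)        # one 2-s voice command
# FFT -> Mel-scale triangular filterbank -> log -> DCT = MFCCs
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=2048, hop_length=512)
# Log energy per frame, computed on the same STFT grid as the MFCCs
power = np.abs(librosa.stft(y, n_fft=2048, hop_length=512)) ** 2
log_energy = np.log(power.sum(axis=0, keepdims=True) + 1e-10)
delta = librosa.feature.delta(mfcc)                  # delta cepstrum
features = np.vstack([mfcc, log_energy, delta])      # shape (27, n_frames)
```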

LSTM
LSTM was employed to avoid the vanishing gradient problem. The concepts of the memory cell and gate were incorporated into the RNN. Figure 5 shows a single LSTM memory block, (16) which includes three gates: input, output, and forget. The gates are all nonlinear summing units. The main function of a gate is to determine whether to accept an input signal, forget the past status, or output a predicted result. The status update process can be represented by the following equations:

$$i_t = \sigma\!\left(w_i[h_{t-1}, x_t] + b_i\right) \quad \text{(input gate)}$$
$$o_t = \sigma\!\left(w_o[h_{t-1}, x_t] + b_o\right) \quad \text{(output gate)}$$
$$f_t = \sigma\!\left(w_f[h_{t-1}, x_t] + b_f\right) \quad \text{(forget gate)}$$
$$\tilde{c}_t = \tanh\!\left(w_c[h_{t-1}, x_t] + b_c\right) \quad \text{(new memory cell)}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \quad \text{(update memory cell)}$$
$$h_t = o_t \odot \tanh(c_t) \quad \text{(update hidden cell)}$$

Here, $t$ represents the time step, $x_t$ is the input vector of the LSTM block, $\sigma(\cdot)$ is the logistic sigmoid function, $\tanh(x)$ represents the hyperbolic tangent function, $w_i$, $w_o$, $w_f$, and $w_c$ represent the weight matrices with corresponding bias vectors $b_i$, $b_o$, $b_f$, and $b_c$, $\tilde{c}_t$ represents the candidate value, $c_t$ represents the cell status output, and $c_{t-1}$ represents the status output of the previous unit.
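A direct NumPy transcription of these update equations for a single time step might look as follows; the dictionary layout of the weight and bias terms is purely illustrative.

```python
# One LSTM time step implementing the gate equations above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    z = np.concatenate([h_prev, x_t])         # [h_{t-1}, x_t]
    i_t = sigmoid(W["i"] @ z + b["i"])        # input gate
    o_t = sigmoid(W["o"] @ z + b["o"])        # output gate
    f_t = sigmoid(W["f"] @ z + b["f"])        # forget gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])    # new (candidate) memory cell
    c_t = f_t * c_prev + i_t * c_tilde        # update memory cell
    h_t = o_t * np.tanh(c_t)                  # update hidden cell
    return h_t, c_t
```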
After speech was input, eigenvectors were extracted through speech preprocessing, and speech features were extracted using MFCCs. Finally, the processed speech features were placed in the LSTM to carry out recognition. Recognition results were then converted into appropriate control signals.
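As a sketch of this recognition stage, a small Keras model of the following kind could consume the MFCC feature sequences and output one of the six voice commands; the layer sizes are assumptions, and the feature matrices from the extraction step (shaped coefficients × frames) must be transposed to (time steps, features) before being fed in.

```python
# Sketch: LSTM classifier over MFCC feature sequences for six commands.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(None, 27)),          # (time steps, MFCC+energy+delta)
    layers.LSTM(64),                        # LSTM memory blocks
    layers.Dense(6, activation="softmax"),  # left/right/front/behind/up/down
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# The predicted class is then converted into the appropriate control signal.
```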

Experimental Results
The experiments were divided into two parts. The first part concerned hand posture recognition: input images were converted to grayscale, and the Adaboost face recognition classifier was used to identify a face in the image as a reference point. The second part concerned speech recognition: voice signals underwent preprocessing to derive speech eigenvectors, MFCCs were then used to extract the speech features, and the LSTM was used to perform recognition.

Experimental results of hand posture recognition
The input images were 640 × 480 pixels in size. About 50000 images were used for training and 5000 images were used for testing. Six different instructional postures were used, as shown in Figs. 6-8.
Since the preset device in this paper was a quadrotor drone, the video camera was situated 2 m above ground level and 3.5 m from the user, as shown in Fig. 9.
After the input image had been converted to grayscale, the Adaboost face recognition classifier was used to identify the face in the image as the reference point, and a 40 × 40 region around the face was selected as the input image for turning angle recognition, as shown in Fig. 10. In addition to improving the recognition rate, this also reduced the computing time. As shown in Fig. 11, the samples used for these datasets came from 10 people; the recognition rate may decrease if more people need to be recognized. The image on the left is the uncut original image, from which the face and upper body were selected using Adaboost.
The posture recognition interface used in this study had two CNN structures; the batch size was 50 and each epoch had 1000 iterations. The first CNN determines the angular view of the body presented to the camera on the basis of the direction of the face, as shown in Fig. 12 and Table 2. The second CNN predicts the hand posture output, as shown in Fig. 13 and Table 3. Figures 14-16 show actual recognition images. In the recognition interface windows, the pictures show the front, left oblique 45°, and right oblique 45° postures, respectively, and the corresponding posture results appeared in the upper right window after recognition was completed.
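An illustrative Python sketch of this two-stage flow is shown below; the model files, label names, and input sizes are hypothetical placeholders rather than the authors' actual implementation.

```python
# Sketch: two-stage recognition (viewing angle, then hand posture).
import numpy as np
import tensorflow as tf

angle_cnn = tf.keras.models.load_model("angle_cnn.h5")      # hypothetical file
posture_cnn = tf.keras.models.load_model("posture_cnn.h5")  # hypothetical file
ANGLES = ["front", "left_45", "right_45"]                   # assumed labels

def recognize(face_roi, body_roi):
    """face_roi and body_roi: grayscale crops already resized to each
    network's input size (e.g., 40 x 40 for the face crop)."""
    # Stage 1: viewing angle of the body, inferred from the face direction
    angle = ANGLES[int(np.argmax(angle_cnn.predict(face_roi[None, ..., None])))]
    # Stage 2: hand posture, later mapped to a quadrotor control signal
    posture = int(np.argmax(posture_cnn.predict(body_roi[None, ..., None])))
    return angle, posture
```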

Experimental results of speech recognition
The training datasets of the speech recognition system contained 1187 data entries covering six voice instructions: left, right, front, behind, up, and down. Ten students recorded each instruction 20 times to make the training datasets, and the recording time was 2 s. To increase the recognition rate, the samples included loud and soft, fast and slow, and low- and high-pitched voices. The original datasets contained 1200 samples; after the defective samples had been removed, 1187 samples remained and were used as training data. Figure 17 shows the signals of the original voice instructions. Speech feature extraction involved preprocessing the speech signals and extracting features with MFCCs. Figure 18 shows the spectrograms of the voice instructions converted to the frequency domain. In this study, the LSTM, which improves on the basic RNN, was used as the speech recognition system. Figures 19-21 show flowcharts of the LSTM, RNN, and CNN, respectively. Table 4 shows the approximate accuracy rates of the LSTM, RNN, and CNN in speech recognition. Because the input voice signals used in this study were single words with no correlation to preceding or succeeding text, the RNN memory function could not be used to full effect. Figure 22 shows the image patterns obtained from actual speech recognition.

Conclusion
In this study, a dual-input control interface based on deep neural networks was implemented. A CNN and an LSTM were used to achieve hand posture and voice recognition, respectively, making control by a specific posture or voice command possible without the need for a wearable device. For hand posture recognition, images were used to train the CNN parameters on the data server, and a Raspberry Pi 3 Wi-Fi transmission module was used to transmit images to the video processing device, where hand posture recognition was performed. For speech recognition, speech data were likewise used to train the LSTM parameters on the training server; voice signals were preprocessed, MFCCs were used to obtain the speech feature parameters, and the LSTM was used to make predictions.