Speech Processing Based on Hidden Markov Model and Vector Quantization Techniques Applied to Internet of Vehicles

In this study, we develop an intelligent device that applies speech processing functions in an Internet of Vehicles (IoV). Voice-based interaction improves driving safety and in-time awareness of the vehicle status. This interaction is achieved through speech recognition and response generation between the driver and the smart vehicle, so the driver can stay focused on driving. The proposed speech processing is divided into three portions: (1) voice signal preprocessing, (2) speech recognition, and (3) speaker recognition. Firstly, speech signal preprocessing consists of five steps, namely, sampling, pre-emphasis, framing, windowing, and mel-frequency cepstral coefficient (MFCC) extraction, so that the characteristic parameters of the speech signal can be extracted. Secondly, the speech model is built via the hidden Markov model (HMM), and the Viterbi algorithm is used to search for the best sequence in the model to achieve speech recognition. Finally, we use the Linde–Buzo–Gray (LBG) algorithm in vector quantization (VQ) to train the speaker model, and then use cosine similarity to achieve speaker recognition. The proposed speech processing functions have been validated experimentally, and the results demonstrate that drivers can easily control the IoV system via voice commands. In addition, the system distinguishes different speakers and provides the corresponding usage privileges, which improves driving safety and in-time awareness of the general vehicle status.


Introduction
The Internet of Vehicles (IoV) is an integrated network with wireless communication and information exchange to support traffic management, dynamic information services, and vehicle control. An intelligent transportation system is a typical application of the IoV.
Most modern transportation vehicles are equipped with communication equipment [such as mobile devices, global positioning system (GPS) equipment, and embedded computers] and many sensors, which allow them to monitor, communicate, and process data. The mobile system that facilitates vehicle-to-vehicle, vehicle-to-road, vehicle-to-human, and vehicle-to-sensor communications is realized by the use of communication protocols such as HTTP, TCP/IP, SMTP, and WAP. Both driving safety and convenience are improved by such a system, which is the concept of the IoV. (1,2) Riding the trend of the Internet of Things (IoT), Tesla, the US electric automaker, has been adding IoT systems to its automobiles for several years and has also developed self-driving cars with the help of machine vision. (3) Samsung Electronics acquired Harman International Industries, a US automotive electronics and audio company, in 2016 and announced that it will invest 100 billion dollars in the IoV by 2025. It is estimated that 75% of cars worldwide will be connected to the internet by 2020, creating huge IoV business opportunities.
A smart vehicle will be integrated with the audio and speech system for spoken-command recognition, voice-based driver identification, and text-to-speech synthesis applications. Voice-based interaction provides hands-free, eyes-free control of the GPS system, radio, and smartphone, and allows a driver to focus on driving. These capabilities are achieved by spoken-command recognition, speech understanding, and response generation. In addition, voice-based driver identification enables the smart vehicle to respond to the driver's critical commands and ignore commands from unauthorized speakers. Therefore, the main goal of this paper was the design of an IoV system capable of voice recognition and speaker identification. There are two main voice recognition methods. The first is dynamic time warping (DTW), (4)(5)(6) which can adjust to voice durations of different lengths and thus solves the alignment problem in recognition. In the second method, hidden Markov model (HMM) (7,8) statistics are used to describe voice signals and derive a probabilistic model, making the method suitable for continuous voice recognition. There are two main speaker identification methods: the Gaussian mixture model (GMM) (9)(10)(11) and data clustering. (12)(13)(14)(15) In the first method, the feature parameters are modeled by a GMM, and long-term statistics are then taken from the model during recognition. This method has a high recognition rate but is computationally intensive. In the second method, a model is generated from voice feature parameters by clustering. This method has a lower recognition rate, and its computation time varies with codebook size. This paper has five sections. In Sect. 1, the research motivation and background are introduced, along with a voice processing literature review, followed by the research objectives and ideas. The voice processing system architecture and related hardware used in this paper are discussed in Sect. 2.
The preprocessing of voice signals, feature parameter extraction, model training, and voice recognition methods are presented in Sect. 3. Speaker model training methods and speaker identification are described in Sect. 4. The recognition results of the voice processing system and its application to the IoV are discussed and explored in Sect. 5.

System Architecture and Hardware
In this study, a voice-based processing system is designed and its applications to the IoV are developed. The hardware system architecture adopted in this study is shown in Fig. 1 and includes the client, the server, and the IoV. The user opens a web page on the voice server from a smartphone, tablet, or in-vehicle device, records voice commands with the built-in microphone of the client device, and transmits them via the WebSocket protocol to the server for decision and execution. The server functions include a user interface, voice processing, and IoT control, and were coded using the HTML5 and JavaScript web development languages. Voice processing turns the signals (received under the WebSocket protocol) into commands, which are then passed to the IoV server and used to control IoV devices. In the IoV network, interfacing and data processing are implemented on an Arduino Yun; the sensors and other electronic devices allow data to be uploaded to a cloud server or vehicle devices to be turned on or off.

Extraction of voice features
Voice signal processing involves converting the continuous analog signals of the human voice into discrete digital signals. Further processing by computers facilitates the communication between humans and computers. The process is divided into voice signal preprocessing and voice feature extraction. The four steps involved in voice signal preprocessing, (16)(17)(18) digital sampling, pre-emphasis, framing, and windowing, are shown in Fig. 2.
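The four preprocessing steps above can be sketched in NumPy as follows. The frame length (400 samples, i.e., 25 ms at a 16 kHz sampling rate), hop size (160 samples), and pre-emphasis coefficient (0.95) are common illustrative choices, not necessarily the exact settings used in this study.

```python
import numpy as np

def preprocess(signal, frame_len=400, hop=160, alpha=0.95):
    """Pre-emphasis, framing, and Hamming windowing of a sampled voice signal."""
    # Pre-emphasis boosts high frequencies: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Framing splits the signal into short overlapping frames
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    frames = np.stack([emphasized[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    # Windowing tapers each frame to reduce spectral leakage at the edges
    return frames * np.hamming(frame_len)

# One second of a 440 Hz tone sampled at 16 kHz
t = np.arange(16000) / 16000.0
windowed = preprocess(np.sin(2 * np.pi * 440 * t))
print(windowed.shape)  # (98, 400)
```

Each 400-sample row is then ready for MFCC feature extraction.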
The human hearing range lies between the frequencies of 20 and 20000 Hz. In fact, humans are more sensitive to low-frequency sounds and thus find it easier to discern distinctions at low frequencies. To accommodate this auditory property, the MFCC method (18)(19)(20) was employed to extract the voice signal features, as seen in Fig. 3.
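A minimal MFCC extractor following the flow of Fig. 3 (power spectrum, mel-scale triangular filterbank, log energies, discrete cosine transform) might look as follows. The FFT size, number of filters, and number of cepstral coefficients are typical defaults and are assumptions, not necessarily the settings of our system.

```python
import numpy as np

def hz_to_mel(f):  # the mel scale compresses high frequencies
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frames, sr=16000, n_fft=512, n_filters=26, n_ceps=13):
    """Mel-frequency cepstral coefficients from windowed frames."""
    # Power spectrum of each windowed frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular filters spaced evenly on the mel scale (denser at low frequencies)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log filterbank energies, decorrelated by a DCT-II
    log_e = np.log(power @ fbank.T + 1e-10)
    k = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * k + 1) / (2 * n_filters))
    return log_e @ dct.T

rng = np.random.default_rng(0)
frames = rng.standard_normal((98, 400))  # stand-in for windowed frames
feats = mfcc(frames)
print(feats.shape)  # (98, 13)
```

With 13 coefficients per frame (optionally extended with delta and delta-delta terms to 39 dimensions, as in the codebook experiments later in the paper), each frame is reduced to a compact feature vector.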

HMM-based voice recognition
The goal of voice recognition is to convert the human voice into corresponding words through computers. Two main methods of voice recognition are DTW (4,5) and HMM. (7,8) Since the recognition with DTW is achieved by adjusting the time axes of two voices, a shared voice model is difficult to build. On the other hand, voices modeled through a long-term statistical process as in the HMM method deliver better performance in continuous voice recognition. When building an HMM acoustic model, phrases, words, phones, or phonemes can be used as units. Several Gaussian functions are also used to synthesize the probability density function. To approach a nearly arbitrary density distribution smoothly, a GMM is used in voice recognition to build an HMM model with the phone as a basic unit.
Since the distribution of voice signal features cannot be represented by a simple probability distribution, a GMM is used to describe the output probability in this paper, as shown by

b_j(o) = Σ_{m=1}^{n} c_{jm} N(o; μ_{jm}, Σ_{jm}), (1)

where o stands for the feature parameter vector, μ_{jm} is the mean vector of the m-th Gaussian distribution in state j, Σ_{jm} is its covariance matrix, c_{jm} is its mixture weight, and n is the total number of Gaussian components. The forward-backward algorithm is used to train the HMM model to start with and, for the known model λ, forward variables α_t(i) are defined as

α_t(i) = P(o_1, o_2, ..., o_t, q_t = s_i | λ). (2)

α_t(i) is the probability of observing the sequence {o_1, o_2, ..., o_t} and being in state s_i at time t. Forward probability can be computed recursively with these variables. Figure 4 shows the concept of forward probability.

1. Initialization

α_1(i) = π_i b_i(o_1), 1 ≤ i ≤ N. (3)

2. Recursion

α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) a_{ij}] b_j(o_{t+1}), 1 ≤ t ≤ T − 1, 1 ≤ j ≤ N. (4)
Next, backward variables are defined as

β_t(i) = P(o_{t+1}, o_{t+2}, ..., o_T | q_t = s_i, λ). (5)

β_t(i) is the probability of the partial observation sequence {o_{t+1}, o_{t+2}, ..., o_T} after time t, given state s_i at time t. β_t(i) can also be computed recursively using the following formulas. Figure 5 shows the concept of backward probability.

1. Initialization

β_T(i) = 1, 1 ≤ i ≤ N. (6)

2. Recursion

β_t(i) = Σ_{j=1}^{N} a_{ij} b_j(o_{t+1}) β_{t+1}(j), t = T − 1, T − 2, ..., 1. (7)

3. Termination

P(O | λ) = Σ_{i=1}^{N} π_i b_i(o_1) β_1(i). (8)
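As a concrete check of the forward and backward recursions above, both passes can be implemented in a few lines of NumPy. Here the emission likelihoods b_i(o_t) are precomputed into a matrix B, and the two-state toy model parameters are purely illustrative.

```python
import numpy as np

def forward(pi, A, B):
    """alpha[t, i] = P(o_1..o_t, q_t = s_i | lambda); B[t, i] = b_i(o_t)."""
    T, N = B.shape
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[0]                       # initialization
    for t in range(1, T):                      # recursion
        alpha[t] = (alpha[t - 1] @ A) * B[t]
    return alpha

def backward(A, B):
    """beta[t, i] = P(o_{t+1}..o_T | q_t = s_i, lambda)."""
    T, N = B.shape
    beta = np.zeros((T, N))
    beta[-1] = 1.0                             # initialization
    for t in range(T - 2, -1, -1):             # recursion
        beta[t] = A @ (B[t + 1] * beta[t + 1])
    return beta

# Toy 2-state model with precomputed emission likelihoods B[t, i]
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.2], [0.1, 0.8], [0.8, 0.3]])
alpha, beta = forward(pi, A, B), backward(A, B)
# Both terminations yield the same evaluation probability P(O | lambda)
print(alpha[-1].sum(), (pi * B[0] * beta[0]).sum())
```

The equality of the two printed values is exactly the evaluation identity used in the next paragraph.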
Once forward variables α_t(i) and backward variables β_t(i) are obtained, the evaluation problem can be solved as

P(O | λ) = Σ_{i=1}^{N} α_t(i) β_t(i), for any 1 ≤ t ≤ T. (9)

The Viterbi dynamic programming algorithm (17) was proposed by Andrew Viterbi in 1967. Figure 6 shows the concept, where the X- and Y-axes indicate the time and the HMM states, respectively. Each circle represents the observed output probability of a state at a given time, and each line between circles shows a transition probability. The Viterbi algorithm finds the best path from the lower left to the upper right.
The variables δ_t(i) are defined first as

δ_t(i) = max_{q_1, ..., q_{t−1}} P(q_1, ..., q_{t−1}, q_t = s_i, o_1, ..., o_t | λ). (10)

δ_t(i) is the highest probability of any single state path that accounts for the first t observations and ends in state s_i. Computing δ_t(i) with iteration and arranging the form, we obtain

δ_t(j) = max_{1 ≤ i ≤ N} [δ_{t−1}(i) a_{ij}] b_j(o_t). (11)

To obtain the best state sequence, ψ_t(j) is used to record the best path at time t and state j. Computation details are as follows.

1. Initial value

δ_1(i) = π_i b_i(o_1), ψ_1(i) = 0, 1 ≤ i ≤ N. (12)

2. Recursion

δ_t(j) = max_{1 ≤ i ≤ N} [δ_{t−1}(i) a_{ij}] b_j(o_t), (13)
ψ_t(j) = argmax_{1 ≤ i ≤ N} [δ_{t−1}(i) a_{ij}], 2 ≤ t ≤ T, 1 ≤ j ≤ N. (14)

3. Termination

P* = max_{1 ≤ i ≤ N} δ_T(i), q_T* = argmax_{1 ≤ i ≤ N} δ_T(i). (15)

4. Backtrack path

q_t* = ψ_{t+1}(q_{t+1}*), t = T − 1, T − 2, ..., 1. (16)
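The four Viterbi steps above can be sketched as follows. The computation is done in the log domain, a standard device to avoid numerical underflow on long sequences; the toy model parameters are again illustrative.

```python
import numpy as np

def viterbi(pi, A, B):
    """Best state path for emission likelihoods B[t, i] = b_i(o_t)."""
    T, N = B.shape
    logA = np.log(A)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = np.log(pi) + np.log(B[0])         # initialization
    for t in range(1, T):                        # recursion
        scores = delta[t - 1][:, None] + logA    # scores[i, j]: from i to j
        psi[t] = scores.argmax(axis=0)           # best predecessor of state j
        delta[t] = scores.max(axis=0) + np.log(B[t])
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()                # termination
    for t in range(T - 2, -1, -1):               # backtrack path
        path[t] = psi[t + 1][path[t + 1]]
    return path

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.2], [0.1, 0.8], [0.8, 0.3]])
print(viterbi(pi, A, B))  # [0 1 0]
```

In the full recognizer, each index in the returned path corresponds to a state of the connected phone models.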
The Baum-Welch re-estimation algorithm is then used to adjust the HMM model parameters λ, so that the observation sequence has the highest probability under the conditions of this model. When refining the model, a simple statistical method would suffice for the re-estimation of parameters if both the observation and state sequences were available. However, in the HMM, the observation sequence is known but the state sequence is not. Therefore, in this study, we used the Baum-Welch re-estimation algorithm, (17) an instantiation of the expectation-maximization (EM) algorithm, to solve this problem. Figure 7 shows the concept of the Baum-Welch re-estimation algorithm. Parameter re-estimation can be represented with forward variables α_t(i) and backward variables β_t(i). For representation convenience, two variables are defined to simplify the formulas. First, the variable ξ_t(i, j) is defined as

ξ_t(i, j) = P(q_t = s_i, q_{t+1} = s_j | O, λ). (17)

Substitute the forward-backward algorithm into Eq. (17) and rearrange the form to obtain

ξ_t(i, j) = α_t(i) a_{ij} b_j(o_{t+1}) β_{t+1}(j) / P(O | λ). (18)

Then another variable γ_t(i) is defined as

γ_t(i) = Σ_{j=1}^{N} ξ_t(i, j). (19)

The parameters of the HMM model can be estimated using the data in Fig. 7 and Eqs. (18) and (19). The computation equations are as follows.
Re-estimation equation of π:

π̄_i = γ_1(i). (20)

Re-estimation equation of a_ij:

ā_{ij} = Σ_{t=1}^{T−1} ξ_t(i, j) / Σ_{t=1}^{T−1} γ_t(i). (21)

Re-estimation equation of b_j:

b̄_j(k) = Σ_{t: o_t = v_k} γ_t(j) / Σ_{t=1}^{T} γ_t(j). (22)

After a series of model training, a voice model with the phone as the basic unit was built. To recognize a voice sentence, the trained models should be connected as shown in Fig. 8.
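One re-estimation pass for π and a_ij can be sketched as below. For brevity the emission matrix B is held fixed rather than re-estimated, so this is an illustrative single EM step under that simplification, not the full training loop used in this study.

```python
import numpy as np

def baum_welch_step(pi, A, B):
    """One EM re-estimation of pi and A from xi and gamma.
    B[t, i] = b_i(o_t) is held fixed here for brevity."""
    T, N = B.shape
    # Forward and backward variables
    alpha = np.zeros((T, N)); beta = np.zeros((T, N))
    alpha[0] = pi * B[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[t + 1] * beta[t + 1])
    P = alpha[-1].sum()                        # evaluation probability P(O | lambda)
    # xi[t, i, j] = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j) / P
    xi = (alpha[:-1, :, None] * A[None] * (B[1:] * beta[1:])[:, None, :]) / P
    gamma = xi.sum(axis=2)                     # gamma[t, i] for t < T-1
    new_pi = gamma[0]                          # Eq. (20)
    new_A = xi.sum(axis=0) / gamma.sum(axis=0)[:, None]   # Eq. (21)
    return new_pi, new_A

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.2], [0.1, 0.8], [0.8, 0.3]])
new_pi, new_A = baum_welch_step(pi, A, B)
print(new_pi, new_A)  # re-estimated parameters remain valid distributions
```

Iterating this step monotonically increases the likelihood of the training observations under the model.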

Vector Quantization (VQ)-based Speaker Identification
A human voice signal carries a great deal of unique information because each person's tone quality is different. The main causes of these differences, which can be used for speaker identification, include vocal organs, the content of speech, and the way the speech is delivered. Although expressions, features, and habits of a language can be learned, the characteristics of a vocal organ cannot be changed or imitated; thus acoustic features are used in most speaker identification methods.

Linde-Buzo-Gray (LBG) algorithm
The LBG algorithm (13) uses clustering to derive a codebook from the trained feature parameters of a voice segment, reducing the amount of information relative to the original voice. The advantages of using this method include low distortion and a low bit rate. The LBG algorithm is basically a k-means clustering algorithm, (21) which divides all voice feature parameters (training vectors) into k clusters; the cluster centers become the codevectors that represent their clusters. Figure 9 shows the algorithm flow chart.

1. Initialization
Give the training vectors X_m (m = 1, 2, ..., M) and the distortion threshold ε, which is a very small positive number.

2. Compute initial center and distortion rate
Set the number of codevectors N to be 1 and compute the mean center of the training samples c_1* and the total distortion rate D_ave*, as shown in Eqs. (23) and (24).

c_1* = (1/M) Σ_{m=1}^{M} X_m (23)

D_ave* = (1/M) Σ_{m=1}^{M} ‖X_m − c_1*‖² (24)
3. Splitting

Multiply each codevector by the coefficient of disturbance, as shown in Eq. (25), where i is the codevector index (i = 1, 2, …, N). Let N = 2N; each codevector c_i* is split into the following two codevectors:

c_i^+ = (1 + δ) c_i*, c_i^- = (1 − δ) c_i*, (25)

where δ is a small disturbance coefficient.

4. Iteration
Let the initial distortion rate D_ave(0) = D_ave* and reset the iteration index k to 0.
(1) Partition the training vectors into N clusters by the nearest-neighbor rule, as shown in Eq. (26):

S_i(k) = {X_m : ‖X_m − c_i(k)‖ ≤ ‖X_m − c_j(k)‖ for all j}. (26)

(2) Update each codevector to the centroid of its cluster and compute the resulting average distortion rate, as shown in Eqs. (27) and (28):

c_i(k+1) = (1/|S_i(k)|) Σ_{X_m ∈ S_i(k)} X_m, (27)

D_ave(k+1) = (1/M) Σ_{m=1}^{M} min_{1≤i≤N} ‖X_m − c_i(k+1)‖². (28)

(3) Increase the iteration index by 1, as shown by

k = k + 1. (29)

(4) Compute and determine whether the relative change in the distortion rate is lower than the distortion threshold, as shown in Eq. (30). If not, go back to Eq. (26).

(D_ave(k−1) − D_ave(k)) / D_ave(k) < ε (30)

5. Termination condition
Check if the number of codevectors has been reached. If not, go back to step 3.
Here, M is the number of feature parameters, X_m the m-th feature parameter, D_ave the distortion rate, and c_i(0) the center point of the i-th codevector in the 0-th iteration.
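The five steps above can be sketched as follows. The code starts from the global mean, doubles the codebook by multiplicative splitting, and refines each stage with k-means iterations; the two-cluster synthetic data, split coefficient, and stopping thresholds are illustrative assumptions.

```python
import numpy as np

def lbg(X, n_codevectors=4, delta=0.01, eps=1e-4):
    """LBG codebook training: start from the global mean, split, then refine."""
    codebook = X.mean(axis=0, keepdims=True)          # step 2: initial center c1*
    while len(codebook) < n_codevectors:
        # Step 3 (splitting): perturb each codevector into two
        codebook = np.vstack([codebook * (1 + delta), codebook * (1 - delta)])
        prev = np.inf
        while True:                                   # step 4: iteration
            d = np.linalg.norm(X[:, None] - codebook[None], axis=2)
            assign = d.argmin(axis=1)                 # nearest-neighbor clustering
            for i in range(len(codebook)):            # centroid update
                if np.any(assign == i):
                    codebook[i] = X[assign == i].mean(axis=0)
            D = d.min(axis=1).mean()                  # average distortion rate
            if (prev - D) / D < eps:                  # relative-change threshold
                break
            prev = D
    return codebook                                   # step 5: enough codevectors

rng = np.random.default_rng(0)
# Two well-separated clusters of 13-dimensional "feature vectors"
X = np.vstack([rng.normal(0.0, 0.1, (50, 13)), rng.normal(5.0, 0.1, (50, 13))])
cb = lbg(X, n_codevectors=2)
print(cb.shape)  # (2, 13)
```

Each returned row is a codeword; a speaker model in the later experiments is simply such a codebook trained on that speaker's MFCC vectors.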

Cosine similarity and Euclidean distance
Cosine similarity, also called cosine distance, is a measure of similarity between two vectors in vector space by measuring the cosine of the angle between them, which will give an indication as to whether they are pointing in the same direction or not. The cosine similarity is 1 when the two vectors are pointing in the same direction. Two vectors at 90° have a cosine similarity of 0 and two diametrically opposed vectors have a cosine similarity of −1. Cosine similarity is dependent on vector direction and independent of vector length.
The law of cosines relates the lengths of the sides of a triangle to the cosine of one of its angles: given the lengths of the three sides of a triangle, the angles can be computed. Let the three sides of a triangle be r, s, and t, and their opposite angles be R, S, and T, respectively. The cosine of angle R is

cos R = (s² + t² − r²) / (2st). (31)

If the two sides s and t of the triangle are vectors S and T, the above equation can be rewritten as

cos θ = Σ_i S_i T_i / (√(Σ_i S_i²) √(Σ_i T_i²)), (32)

where S_i and T_i are the components of the vectors S and T, respectively. The Euclidean distance is commonly used to measure distance. It is defined as the ordinary straight-line distance between two points in space. For speaker identification, the Euclidean distance d(o, v) between the test sentence and each of the N speaker models is computed, as shown in Eq. (33), and the model with the smallest distance is chosen as the identification result.

d(o, v) = (1/T) Σ_{t=1}^{T} min_{1≤i≤c} ‖o_t − v_i‖ (33)
Here, v_i is a codeword, c the codebook size, o the test voice, and T the length of the test voice sequence.
For each test frame o_t, the shortest distance to the codewords in the codebook is computed, and these distances are accumulated over the utterance; the speaker model with the shortest accumulated distance is the identification result. In addition to the shortest-distance computation, a threshold can be applied to decide whether the voice belongs to any of the enrolled speakers at all.
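A minimal sketch of VQ-based identification with both distance measures is given below; the speaker names and random codebooks are hypothetical stand-ins for trained LBG codebooks.

```python
import numpy as np

def euclidean_score(frames, codebook):
    """Average distance from each test frame to its nearest codeword, Eq. (33) style."""
    d = np.linalg.norm(frames[:, None] - codebook[None], axis=2)
    return d.min(axis=1).mean()

def cosine_score(frames, codebook):
    """Average cosine similarity between each frame and its best-matching codeword."""
    f = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    c = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)
    return (f @ c.T).max(axis=1).mean()

def identify(frames, codebooks):
    """Pick the enrolled speaker whose codebook best matches the test utterance."""
    return max(codebooks, key=lambda name: cosine_score(frames, codebooks[name]))

rng = np.random.default_rng(1)
codebooks = {"alice": rng.normal(1.0, 0.1, (8, 13)),   # hypothetical speakers
             "bob": rng.normal(-1.0, 0.1, (8, 13))}
test = rng.normal(1.0, 0.1, (30, 13))  # utterance resembling "alice"
print(identify(test, codebooks))       # alice
```

Note that cosine similarity scores higher for a better match, whereas Euclidean distance scores lower; the verification threshold is applied to whichever measure is chosen.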

Receiver operating characteristic (ROC) curve
In signal detection theory, the ROC curve (22) is a graphical plot for a binary classifier system, used to select the best model or set the best threshold within the same model. When the signals or measurement results are continuous, a discrimination threshold must be used to define the boundary between classes. The Y-axis of the curve represents the true positive rate (TPR), also known as sensitivity, which measures the proportion of correctly identified positives. Its formula is shown as

TPR = TP / (TP + FN). (34)

The X-axis represents the false positive rate (FPR), which is calculated as (1 − specificity); specificity measures the proportion of correctly identified negatives. The formula for FPR is shown as

FPR = FP / (FP + TN), (35)

where TP stands for true positive; FP stands for false positive, a Type I error; TN stands for true negative; and FN stands for false negative, a Type II error. Figure 10 shows the flow chart of the proposed voice processing system in this paper. In Sect. 5, the theories presented above are tested and the resulting system is applied to the IoV. The results of the experimental procedure and analysis are as follows.
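A single operating point of the ROC curve follows directly from the four counts; the scores and labels below are hypothetical verification outputs.

```python
import numpy as np

def roc_point(scores, labels, threshold):
    """TPR and FPR at one decision threshold (label 1 = genuine speaker)."""
    pred = scores >= threshold
    tp = np.sum(pred & (labels == 1))    # true positives
    fn = np.sum(~pred & (labels == 1))   # false negatives (Type II errors)
    fp = np.sum(pred & (labels == 0))    # false positives (Type I errors)
    tn = np.sum(~pred & (labels == 0))   # true negatives
    return tp / (tp + fn), fp / (fp + tn)

scores = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.2])   # hypothetical match scores
labels = np.array([1, 1, 0, 1, 0, 0])               # hypothetical ground truth
tpr, fpr = roc_point(scores, labels, 0.5)
print(tpr, fpr)  # fraction of positives accepted vs. negatives falsely accepted
```

Sweeping the threshold over all observed scores traces the full ROC curve, and the threshold at which FPR equals 1 − TPR gives the equal error rate (EER) reported in Tables 1 and 2.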

Experimental Results
(1) Recognition results of speaker verification (SV)
(2) Recognition results of speaker identification (SI)
(3) Results of system application to IoV

Results of speaker verification experiment
The speaker verification experiment was carried out in a known speaker mode and was aimed at selecting better parameters to be used in the speaker identification system. Twenty model samples and 50 test samples (25 positive and 25 negative) were used.
From Fig. 11, it can be seen that no matter what the codebook size, the ROC curves are close to the diagonal line. This shows that the use of Euclidean distance with 13-dimensional codebooks does not produce discriminating classification. Figure 12 shows the ROC curves produced using cosine distance and 13-dimensional codebooks. They are more discriminating than those in Fig. 11, but the differences among codebook sizes are not significant; the trained features are too close to each other, possibly because the codebook dimension is not large enough. Figure 13 shows the ROC curves produced using Euclidean distance and 39-dimensional codebooks. Although better than those in Fig. 11, the curves of the different codebooks are very similar and close to the diagonal line; there is still not enough discrimination. The classification results are fairly good when the codebook size reaches 256; however, a codebook of that size is larger than the feature parameter file, so those parameters were not used. Figure 14 shows the ROC curves produced using cosine distance and 39-dimensional codebooks. The discrimination improves steadily as the codebook size increases. Tables 1 and 2 show the equal error rates (EERs) computed from the above recognition results. Considering both discrimination and the amount of computation needed, we adopted a 39-dimensional codebook with a size of 64 and cosine distance for recognition, and these settings were incorporated into the speaker identification system.

Results of speaker identification experiment
The speaker identification experiment was carried out using the parameters selected in the previous section. Models for three speakers were built first, and each speaker model was trained with 20 samples; the recognition results are listed in Table 3. The experimental results indicated a greater than 90% hit rate, confirming that the proposed device can provide promising response generation to the drivers' critical commands.

Results of system application for IoV
After the voice processing system was completed, it was tested with the IoV. Figure 15(a) shows the relay module used to simulate car interior lighting, and Fig. 15(b) shows the sensor module. In this module, the optical sensor measures the ambient light level outside the car, which the IoV uses to decide whether to turn the headlights on or off. A temperature/humidity sensor monitors the engine temperature to determine whether it is within the normal range. Pressure sensors, in particular a flexible pressure sensor, were used to measure tire pressure.
The voice server in the operating state is shown in Fig. 16. Commands were given to the voice server through the user interface, as shown in Fig. 17, to access information on the vehicle (see Fig. 18) and control IoV-related devices (see Fig. 19). Figure 20 shows the results of speaker identification. The experimental results showed that the IoV could respond to the drivers' requests and issue warnings to the drivers, such as for headlight control, engine temperature monitoring, unfastened seat belts, and so on.

Conclusion
A voice processing application for an IoV system is presented in this paper and implemented using a Raspberry Pi and Arduino. The system comprises two main parts. The first is voice recognition, built using the MFCC method to extract voice features and an HMM for long-term statistics; the model was optimized with the forward-backward, Viterbi, and Baum-Welch algorithms, and voice recognition was realized on the basis of this optimized voice model. The second part is speaker identification. Speaker models were built by voice encoding with the LBG algorithm, which employs the concept of data clustering; the best thresholds were found through ROC curves, and cosine similarity was then used to implement the functions of speaker identification. The results of voice processing were transmitted to the IoV system, which determines from the recognition results whether the voice commands should be executed. The IoV system can monitor the vehicle situation via sensors and activate the related services automatically on the basis of the IoT concept.