Unsupervised Recurrent Neural Network with Parametric Bias Framework for Human Emotion Recognition with Multimodal Sensor Data Fusion

In this paper, we present an emotion recognition framework based on a recurrent neural network with parametric bias (RNNPB) to classify six basic human emotions (joy, pride, fear, anger, sadness, and neutral). To capture the expressions used to recognize emotions, human joint coordinates, angles, and angular velocities are fused during signal preprocessing. A wearable Myo armband and a Kinect sensor are used to collect human joint angular velocities and angles, respectively. Thus, a combined structure of various modalities of subconscious behaviors is presented to improve the classification performance of RNNPB. Two comparative experiments were performed to demonstrate that the fused data outperform the single-modality sensor data from one person. To investigate the robustness of the proposed framework, we further carried out another experiment with the fused data from several people. According to the recognition results, the RNNPB framework can broadly classify the six types of emotions. These experimental results verified the effectiveness of our proposed framework.


Introduction
Emotions have an important effect on a person's daily life. It is crucial to read emotions accurately and effectively from other people to avoid misunderstanding in interpersonal interactions. The ability of perceiving, understanding, and handling of one's own and others' emotions can be regarded as an expression of emotional intelligence. (1) It is one of the important abilities for individual survival. Moreover, available studies have shown that the skills of emotional intelligence have a high correlation with our mental health. (2) Emotion recognition using computing techniques has attracted increasing attention in recent years. The applications of emotion recognition, such as human-robot interaction (HRI), autonomous driving vehicles, intelligent surveillance systems, and entertainment, are very popular in our lives. (3)(4)(5)(6)(7)(8) For an intelligent robot, endowing it with the ability of emotion recognition and cognition is very helpful for detecting and identifying human emotional states, reasoning, making decisions, and reacting to human expressions appropriately in HRI. For example, the capability of understanding unspoken intentions or feelings exactly through autistic children's physical behavior can help a robot grasp their mental status and adjust the topic timely as needed in the interactive communication process. (9) As a complicated mental state, emotions often result in physical and psychological changes. These changes are associated with many internal and external activities. The internal activities include electroencephalogram (EEG), electrocardiograph (ECG), and electromyography (EMG) signals. The external activities involve body languages that are affected, mediated, and even regulated by emotions. In fact, body language, especially sensorimotor behavior, is usually subconscious; thus, it is rarely deceptive. Therefore, sensorimotor behaviors can be used to distinguish different emotions.
Various types of features have been utilized to recognize emotion successfully by different modeling methods for these features. These features include human facial expression, text, voice intonation, and some physiological signals, such as EEG and electrooculography (EOG). From these cues, one of the most popular features is facial expression. A number of emotion classification methods based on facial expression have been studied. (10)(11)(12)(13) The commonality of these methods is that the features usually are appearance features, geometric features, or a hybrid of appearance and geometric features of the target face. For the appearance features, the information that describes the texture of the face is often extracted from different face or global face regions. (10,11) The geometric features are usually constructed as a feature vector by using the relationship between different facial components. (12) As for the hybrid features, the authors combined the advantages of appearance and geometric features to provide better results in certain cases. (12,13) Although numerous studies mainly focus on facial expressions, there is increasing attention on other channels such as EEG, voice, and text. (14,15) Some advanced approaches have also been explored and developed to prove that multimodal information outperforms a single modality in recognition results. (16) However, most of the previous studies concentrated on supervised methods to recognize emotion using labeled datasets, and few studies focused on unsupervised methods by using human behaviors from ordinary users. As with any supervised learning problem, once we pick a model to classify emotions, it is difficult to obtain a labeled and sufficiently large training set. First, collecting emotion data and tagging those huge data are very troublesome and time-consuming. Second, we have to take someone's true emotion into account to evaluate the effectiveness of the data. 
Since videos or other signals do not always generate corresponding emotions for the user, nobody is sure whether the features are sufficiently reliable before feeding into algorithms. Aside from that, there is still another problem, that is, the interface of the device is often unfriendly and inconvenient to acquire data. To address the problem of tagging huge data, an unsupervised method with generalization ability is a promising solution. The wearable devices that are easy to use for ordinary users provide an alternative way to collect and train the data in our daily life. Hence, a possible solution to address these problems is using the unsupervised method to recognize emotion with more believable emotion features collected by a wearable device.
Wearable devices are usually used to capture sensorimotor behaviors, particularly human joint movements. Human joints from sensorimotor behaviors have been reported to be one of the critical features of emotion. (17) In our previous study, we applied the continuous joint coordinates from human nonverbal behavior to classify five emotions (joy, pride, fear, anger, and sadness) using unsupervised methods. (18) However, we captured behavior using only the Kinect sensor and therefore did not exploit the advantages of wearable devices and multimodal data fusion. Many studies have shown that multimodal information can improve recognition performance. (19) It is also an interesting challenge to merge different modalities of information and apply data fusion technologies to understand emotions. Most researchers have concentrated on integrating auditory and visual modalities to recognize emotion. (20) In contrast, few research efforts have centered on human joint features, such as joint angles and joint angular velocities, in multimodal emotion recognition. Compared with physiological signals and videos, the different modalities of joint information convey more abundant and essential cues of human emotional states.
In this paper, to integrate the spatial and continuous temporal features of human joints, we present an unsupervised framework called the recurrent neural network (RNN) with parametric bias (RNNPB) to perceive six emotions (joy, pride, fear, anger, sadness, and neutral). The Kinect sensor was used to obtain joint coordinates and angles, and the Myo armband was used to collect the joint angular velocity. The main contributions of this paper are summarized as follows.
(1) Compared with other emotion recognition methods, multimodality signals using Kinect and Myo armbands are employed to achieve an easy and fast deployment of the sensors. We also demonstrate that using these two sensors leads to more accurate results in our learning framework.
(2) Human emotions are recognized by bodily behaviors using an unsupervised framework. This framework can overcome the disadvantages of usual supervised emotion recognition methods that need a large number of labeled training data.
(3) Because of the generalization ability of the proposed framework, six untrained emotional behaviors (joy, pride, fear, anger, sadness, and neutral) collected from different people are well classified.

Preliminary
In this section, we first introduce the framework of the proposed method. Some relevant devices for acquiring data are also described in detail.

General overview
The framework of emotion recognition by RNNPB is presented in Fig. 1. A Kinect sensor and a Myo armband were used to capture different modalities of human behaviors. Joint coordinates, angles, and angular velocities of human behaviors were simultaneously collected while people were presenting certain actions in the process of data collection.

Kinect sensor
The Kinect for Windows v2 sensor (Kinect V2) was used in our work. It contains three main components: an RGB color camera, an IR emitter, and a 3D depth sensor, which provide color, IR, and depth images, as shown in Fig. 2(a). With these devices, the Kinect sensor can track up to six human skeletons, capture full-body 3D motion, and recognize simple gestures. Compared with Kinect V1, Kinect V2 can track 25 body joints. In this paper, Kinect V2 was used to collect 3D joint coordinates and angles of human behaviors.

Wearable device (Myo armband)
The Myo armband [Fig. 2(b)] is a body-wearable and portable device produced by Thalmic Labs. It is a lightweight elastic armband consisting of a number of metal contacts. These metal contacts measure the electrical activity in a user's forearm muscles, and the gestures that he/she makes with his/her hands are transmitted to a control computer via Bluetooth. Therefore, the Myo armband allows the user to control his/her cell phones, computers, and other favorite digital technologies wirelessly with hand gestures and motions by reading the electrical activity of muscles and the motion of the arm. Hand gestures and motions are detected by proprietary EMG muscle sensors and a highly sensitive motion sensor, respectively. In this work, the Myo armband is used to capture the joint angular velocity of the human arm.

Data Collection
In this section, the specific process of data collection is described; the collected data include joint coordinates, human upper body joint angles, and joint angular velocities from emotion-aroused human body behavior. Since the Kinect sensor can capture human joint coordinates directly, the details on how to acquire the joint coordinates of a human body are not introduced.

Joint angles captured by Kinect sensor
The Kinect sensor can track the whole skeletons of up to six people within its view at one time. Each skeleton has 25 joints, numbered 0-24 [Fig. 3(a)]. Through the RGB camera and depth sensor of Kinect, we can acquire the 3D coordinates of each joint of a target human body. As shown in Fig. 3(b), skeletons can be tracked regardless of whether the target person is standing or sitting. Note that the Kinect sensor presents the joints as if the person were looking in a mirror. Thus, the "left side" human body joints are on the left in Fig. 3 and the "right side" human body joints are on the right.
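Once we obtain the 3D coordinates of the human joints, the joint angles can be calculated by the space vector approach. Assuming that there are two points $P = (x_1, y_1, z_1)$ and $Q = (x_2, y_2, z_2)$ in 3D space, the distance between these two points can be calculated as
$$d_{PQ} = |\overrightarrow{PQ}| = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}, \qquad (1)$$
where vector $\overrightarrow{PQ} = (x_2 - x_1,\, y_2 - y_1,\, z_2 - z_1)$ and $d_{PQ}$ is the distance between the points P and Q.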
Using the law of cosines, the angle between two vectors can be calculated easily. (21) Similarly, the angle between two joints can be obtained by applying the same method. In the Kinect coordinate system, a joint can be regarded as a vector. Assume that joint 1 is expressed as $\overrightarrow{OA}$ and joint 2 is expressed as $\overrightarrow{OB}$; then, the angle between these two joints can be computed as
$$\theta = \arccos\left(\frac{\overrightarrow{OA} \cdot \overrightarrow{OB}}{|\overrightarrow{OA}|\,|\overrightarrow{OB}|}\right). \qquad (2)$$
According to Eq. (1), the coordinates obtained by the Kinect sensor can be converted to vectors and the corresponding angles can be calculated using Eq. (2).
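The space vector approach described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names and the sample coordinates are hypothetical, and the angle is measured at a common point O between the segments toward A and B.

```python
import math

def vector(p, q):
    """Vector from point p to point q, each given as (x, y, z)."""
    return tuple(b - a for a, b in zip(p, q))

def distance(p, q):
    """Euclidean distance between two 3D points, in the role of Eq. (1)."""
    return math.sqrt(sum(c * c for c in vector(p, q)))

def joint_angle(o, a, b):
    """Angle in radians at point o between segments o->a and o->b, in the role of Eq. (2)."""
    oa, ob = vector(o, a), vector(o, b)
    norm = distance(o, a) * distance(o, b)
    if norm == 0.0:
        raise ValueError("degenerate segment: two joints coincide")
    cos_theta = sum(x * y for x, y in zip(oa, ob)) / norm
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    return math.acos(max(-1.0, min(1.0, cos_theta)))

# Hypothetical camera-space coordinates in meters (not measured data).
shoulder = (0.0, 0.0, 2.0)
elbow = (0.0, -0.3, 2.0)
wrist = (0.3, -0.3, 2.0)
elbow_angle_deg = math.degrees(joint_angle(elbow, shoulder, wrist))
```

In this made-up pose the upper arm and forearm are perpendicular, so the computed elbow angle is a right angle. Clamping the cosine is a defensive choice for noisy sensor data.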
In this work, only the upper-body joint angles, consisting of the left and right arm joint angles, were collected. Since each arm has seven degrees of freedom (DoFs), 14 joint angles were captured in total, which include the shoulder pitch angle, shoulder roll angle, shoulder yaw angle, elbow pitch angle, elbow roll angle, wrist pitch angle, and wrist yaw angle for both left and right arms. Figure 4 shows the specific angle calculation process of a left arm based on the space vector approach. The black dotted lines OX, OY, and OZ are the Kinect's 3D coordinate system in Cartesian space, and the red dotted lines are auxiliary lines. The left-arm joint angles, including the shoulder pitch angle, were computed using Eq. (2) from the points labeled in Fig. 4 (for example, ∠COD). The corresponding angles of the right arm were computed in the same way. Thus, the angles of human body behaviors were acquired, and these angles were fed into the unsupervised algorithm together with other modality data to perceive human emotions.

Joint angular velocity collection by Myo armband
To obtain the joint angular velocity, human subjects need to wear two Myo armbands on each arm. One is worn near the center of the forearm and the other near the center of the upper arm. The Myo armband uses quaternions to obtain the joint angle and then derives the joint angular velocity from the change in joint angle. According to Yang et al., if the relevant joint angles are zero, any position of the human arm can be regarded as the initial position. (21) When the human arm is moved to a new pose U, the corresponding angle from the initial position to pose U is the rotation angle, namely, the joint angle.
We assume that the initial orientation of the Myo armband is denoted by the frame $(X_{l1}, Y_{l1}, Z_{l1})$ and that the current orientation of the Myo armband is denoted by the frame $(X_{l2}, Y_{l2}, Z_{l2})$. Then, the angular velocities of the shoulder pitch $v_{lx}$, shoulder roll $v_{ly}$, and shoulder yaw $v_{lz}$ can be obtained by the Myo armband worn on the left upper arm. The angular velocities of the elbow pitch $v_{l2x}$ and elbow roll $v_{l2y}$ were acquired by the Myo armband worn on the left forearm. The five joint angular velocities of the right arm, $v_{rx}$, $v_{ry}$, $v_{rz}$, $v_{r2x}$, and $v_{r2y}$, can be acquired in the same way. In total, ten joint angular velocities of the human arms were taken from the Myo armbands.
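The idea of deriving a joint angle from two orientations and an angular velocity from the change in that angle can be illustrated as follows. This is only a sketch under stated assumptions: orientations are taken to be unit quaternions in (w, x, y, z) order, the sampling period `dt` is a hypothetical value, and the decomposition into pitch, roll, and yaw components used in the paper is omitted. It does not use the Myo SDK.

```python
import math

def rotation_angle(q_init, q_cur):
    """Rotation angle in radians between two unit quaternions given as (w, x, y, z)."""
    dot = abs(sum(a * b for a, b in zip(q_init, q_cur)))
    # Clamp so rounding error cannot push acos out of its domain.
    return 2.0 * math.acos(min(1.0, dot))

def angular_velocity(angles, dt):
    """Finite-difference angular velocity (rad/s) from angles sampled every dt seconds."""
    if dt <= 0:
        raise ValueError("dt must be positive")
    return [(b - a) / dt for a, b in zip(angles, angles[1:])]

# Hypothetical orientations: the initial pose and a 90-degree rotation about the z axis.
q_initial = (1.0, 0.0, 0.0, 0.0)
q_rotated = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
angle = rotation_angle(q_initial, q_rotated)

# Hypothetical joint angles (rad) sampled every 0.02 s.
velocities = angular_velocity([0.0, 0.1, 0.3], dt=0.02)
```

Taking the absolute value of the quaternion dot product treats q and -q as the same orientation, which keeps the recovered angle in the range from 0 to pi.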
Before collecting data, the participants were required to stand in front of the Kinect sensor wearing two Myo armbands on each arm. The training data were then captured by the devices with two computers while the participants were showing emotional behaviors. One computer was used to obtain the joint coordinates and angles; the other was used to obtain the joint angular velocities of the left and right arms separately. The data collection experiments included four healthy participants aged between 22 and 30 years (two females and two males). The participants were asked to perform six types of emotion-aroused behaviors, and two sequences were collected from each person for each emotion. There were thus 48 sequences in total (4 participants × 6 emotions × 2 sequences).

Preprocessing
Each joint obtained from the Kinect sensor has eleven properties: color coordinates (X, Y), depth coordinates (X, Y), camera coordinates (X, Y, Z), and orientation coordinates (X, Y, Z, W). The Kinect's camera coordinates use the Kinect's infrared sensor to find the 3D points of the joints in space, and the camera space refers to the 3D coordinate system used by the Kinect. In this paper, we focused on the camera coordinates, which provide the 3D coordinate data. Nine joint coordinates (head, neck, torso, right shoulder, left shoulder, right elbow, left elbow, right hand, and left hand) from the human upper body were collected since they are significant for emotion. In other words, the joint coordinates had 27 dimensions (9 joints × 3 coordinates).
As mentioned above, there are 24 features from human arms, which include 14 joint angles and 10 joint angular velocities. For modality fusion, the feature-level fusion was employed to concatenate three types of feature vectors into a larger feature vector. The total number of dimensions of emotion-aroused human behaviors was 51.
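Feature-level fusion as described here amounts to concatenating the per-frame vectors (27 + 14 + 10 = 51). The sketch below illustrates this together with a scaling of each feature to the range from zero to one, which the paper applies before training. The function names are hypothetical, and per-feature min-max scaling is an assumed choice, since the exact normalization scheme is not specified.

```python
def fuse_frame(coords, angles, velocities):
    """Concatenate the three per-frame feature vectors (feature-level fusion)."""
    if (len(coords), len(angles), len(velocities)) != (27, 14, 10):
        raise ValueError("expected 27 coordinates, 14 angles, and 10 angular velocities")
    return list(coords) + list(angles) + list(velocities)

def min_max_normalize(sequence):
    """Scale each feature column of a sequence (a list of frames) to [0, 1]."""
    columns = list(zip(*sequence))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        # A constant feature carries no information; map it to 0.0 (assumed convention).
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(frame, lows, highs)]
        for frame in sequence
    ]

# Two made-up frames, only to show the shapes involved.
frame_a = fuse_frame([0.0] * 27, [1.0] * 14, [2.0] * 10)
frame_b = fuse_frame([1.0] * 27, [3.0] * 14, [2.0] * 10)
normalized = min_max_normalize([frame_a, frame_b])
```

Dropping the last ten entries of each fused frame gives the 41-dimensional input (coordinates and angles only) used when the angular velocities are left out.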

Unsupervised emotion recognition methods
RNNPB, as an unsupervised learning method, was employed to learn multimodal sensorimotor behaviors and classify human emotions by the corresponding spatiotemporal sequences. (22,23) RNNPB is essentially an RNN of the Jordan or Elman type. Here, the Elman-type RNN architecture was used. (24) Figure 5 shows the structure of the unsupervised RNNPB of the Elman type. (23) This RNNPB consists of five types of layers: input layer, hidden layer, parametric bias units (PB layer), context layer, and output layer. The input of the hidden layer ($y_h$) is the sum of three weighted parts, contributed by the input layer, the PB layer, and the context layer. Here, $k$ is the time step, which is omitted to avoid clutter when all quantities in one expression belong to the same time step. $w_{hi}$ is the weight between the hidden layer and the input layer, $w_{hp}$ is the weight between the PB layer and the hidden layer, and the weight connecting the hidden layer with the context layer is denoted as $w_{hc}$. $g_i(k)$ is the activation function, and the subscripts $i$ and $h$ are related to the parameters of the input and hidden layers. $PB_n(k)$ denotes the activation function of the PB layer.
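To make the three-part hidden-layer input concrete, the following sketch runs one forward step of an Elman-type network with PB units. It is an illustration under assumptions rather than the paper's model: the logistic sigmoid stands in for the function of Ref. 25, the hidden size (8), the PB size (2), and the random weights are hypothetical, and the PB vector is held fixed over the sequence for simplicity. Training, including how the PB values are learned, is not shown.

```python
import math
import random

def sigmoid(x):
    """Logistic sigmoid (assumed here; the paper uses the function of its Ref. 25)."""
    return 1.0 / (1.0 + math.exp(-x))

def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

def random_matrix(rows, cols, rng):
    """Small random weights, only for demonstration."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def rnnpb_step(x, pb, context, w_hi, w_hp, w_hc, w_oh):
    """One time step; returns (output, new_context)."""
    # Hidden-layer input: input part + PB part + context part.
    y_h = [a + b + c for a, b, c in
           zip(matvec(w_hi, x), matvec(w_hp, pb), matvec(w_hc, context))]
    hidden = [sigmoid(v) for v in y_h]
    output = [sigmoid(v) for v in matvec(w_oh, hidden)]
    # Elman type: the context layer holds a copy of the hidden state.
    return output, hidden

rng = random.Random(0)
n_in, n_hidden, n_pb = 51, 8, 2          # 51 fused features; other sizes are hypothetical
w_hi = random_matrix(n_hidden, n_in, rng)
w_hp = random_matrix(n_hidden, n_pb, rng)
w_hc = random_matrix(n_hidden, n_hidden, rng)
w_oh = random_matrix(n_in, n_hidden, rng)  # output size equals input size

pb = [0.5, 0.5]
context = [0.0] * n_hidden
for frame in [[0.5] * n_in for _ in range(3)]:  # three made-up normalized frames
    output, context = rnnpb_step(frame, pb, context, w_hi, w_hp, w_hc, w_oh)
```

The output layer has the same size as the input layer, which matches the setup described in the experimental section.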
The cost function during training is computed from the actual output $g_k$ over the $N$ units of the output layer. The weights in the network are updated by gradient descent. The learning rate of each weight ($\gamma_{ij}$) is adjusted after every epoch using the partial derivative of the cost with respect to $w_{ij}$. This partial derivative can be positive or negative, so its sign may change between epochs, and the change in sign is captured by $\varepsilon_{ij}$. If $\varepsilon_{ij} > 0$, the learning rate is increased by a factor greater than one to speed up convergence, and vice versa. The updated learning rate is kept within fixed bounds. Here, $\zeta^-$ and $\zeta^+$ represent the changing rates of $\gamma_{ij}$, where $\zeta^- < 1$ is the decreasing rate and $\zeta^+ > 1$ is the increasing rate, and $\gamma_{\min}$ and $\gamma_{\max}$ are the minimum and maximum values of $\gamma_{ij}$, respectively. The sigmoid function proposed in Ref. 25 is used for all neurons in RNNPB, as well as for the transfer function in the PB layer, with $x$ denoting the input vector to the neurons in the hidden and output layers.
The RNNPB model is used to classify human emotions without labeled datasets. In this method, the values of the PB units indicate the corresponding emotions of the datasets. Different sequences with the same emotion result in similar PB values. Thus, human emotions are recognized in an unsupervised way. Because of the additional PB layer, the RNNPB model is endowed with generalization ability to untrained datasets. This means that even when only a few samples are used for training, a relatively stable recognition result can be obtained. Before the fused data are fed into the network, the input features are normalized to enhance the accuracy and convergence speed of the model. The values of the normalized datasets range from zero to one. The multimodal RNNPB model is implemented in Python.
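The per-weight learning-rate adaptation can be sketched as follows. The exact formulas are not reproduced here, so the sketch assumes a common form of this rule: $\varepsilon_{ij}$ is the product of the current and previous partial derivatives, a positive product multiplies $\gamma_{ij}$ by $\zeta^+$ (capped at $\gamma_{\max}$), a negative product multiplies it by $\zeta^-$ (floored at $\gamma_{\min}$), and a zero product leaves it unchanged. The default parameter values are hypothetical and are not taken from Table 1.

```python
def update_learning_rate(gamma, grad, prev_grad,
                         zeta_minus=0.5, zeta_plus=1.2,
                         gamma_min=1e-6, gamma_max=1.0):
    """Return the adapted learning rate for one weight (assumed sign-based rule)."""
    eps = grad * prev_grad  # > 0: gradient kept its sign; < 0: the sign flipped
    if eps > 0:
        # Same direction as before: speed up, but never exceed gamma_max.
        return min(gamma * zeta_plus, gamma_max)
    if eps < 0:
        # Sign flipped (a minimum was overshot): slow down, but stay above gamma_min.
        return max(gamma * zeta_minus, gamma_min)
    return gamma

def gradient_step(weight, grad, gamma):
    """Plain gradient-descent update of a single weight."""
    return weight - gamma * grad

# Toy example on the cost E(w) = w**2, whose derivative is 2 * w.
w, gamma, prev_grad = 1.0, 0.1, 0.0
for _ in range(20):
    grad = 2.0 * w
    gamma = update_learning_rate(gamma, grad, prev_grad)
    w = gradient_step(w, grad, gamma)
    prev_grad = grad
```

On this toy quadratic the learning rate grows while the gradient keeps its sign and shrinks once an update overshoots the minimum, which is the behavior the text describes.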

Experiments
Motivated by our previous work, (17) we performed three experiments to recognize six human emotions and compare the clustering performance in different cases. The details will be introduced as follows.

Experimental setup
The experiments were performed with the same parameters for RNNPB to learn the spatiotemporal sequences of human behaviors. The parameters are shown in Table 1.
The sizes of the input and output layers are not listed in Table 1. Both are equal to the dimension of the input data. Since the dimension of the input data differs between experiments, the sizes of the input and output layers differ accordingly.

Experimental results
Three experiments were carried out to explore how the different modality sensor data affect emotion recognition results. Three types of data sets were fed into RNNPB for training. In the first experiment, 12 sequences with 41 dimensions (two sequences for each emotion) that include the joint coordinates and angles were provided as the input of the network. For the second experiment, the same type of emotion data was used to recognize emotion with the same parameters for the network. Different from the first experiment, the dimension of the input data was 51; the additional 10 dimensions are the human joint angular velocities of both the left and right arms. Note that all the training data sets of the first and second experiments were captured from one person. In the third experiment, 24 sequences (four sequences for each emotion) expressing six emotions were used for training. The data structure was the same as in the second experiment. However, the data sets were collected from four different people. Since the previous experiment was conducted using the single modal data (coordinates) based on RNNPB, only the results between the merging of information (joint coordinate and angle) and the fusion of different multimodal sensor data (data collected from the Kinect sensor and Myo armband) were compared in this paper. (16,17) The PB values of the first and second experiments are shown in Figs. 6 and 7, respectively, and the corresponding results of the third experiment are presented in Fig. 8. In Figs. 6-8, markers with the same shape express the same emotion, and markers with the same shape but different colors denote different sequences of one emotion. The annotations "angry1", "angry2", "angry3", and "angry4" denote different sequences of the angry emotion in the figures. The root mean square error (RMSE) is computed as
$$\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2},$$
where $m$ is the total number of samples, $h(x^{(i)})$ is the predicted value of the $i$-th sample, and $y^{(i)}$ is the actual value of the $i$-th sample.
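The RMSE defined by these symbols (predicted values h(x^(i)), actual values y^(i), and m samples) can be computed as in the short sketch below. The function name and the sample numbers are hypothetical.

```python
import math

def rmse(predicted, actual):
    """Root mean square error between predicted and actual values of m samples."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    m = len(predicted)
    return math.sqrt(sum((h - y) ** 2 for h, y in zip(predicted, actual)) / m)

# Made-up values, only to show the call.
error = rmse([0.2, 0.4, 0.9], [0.0, 0.5, 1.0])
```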

Analysis of experimental results
According to the experimental results, the PB values corresponding to the same emotions are clustered together, and the RMSE converges to a small value. To investigate how the results vary with different modality data, the recognition performance characteristics of the first and second experiments were evaluated on the basis of the above results.
The emotion recognition performance was assessed from two perspectives. The first is the distance of PB values corresponding to different emotions. The quantitative confusion matrices are given in Figs. 12 and 13, which present the distance among various PB values in the PB space to evaluate the clustering results of the first and second experiments. Since the PB value in the PB space is a point, the distance between two PB values can be computed using Eq. (1). The distance of the PB values includes the intraclass distance d w , interclass distance d b , and relative distance d r . d r is calculated using the maximum d b divided by the average of d w . The intraclass distance reflects the aggregation level of the same class, and the interclass distance reveals the scattered level of different classes. The relative distance expresses the relationship between the intraclass and interclass distances. These distances reflect the clustering performance to some extent. In general, the desired clustering result is that d w is small, and d b and d r are large. The second point is the convergence speed and eventual values of RMSE. The detailed analyses and comparisons will be discussed with respect to these two points.
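The three distances can be illustrated with a small sketch. Several details are assumptions made for the example, since the text does not fix them: d_w is taken as the mean pairwise distance between the PB points of one emotion, d_b as the distance between the centroids of two emotions, and d_r as the maximum d_b divided by the average d_w as stated above. The PB values are made up.

```python
import math
from itertools import combinations

def dist(p, q):
    """Euclidean distance between two points in PB space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(points):
    """Mean point of a list of PB points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def pb_distances(pb_by_emotion):
    """Return (d_w per emotion, d_b per emotion pair, d_r)."""
    d_w = {}
    for emotion, points in pb_by_emotion.items():
        pairs = list(combinations(points, 2))
        if not pairs:
            raise ValueError("each emotion needs at least two sequences")
        # Intraclass distance: mean pairwise distance within one emotion (assumed).
        d_w[emotion] = sum(dist(p, q) for p, q in pairs) / len(pairs)
    centers = {e: centroid(pts) for e, pts in pb_by_emotion.items()}
    # Interclass distance: distance between class centroids (assumed).
    d_b = {(a, b): dist(centers[a], centers[b]) for a, b in combinations(centers, 2)}
    if not d_b:
        raise ValueError("need at least two emotions")
    mean_d_w = sum(d_w.values()) / len(d_w)
    # Relative distance: maximum interclass over average intraclass distance.
    d_r = max(d_b.values()) / mean_d_w if mean_d_w > 0 else math.inf
    return d_w, d_b, d_r

# Made-up 2D PB values for two emotions with two sequences each.
pb_values = {
    "joy": [(0.1, 0.1), (0.1, 0.3)],
    "sad": [(0.9, 0.1), (0.9, 0.3)],
}
d_w, d_b, d_r = pb_distances(pb_values)
```

With these made-up points, both intraclass distances are about 0.2 and the interclass distance is about 0.8, so d_r is about 4. A larger d_r indicates tighter clusters relative to their separation.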
First, the specific distances of PB values are listed in Table 2. The first experiment shows a large intraclass distance for the same emotion (Fig. 6). As for the second experiment, the results in Fig. 7 clearly show that the intraclass and interclass distances are both smaller than those in the first experiment. However, the relative distance $d_r$ of the second experiment is larger than that of the first one. Combining Fig. 7 and Table 2, we can conclude that the emotion recognition result of the second experiment is better. This implies that the joint angular velocity is useful for distinguishing different emotions and for reducing the distance within the same class. In other words, the joint angular velocity may provide complementary information on emotions. Second, the convergence speed and the mean RMSE values over 200 epochs are compared by observing Figs. 9 and 10. A higher convergence speed and a smaller RMSE are desirable. Compared with the first experiment, the second experiment converges much faster and reaches a slightly smaller RMSE. To sum up, the recognition results of the fused data from the multimodal sensors show a better performance than those of the single modal sensor data.
The first and second experiments classified six emotions from only one person. Therefore, the third experiment was performed to investigate the effectiveness and stability of our proposed framework. Figure 8 shows the recognition results. The results imply that the RNNPB framework can broadly classify six emotions from different people. Since different people express the same emotion through different behaviors, the classification results are slightly inferior to those of the first or second experiment. The expressions of sensorimotor behaviors are regulated by the internal emotion states; (16) there are also some common critical features for the same emotion of different people. This is supported by the classification results. According to Figs. 6-8, the results of sad and neutral emotions are better than those of other emotions. That is probably because the external behavior and internal state of the two emotions are both very similar.

Conclusions
In this paper, an unsupervised RNNPB framework was proposed to classify human emotions using multimodal sensor fused data. Multimodal data were the spatiotemporal sequences of emotional human behaviors, which were collected by a wearable Myo armband and a Kinect sensor containing human joint coordinates, angles, and angular velocities. Then, three experiments were performed to explore how multimodal data affect the emotion recognition results and to evaluate the stability of the RNNPB framework. The experimental results showed that multimodal fused data can markedly increase the relative distance between intraclass and interclass, and decrease the intraclass distance and RMSE compared with the single modal sensor data. The qualitative and quantitative analysis and evaluation results demonstrated the effectiveness of our proposed RNNPB framework. Moreover, these experimental results also indicated that signals from different modalities provide complementary information, and that the multimodal information can be integrated to enhance the robustness of the emotion recognition system compared with a single modal framework.
In the future, we will combine visual information (facial expression), auditory information (voice), and human behaviors to construct a more robust and effective emotion recognition system to enhance the clustering performance.