Detection of Head Motion from Facial Feature Points Using Deep Learning for Tele-operation of Robot

We propose an interface for the tele-operation of a laparoscope-holder robot via head movement using facial feature point detection. Fourteen feature points on the operator's face are detected using a camera. The vertical and horizontal rotation angles and the distance between the face and the camera are estimated from these points using deep learning. The training data for deep learning are obtained using a dummy face. The root-mean-square error (RMSE) between the estimated and directly measured values is calculated for different numbers of nodes, layers, and epochs, and suitable numbers are determined from the RMSE values. The trained network is evaluated with four subjects. The effectiveness of the proposed method is demonstrated experimentally.


Introduction
In the tele-operation of a robot relying on indirect vision from a camera, the view angle of the camera affects the operation.(1-6) An interface that allows the operator to control the camera through head motion, providing a suitable view, is therefore effective. A laparoscope-holder robot named EMARO is used in minimally invasive surgery. The robot is operated by the head motion of the operator (a surgeon wearing a cap with a gyroscope); the pitch and yaw motions are controlled by head movement while the foot pedal is pushed.(7,8) However, the wiring of the gyroscope may interfere with the operation. The use of eye tracking for control is one solution, and interfaces for operating robots using eye tracking have been proposed.(9-14) However, controlling the zoom of the camera is difficult with eye tracking.
We have proposed an interface for the tele-operation of a camera, in which images of two markers attached to the operator's head are tracked by visual odometry. (15) However, misalignment of the markers leads to a risk of malfunction. To solve this problem, we previously proposed an interface for robotic tele-operation involving head movement and facial feature point detection. The feature points on the operator's face were detected using a camera and the position and posture of the face were calculated using these points. (16) Although the effectiveness of the interface has been experimentally demonstrated with a laparoscope holder, improvements in estimation accuracy are desired.
In this paper, we propose a method of improving the estimation accuracy using deep learning. Fourteen feature points on the operator's face are set as the input for the deep neural network. The outputs are the horizontal and vertical rotation angles of the operator's head and the distance between the camera and the face. The training data for deep learning are obtained using a dummy face. The root-mean-square error (RMSE) between the estimated and directly measured values is calculated for different numbers of nodes, layers, and epochs, and the optimal numbers are determined from the RMSE values. The rotations and the distance are then estimated with four subjects. Considering the application of the interface in surgery, the operator wears a mask; if markers are placed on the mask as feature points, the interface can still be applied.

Setup of interface
The horizontal and vertical rotations of the operator's head (θH and θV) and the distance between the camera and the face (L) are estimated using deep learning. The gazing point on the monitor is calculated from θH and θV and displayed as a red dot. The diameter of the dot is inversely proportional to L.
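The mapping from head angles to the on-screen dot can be sketched as follows. The monitor dimensions and the diameter scale factor are not specified in the paper, so the values and names below are illustrative assumptions; only the inverse proportionality of the dot diameter to L comes from the text.

```python
import math

def gaze_point(theta_h_deg, theta_v_deg, distance_mm):
    """Project head rotation angles onto the monitor plane.

    Returns (x, y) in mm from the monitor centre, plus a dot diameter
    that is inversely proportional to the face-camera distance L.
    The diameter scale factor (12000.0) is an assumed placeholder.
    """
    x = distance_mm * math.tan(math.radians(theta_h_deg))
    y = distance_mm * math.tan(math.radians(theta_v_deg))
    diameter = 12000.0 / distance_mm  # shrinks as the face moves away
    return x, y, diameter
```

At the nominal 600 mm viewing distance, a 10 deg horizontal rotation shifts the dot by roughly 106 mm, which is what lets the operator reach the outer cells of the grid described below.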
The monitor is divided into a 3 × 3 grid. The operator controls the laparoscope-holder robot while gazing at the monitor. The robot remains in the same position when the dot is in the center area of the monitor. When the operator gazes at one of the other squares of the grid, the robot moves at a constant velocity until the target object enters the center area. If the estimation accuracy of θH and θV is low, the intuitiveness of the operation deteriorates; improving the accuracy of angle estimation will improve task efficiency. The initial distance between the monitor and the operator is 600 mm. The robot starts to zoom in when the distance becomes shorter than 550 mm and starts to zoom out when the distance becomes longer than 650 mm.
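The control logic above can be sketched as a simple state-free mapping. The paper specifies only the 3 × 3 grid and the 550/650 mm zoom thresholds; the monitor half-dimensions (`half_w`, `half_h`) and the ±1 command encoding are assumptions for illustration.

```python
def command(x_mm, y_mm, distance_mm, half_w=260.0, half_h=160.0):
    """Map the gaze dot position and face distance to robot commands.

    The monitor is split into a 3 x 3 grid: the robot pans/tilts at
    constant speed unless the dot sits in the centre cell, and zooms
    when the face leaves the 550-650 mm dead band around 600 mm.
    Returns (pan, tilt, zoom), each in {-1, 0, +1}.
    """
    # Which third of the screen is the dot in? (0 = centre column/row)
    pan = 0 if abs(x_mm) < half_w / 3 else (1 if x_mm > 0 else -1)
    tilt = 0 if abs(y_mm) < half_h / 3 else (1 if y_mm > 0 else -1)
    # Zoom dead band: no zoom between 550 and 650 mm.
    if distance_mm < 550:
        zoom = 1       # operator leans in -> zoom in
    elif distance_mm > 650:
        zoom = -1      # operator leans back -> zoom out
    else:
        zoom = 0
    return pan, tilt, zoom
```

With this encoding, gazing at the centre cell from 600 mm yields (0, 0, 0), i.e., the robot holds its position.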

Estimation of face motion using deep learning
In this section, we explain the method of estimating the face motion from the 14 feature points using deep learning. A dummy face was used to generate the training data. The height and width of the dummy face shown in Fig. 2 were 210 and 115 mm, respectively. The dummy face was rotated sinusoidally in the horizontal and vertical directions at a frequency of 0.25 Hz. The rotational range was −30 to 30 deg in the horizontal direction and −25 to 25 deg in the vertical direction. The experiments were performed at L = 450, 500, and 550 mm because the size of the dummy face is about 5/6 that of an adult face.

Figure 3 shows the structure of the deep learning network. Chainer (Preferred Networks, Inc.) was used for deep learning. The input data were the 14 facial feature points detected during both the horizontal and vertical motions, totaling 28 input values. The output data were θH, θV, and L. The input and output data were recorded at 30 frames per second. Of the recorded data, 3600 values were used as training data and 1200 as validation data after learning. The rectified linear unit (ReLU) and adaptive moment estimation (Adam) were used as the activation function and the optimizer, respectively. To learn the optimal weights by deep learning, it is necessary to set appropriate numbers of layers, nodes, and training epochs. The optimal numbers of layers and nodes were determined from among 1, 2, 4, 8, and 16 layers and 10, 20, 30, and 50 nodes. Figure 4 shows the calculated θH and θV for different numbers of nodes with two layers. The horizontal axis shows the number of nodes and the vertical axis shows the root of square error (RSE) of the calculated angles. The calculation error is minimized with 30 nodes. The same tendency is observed for other numbers of layers. Figure 5 shows the calculated results of L for the different numbers of nodes with two layers.
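The network described above is a small fully connected regressor: 28 inputs, hidden layers with ReLU activation, and a 3-value linear output (θH, θV, L). The paper used Chainer with Adam; the sketch below is a framework-agnostic NumPy forward pass with random placeholder weights, only to make the layer dimensions concrete. The He-style weight scaling is an assumption, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# 28 inputs (14 feature points over the two motions) -> two hidden
# layers of 30 nodes -> 3 outputs (theta_H, theta_V, L). Weights here
# are random placeholders; in the paper they are trained with Adam.
sizes = [28, 30, 30, 3]
W = [rng.standard_normal((m, n)) * np.sqrt(2.0 / m)
     for m, n in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    h = x
    for Wi, bi in zip(W[:-1], b[:-1]):
        h = relu(h @ Wi + bi)      # hidden layers use ReLU
    return h @ W[-1] + b[-1]       # linear output layer

y = forward(rng.standard_normal(28))
assert y.shape == (3,)
```

The same `forward` also accepts a batch of shape (N, 28), which matches how the 3600 training frames would be pushed through the network.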
The horizontal axis shows the number of nodes and the vertical axis shows the RSE of the calculated length. From these results, we determined the optimal number of nodes to be 30 and the optimal number of layers to be two. Figure 8 shows the learning curves. Since the accuracy plateaued as the number of epochs approached 500, this number of epochs was selected. The training was performed with 3600 data values using 30 nodes, 2 layers, and 500 epochs; validation was then performed with 1200 data values. Figure 9 shows the estimated results of θH for L = 300 mm. The same distance as in Ref. 16 was selected for comparison. The RMSE was 4.36 deg. In Ref. 16, we obtained the position and posture of the face by solving the perspective-n-point (PnP) problem; the results are shown in Fig. 10, with an RMSE of 6.42 deg. The phase delay observed in Fig. 10 is due to the computation time required to solve the PnP problem. No phase delay is observed in Fig. 9 because the computation time is less than 1 ms. A comparison of Figs. 9 and 10 shows that the proposed method is more accurate.
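The accuracy figures quoted throughout (4.36 deg vs 6.42 deg, etc.) are root-mean-square errors between the estimated angles and the reference values. For completeness, a minimal implementation of this metric:

```python
import numpy as np

def rmse(estimated, measured):
    """Root-mean-square error between estimates and reference values."""
    e = np.asarray(estimated, dtype=float)
    m = np.asarray(measured, dtype=float)
    return float(np.sqrt(np.mean((e - m) ** 2)))
```

Applied to the per-frame angle estimates and the directly measured angles, this yields a single scalar in the same units (deg or mm), which is what Figs. 4, 5, and 9-14 report.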

Experiments with Interface
The interface trained using the dummy face, as described in Sect. 2, was evaluated with four subjects.

Experimental procedure
The proposed interface shown in Fig. 1 was tested with four subjects using the network trained with deep learning. A gyro sensor (MPU9250/6500, HiLetgo, China) was mounted on each subject's head as a reference for the estimated angles. The experiments were performed at distances of 550, 600, and 650 mm.
The subjects were asked to rotate their face horizontally by −10, −20, 10, and 20 deg from the initial posture facing forward while watching the output of the gyro sensor; the clockwise direction was defined as positive. The subjects then rotated their face vertically by −10, −20, 10, and 20 deg, with the upward direction defined as positive. Figure 11 shows the experimental results of the horizontal rotation at distances of 550, 600, and 650 mm. The average RMSE of the four subjects is plotted on the vertical axis; the value was minimum at L = 600 mm. Figure 12 shows the experimental results of the vertical rotation at the three distances. The estimation accuracy was higher for the vertical rotation. This is considered to be due to the arrangement of the feature points: the heights of feature points 2-14 were larger than the widths of points 1-4, so the feature points changed markedly upon vertical rotation of the face. Therefore, the estimation accuracy was higher for the vertical rotation than for the horizontal rotation.

Experimental results
Next, the subjects performed sinusoidal rotational motion. The initial distance from the camera was 600 mm. The subjects rotated their face by about 20 deg while watching the output of the gyro sensor. Figures 13 and 14 respectively show the experimental results of the horizontal and vertical rotations. The black line shows the output of the gyro sensor, and the red line shows the results estimated from the facial feature points using deep learning. The RMSE values were 4.78 and 3.53 deg, respectively. The calculation delay of the previous method,(16) which results from the time taken to solve the PnP problem, was about 75 ms, compared with only 25 ms for the proposed method with deep learning. There was a slight phase delay in the vertical rotation. Because the dummy face was smaller than the subjects' faces, the detection accuracy of the feature points on the chin was low; this accuracy can be improved by changing the size of the dummy face. Nevertheless, the black and red lines are in good agreement. The effectiveness of the proposed method was confirmed from the experimental results.

Conclusions
In this paper, we proposed an interface for the tele-operation of robots via head movement using facial feature point detection. The vertical and horizontal rotation angles and the distance between the face and the camera were estimated from 14 feature points on the operator's face using deep learning. The training data for deep learning were obtained using a dummy face. The RMSE between the estimated values and the values directly measured using sensors was calculated for different numbers of nodes, layers, and epochs. We found that 2 layers, 30 nodes, and 500 epochs were suitable for deep learning. The trained network was evaluated with four subjects. We confirmed that sinusoidal head motion is effective for training in the proposed method.