Using Machine Learning to Estimate Difficulty Levels of Problems

In an e-learning environment in which a teacher cannot interact directly with a student, it can be difficult to ascertain a student's difficulty with a subject. In this study, machine learning was used to estimate the level of difficulty of problems experienced by a student to ensure that problems of appropriate difficulty are provided. JINS MEME smart eyewear was used to measure the head movements of students, and the results were used to estimate the subjective difficulty that they experienced. Our experimental tests demonstrate the F1-scores of machine learning for 10 users who were given calculation, kanji (Chinese character), and programming problems. The feature importance scores of the random forest (RF) were calculated, and the dependence of the F1-score on the type of user was examined. It was found that the mean of the yaw angle was the most important feature in all cases, indicating that the horizontal rotation of the head may depend on the difficulty of the problem.


Introduction
In order for classes to progress in a flexible manner, teachers in classrooms should be able to easily understand the subjective level of difficulty a student has with a certain topic. Generally, in one-on-one classes, teachers can adjust the level of difficulty on the basis of a student's facial expressions and gestures, allowing classes to proceed according to the student's ability. However, learning environments such as e-learning and remote classrooms can make it relatively difficult for instructors to accurately determine difficulty levels.
Learning through web-based teaching materials such as e-learning videos has been introduced in several educational fields in recent years. However, it is difficult for teachers to assess the situation of each student, and e-learning therefore places greater demands on a student's ability to understand the material independently. Ohkawauchi et al. (1) investigated the estimation of the subjective difficulty experienced by students watching e-learning lecture videos and demonstrated that actions such as pausing and rewinding are correlated with subjective difficulty. Nakamura et al. (2) studied the estimation of the subjective difficulty of students during e-learning using a camera to obtain facial characteristics (such as face tilt angles, gaze positions, and whisper time) from images of students' faces; the level of subjective difficulty was estimated using a support vector machine (SVM). Studies estimating student behaviors and states in web-based lectures and e-learning resources have also been conducted. (3)(4)(5)(6)(7)(8)(9)(10)(11)(12)(13)(14)(15) Some of these studies estimated the level of subjective difficulty by measuring eye movements, which are less affected by individual differences than other phenomena. (3,4) Shigeta et al. (3) studied the subjective difficulty of English listening by analyzing eye-movement data gathered using the Freeview software developed by Takei Kikai Kogyo Co., Ltd. The results indicated significant differences in eye movement speed, gaze time, and number of blinks among learners. Okoso et al. (4) studied the subjective difficulty of English words in English documents using a deep learning approach in combination with gaze information measured using a Tobii eye tracker, making it possible to provide word-based exams appropriate for an individual's level of understanding.
However, both of these studies (3,4) required a dedicated device to measure eye movements, which is only usable when the person is seated in front of a personal computer. Therefore, in the present paper, we consider the use of a wearable device that is not restricted to use in front of a computer and removes the need for a camera. The purpose of this study was to measure the head movements of students using a glasses-type wearable device and to apply machine learning techniques to estimate the subjective difficulty they experienced.

JINS MEME
JINS MEME smart eyewear was used in this study as it has nearly the same design and feel as ordinary glasses. There are three types of JINS MEME eyewear: (1) MT, which can measure acceleration and angular velocity; (2) ES_R, which can measure electrooculogram raw data in addition to acceleration and angular velocity; and (3) ES, in which the installed sensor is the same as that of ES_R but no raw electrooculogram data can be collected. ES can measure the speed and strength of blinks using the JINS MEME application programming interface (API).
Several studies using JINS MEME have been conducted. (16)(17)(18) Ogawa et al. (16) estimated workload by measuring blink data and utterance data while participants played a video game (Tetris) under varying workloads. Nagao et al. (17) studied the various states of students during learning, such as listening or note taking. In a study of subjective difficulty estimation using JINS MEME, Mori et al. (18) considered four-choice questions on English vocabulary. To construct a system that supports efficient self-study, they used JINS MEME ES_R together with the chest-mounted device myBeat, which can measure characteristics such as heart rate and RR interval (RRI), for subjective difficulty estimation; the time required for answering was added to the information obtained from the two devices, and estimation was performed on the basis of these features.
In the present study, ES was adopted to verify the possibility of estimating difficulty from data converted by the JINS MEME API instead of from raw data. Each sensor value of the ES can be recorded at a sampling frequency of 20 Hz using an application connected to an iOS or Android device via Bluetooth.

Machine learning
The features used to estimate the degree of difficulty have 30 dimensions (six signals multiplied by five basic statistics). The six signals are head movements: the x-, y-, and z-axis accelerations and the roll, pitch, and yaw angles, each computed over a time window of 2 s. The five basic statistics are the mean, standard deviation, maximum value, minimum value, and median. A time window of 2 s is often used in studies employing acceleration sensors. The resulting 30-dimensional feature vector is standardized to have a mean of zero and a variance of one, and machine learning is performed on a 10-dimensional feature vector obtained through principal component analysis (PCA). Four methods were used: SVM, random forest (RF), decision tree (DT), and k-nearest neighbor (k-NN). DT and RF were used because their classification rules are easy to understand and the corresponding data are convenient to visualize, and SVM can perform binary classification with high accuracy, as has been reported in previous studies. (2,3,18) k-NN, the most basic classification method, was used owing to its simplicity.
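As a concrete illustration, the windowing and classification pipeline described above can be sketched with scikit-learn. The function and variable names here are our own, and details such as non-overlapping windows are assumptions not specified in the text.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

FS = 20          # JINS MEME ES sampling frequency (Hz)
WINDOW = 2 * FS  # 2 s window -> 40 samples

def window_features(signals):
    """signals: (n_samples, 6) array of accel x/y/z, roll, pitch, yaw.
    Returns an (n_windows, 30) matrix: 5 statistics x 6 channels,
    computed over consecutive non-overlapping 2 s windows."""
    n_windows = len(signals) // WINDOW
    feats = []
    for i in range(n_windows):
        w = signals[i * WINDOW:(i + 1) * WINDOW]
        feats.append(np.concatenate([
            w.mean(axis=0), w.std(axis=0),
            w.max(axis=0), w.min(axis=0),
            np.median(w, axis=0),
        ]))
    return np.asarray(feats)

# the four classifiers compared in the study
classifiers = {
    "SVM": SVC(),
    "RF": RandomForestClassifier(),
    "DT": DecisionTreeClassifier(),
    "k-NN": KNeighborsClassifier(),
}

def build_model(clf):
    # standardize, reduce 30 -> 10 dimensions with PCA, then classify
    return make_pipeline(StandardScaler(), PCA(n_components=10), clf)
```

Note that a model built this way applies PCA before classification, so any feature importances it reports refer to principal components rather than to the original 30 statistics.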
The JINS MEME ES enables measurement of the speed and strength of blinks as well as the acceleration and angular velocity of the head. First, we measured both blinks and head movements and used them to perform estimations using machine learning. However, the number of blinks is known to decrease from about 20 times per minute during normal activity to about 10 times per minute during reading and about five times per minute when working on a personal computer. Blinks were therefore often not detected when the time window was 2 s. Thus, the time window was set to 20 s, and the feature importance scores of RF were calculated. The head-movement features were found to be the most important, and the number of blinks did not provide any useful insight. For this reason, we excluded blink data from the features described in this study.
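The blink rates quoted above explain why a 2 s window rarely contains a blink; a quick calculation of the expected blink count per window makes this explicit (the rates are those cited in the text):

```python
# Expected number of blinks per analysis window at the reported rates.
rates_per_min = {"normal activity": 20, "reading": 10, "PC work": 5}

for window_s in (2, 20):
    for activity, rate in rates_per_min.items():
        expected = rate / 60 * window_s
        print(f"{window_s:>2} s window, {activity}: {expected:.2f} blinks")
```

At five blinks per minute, a 2 s window contains about 0.17 blinks on average, so most windows hold none, whereas a 20 s window contains about 1.7.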

Experimental methods
Studies estimating learners' difficulty levels have been conducted for calculation problems (2) and English language problems, (3,4,18) and JINS MEME eyewear has also been utilized. (18) In this study, we considered three types of problems, namely, calculation problems and kanji (Chinese character) problems, whose difficulty levels can be easily adjusted, and programming problems, which are likely to reveal individual differences in ability. The calculation problems were fill-in-the-missing-number problems. An example of an easy calculation problem is "■ ÷ 3 = 1" and that of a difficult problem is "(■ − 9 / 6) = 3 ÷ 0.125". The kanji problems consisted of writing the kanji corresponding to words given in hiragana; we used the third and seventh levels of the Japan Kanji Aptitude Test. The programming problems were taken from the written part of a university entrance examination. Figure 1 shows a view of the experiment. We recruited 10 participants, all of whom were male and aged between 20 and 22 years. All participants were informed of the purpose and content of the study and agreed to the privacy protection measures. Furthermore, the Research Ethics Board of the National Institute of Technology, Ishikawa College approved this study through an ethics review.
The experimental procedure for the calculation problems was as follows:
(1) Each participant sat on a chair and wore the JINS MEME eyewear.
(2) The participant was asked to solve a problem within 6 min.
(3) The participant then took a break for 3 min.
(4) The participant was asked to solve another problem within 6 min.
(5) After completing the tasks, the participant was asked to state whether the problem in step (2) or step (4) was more difficult; the choices were labeled as easy or difficult.
These steps were repeated for the kanji and programming problems. The first minute of each 6-min recording was excluded because the corresponding data were unstable. We prepared problems that could not be solved within 6 min, and none of the participants were able to solve them. The data from the second to the fifth minute were categorized as training data (number of data: 240), and the data from the last minute were categorized as test data (number of data: 60). Randomly sampling the data for training could cause data leakage; we chose this chronological split to avoid that problem. The training data were examined to find the parameter that maximizes the score of 10-fold cross-validation (CV), and the model was trained using this parameter. Next, the F1-score (also called the F-score or F-measure), which is a measure of a test's accuracy, was used to evaluate the estimations.

F1-score for each user
The experimental data were used to evaluate the estimation for each user. The results obtained are presented in Tables 1-3. The rows in Tables 1-3 represent the users and the columns represent the learning methods.
For the calculation problems, the F1-scores were 85% with SVM, 89% with RF, 80% with DT, and 81% with k-NN. However, in the case of user J, the average F1-score was 45%, which was lower than that of the other users. In the case of the kanji problems, the F1-scores were 90% with SVM, 87% with RF, 83% with DT, and 88% with k-NN. For the programming problems, the F1-scores were 77% with SVM, 75% with RF, 72% with DT, and 74% with k-NN, and for several users, the F1-score was more than 70%. However, the F1-score was low for users C, G, and J.
The feature importance of RF was evaluated, and the three most important features were recorded for each user, with the results presented in Tables 4-6. It was found that the mean of the yaw angle is the most important feature in all cases. This indicates that the horizontal rotation of the head may depend on the difficulty of the problem. It appears that this is because the speed of problem solving depends on the level of difficulty of the problem. In general, when solving a problem, the head tilts and nods, and therefore the pitch angle, which represents the vertical movement of the head, is likely to be related to the level of difficulty. However, in this experiment, only a few users exhibited this tendency, and therefore the yaw angle was found to be the most important feature.
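A minimal sketch of how the top-three RF features might be extracted is given below; the feature names and their ordering are our own assumptions. For interpretability, the RF is fitted here on the raw 30-dimensional features rather than on the PCA-reduced ones, since PCA components mix the original channels.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# assumed naming scheme: 5 statistics x 6 head-movement channels
CHANNELS = ["acc_x", "acc_y", "acc_z", "roll", "pitch", "yaw"]
STATS = ["mean", "std", "max", "min", "median"]
FEATURE_NAMES = [f"{s}_{c}" for s in STATS for c in CHANNELS]  # 30 names

def top_features(X, y, k=3):
    """Fit an RF on the 30-dimensional window features and return the
    k most important (name, importance) pairs, ranked by RF's
    impurity-based feature_importances_."""
    rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    order = np.argsort(rf.feature_importances_)[::-1][:k]
    return [(FEATURE_NAMES[i], rf.feature_importances_[i]) for i in order]
```

On the study's data, a ranking like this would place `mean_yaw` first for every user.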

Dependence of F1-score on the type of user
The dependence of the F1-score on the type of user was evaluated using the experimental data. In particular, a model was trained using the data from each user (number of data: 300), and F1-scores were then computed for the test data of all users. Only the SVM was used in this evaluation, and the F1-scores are presented in Tables 7-9. It was observed that the responses to the calculation and programming problems vary greatly among individuals; depending on the user, the F1-score was as high as 100% or as low as 1%. However, the kanji problems appeared to be the easiest to characterize using the users' results: for users A-H, the F1-score was at least 50%. We suppose that the movement of these users was similar because it depends on the problem's difficulty. In contrast, when users' movements are dissimilar, the F1-score of difficulty estimation is more dependent on the type of user. Therefore, we conclude that user dependence may be reduced by employing training data from users with similar abilities.
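The cross-user evaluation (training an SVM on each user's data and testing it on every user's held-out data) could be organized as in the following sketch, which assumes a hypothetical per-user data layout of our own devising:

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def cross_user_f1(data):
    """data: {user: (X_train, y_train, X_test, y_test)}.
    Trains one SVM per user and returns {(train_user, test_user): F1},
    i.e., the full matrix behind tables like Tables 7-9."""
    scores = {}
    for u, (X_tr, y_tr, _, _) in data.items():
        model = make_pipeline(StandardScaler(), SVC()).fit(X_tr, y_tr)
        for v, (_, _, X_te, y_te) in data.items():
            scores[(u, v)] = f1_score(y_te, model.predict(X_te))
    return scores
```

The diagonal entries correspond to same-user estimation, while off-diagonal entries measure how well one user's model transfers to another.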

Conclusions
In this study, we used a machine learning approach to estimate the degree of difficulty experienced with different types of learning content on the basis of the head movements of students. We used JINS MEME eyewear, which does not require a camera, to track the head movements of students and estimate the difficulty of problems. Our results show the high F1-score of the proposed approach. The most important features of RF were examined, and the yaw angle, which represents the left-right rotation of the head, was found to be the most important feature in all cases. Additionally, when the dependence of the F1-score on the type of user was examined using models trained on other users' data, we observed significant differences in the results depending on the particular student and type of problem. In future work, the F1-score will be examined by increasing the number of participants and considering other factors such as age, gender, and ability. This will enable us to develop a highly accurate approach to determining difficulty levels using machine learning, which may be applied in the rapidly expanding field of online learning.