Automatic Labeling Framework for Wearable Sensor-based Human Activity Recognition

Labeled datasets are a key factor in obtaining a good and robust classifier with supervised learning methods. However, labeling raw data is a tedious and labor-intensive process that is usually done manually. Many methods have been proposed that use a small amount of labeled data to train a classifier robust enough to label more training data or to make predictions on unlabeled data. Unlike previous studies, we propose an automatic labeling framework that requires no labeled data in advance and directly annotates unlabeled time series data for body-worn sensor-based human activity recognition (HAR) in laboratory settings. The framework automatically labels collected time series activity data by transforming the original data into its corresponding absolute wavelet energy entropy and detecting activity endpoints on the basis of constraints and information extracted from a predefined human activity sequence. The performance of the proposed framework was evaluated on the collected dataset and the UCI HAR Dataset. In both cases, the average precision and recall scores are above 81.9% and the average F-measure scores are above 88.9%. The results show that the proposed framework can be adopted as a rapid and reliable way of generating labeled datasets from unlabeled data.


Introduction
Since the 1990s, researchers have used wearable sensors for human activity recognition (HAR). (1) From the perspective of sensor types, HAR studies mainly rely on vision sensors (such as cameras), ambient sensors in smart home scenes, and wearable sensors [such as accelerometers, gyroscopes, and inertial measurement units (IMUs)], and they mainly apply supervised learning methods to learn different human activity patterns from collected human motion data. Therefore, acquiring a proper set of labeled data is the basis for training HAR models. Annotation techniques can generally be classified as offline and online methods. Specifically, offline methods include self-recall, (2) indirect observation, and video and audio recordings, (3,4) and online methods include direct observations, (5) time diary, and experience sampling. (6,7) For studies acquiring data in a laboratory scenario, direct observation and video recordings are usually taken as the annotation methods, which might also be called on-site annotation and post hoc annotation, respectively. (8) In the former method, an annotator records the timestamp range of the human activity currently performed by a subject. In the latter method, an annotator needs to compare the video footage of the whole data acquisition process with the acquired time series to complete raw data annotation. (9,10) For long-term HAR studies, especially those focusing on activity monitoring, online methods are more appealing for realistic applications, whereas offline methods make it almost impossible to obtain ground truth labels because they are labor-intensive and often unacceptable owing to privacy concerns. (11) There is a tradeoff between the accuracy of an annotation method and the time and effort required for annotation. Offline methods can provide more accurate annotations than online methods but demand massive effort, especially when the set of data is large.
Although online methods are less time-consuming, inaccurate annotations and more ambiguity may be introduced to the labeled dataset.
In this work, we target supervised learning-based applications that mainly acquire data in laboratory settings. In such controlled settings, annotations can often be obtained by video recordings or direct observations. For annotators, video recordings are generally easier to interpret than time series data, and annotation efforts will increase when the numbers of activities and subjects become larger and the data acquisition time becomes longer. To obtain detailed annotations with acceptable accuracy and reduce the annotation efforts in post hoc labeling settings, we proposed a novel automatic labeling framework (ALF) towards multivariate time series data acquired from multiple wearable sensors. Unlike previous methods based on machine learning, we tackle this problem from a speech processing perspective. The proposed framework consists of two main steps: (1) inserting a rest position between two human activities during data acquisition to create a human activity sequence (HAS), and (2) extracting a wavelet energy entropy (WEE) feature from HAS and detecting endpoints of each human activity. The proposed ALF can accelerate the process and reduce the cost of data annotation.
The rest of the paper is organized as follows. In Sect. 2, we provide a brief overview of related work on reducing labeling efforts. In Sect. 3, the proposed ALF is demonstrated in detail. Experimental results of the proposed ALF are discussed in detail in Sect. 4. Finally, conclusions are presented in Sect. 5.

Related Work
Supervised algorithms with high recognition performance usually require significant amounts of labeled training data. In previous HAR studies based on wearable sensors, especially for those using supervised algorithms, manual methods, e.g., video recordings and direct observations, were widely used to obtain annotations. Plotnik et al. developed a wearable assistant for Parkinson's disease patients with freezing of gait (FOG) symptom. (12) In the data acquisition phase, two annotators were assigned to conduct on-site annotation. One was using a digital video camera to record subjects' activities including standing, walking, turning, and freezing, while the other was assigning the corresponding labels in real time to the acceleration data transmitted from a wearable assistant on a laptop. A physiotherapist then determined the endpoints of FOG events in the acquired data according to a post hoc analysis on the video recordings. Similarly, Anguita et al. used a smartphone (Samsung Galaxy S II) to collect human motion data and perceived different types of human activities on the basis of ambient information. (13) Acceleration and angular velocities of typical daily activities, including standing, sitting, lying, walking, going downstairs, and going upstairs, were collected from 30 subjects aged 19 to 48 years. Each subject was requested to perform two rounds of each set of human activities and to rest for 5 s between each round. Acquired data were manually labeled afterwards according to the video footage of subjects' activities. Banos et al. attached two IMUs to each subject's right wrist and left ankle and another sensor to the chest, which provides two-lead ECG measurements. (14) Thus, acceleration, angular velocity, geomagnetic information, and ECG of 12 different outdoor human activities were collected among 10 volunteers. The entire data acquisition process was recorded by a video camera and then manually annotated. 
Furthermore, there are many other public HAR datasets based on wearable sensors or portable devices available online in which acceleration, angular velocity, and geomagnetic signals are mostly collected, and some vital signals are also acquired for future work purposes. (15)(16)(17)(18) Detailed annotation with high accuracy can be obtained by manual annotation methods with intensive labeling efforts.
In long-term human activity monitoring settings, supervised algorithms face the challenge of obtaining a labeled dataset. To cope with the problem of insufficient labeled data, some researchers have moved from fully supervised settings to weakly supervised ones so as to reduce annotation efforts. In combination with experience sampling, multi-instance learning (MIL) obtains knowledge from a very weakly labeled dataset in which labels are associated with sets (bags) of instances instead of individual training instances. In this way, sensor data can be labeled on a very coarse level, which significantly lowers the annotation burden. A bag is labeled positive if and only if at least one positive instance, i.e., the activity we are interested in, exists in the bag, and negative if all instances in the bag are negative. The first work on MIL for time series data in HAR was Ref. 11, which adapted a modification of the support vector machine (mi-SVM) to three different bag-labeling scenarios (i.e., single-labeled bags, multilabeled bags, and majority-voting bags). Its extensive study and comparative evaluation demonstrated that the proposed MIL-based methods can significantly decrease annotation efforts. Guan et al. proposed a novel MIL model based on the work described in Ref. 11 for offline activity recognition from multivariate time series data; it is a generative graphical model based on an auto-regressive hidden Markov model (HMM), which can predict both bag and instance labels. (19) Unsupervised learning techniques are usually used to discover the underlying structures in activity data without the necessity of providing labels. Wyatt et al. viewed activity data as a stream of natural language terms, i.e., sequences of object use, and mined generic models of daily activities from the web, which served as common sense in HAR. (20) Bottcher et al. proposed an unsupervised framework that adopts clustering algorithms to detect transitions between steps of manual work that follows a (semi-)fixed procedure. Although the order and/or number of steps in the process may be given in advance, the framework removes the necessity for labeled data. (21) With unsupervised techniques, the effort of labeling activity data becomes unnecessary.
Transfer learning is defined as the ability to extend what has been learned in one context to new contexts and relies on the assumption that some underlying relationship between the source and the target areas exists and allows for the successful transfer of knowledge from the source to the target. (22,23) In a smart home setting, Kasteren et al. proposed three different function groups to project extracted features to a common space and then used both unlabeled data from house A and labeled data from house B to learn the parameters of a semisupervised HMM for activity recognition in house A. (24) In another work, Kasteren et al. transferred the knowledge obtained by using existing labeled data from various homes to an HMM model applied in a new home. (25) Chen et al., however, proposed a transfer learning framework based on principal component analysis (PCA) transformation, Gale-Shapley similarity measurement, and Jensen-Shannon divergence (JSD) feature mapping. (26) Semisupervised learning makes use of only a small amount of labeled training data and a substantial amount of unlabeled training data. (27) For example, self-training, cotraining, (28) and En-Co-Training are some typical semisupervised techniques whereas a special case of semisupervised learning, (29) i.e., active learning, mainly focuses on labeling the most profitable instances, but human intervention is necessary to some extent for a small amount of labeled data. For example, Zhao et al. proposed a robust active learning model using crowdsourced annotations for activity recognition. (30) The aforementioned methods addressed the issue of data annotation by mainly using two kinds of annotation techniques, which adopt either intensive labeling efforts or learning methods. As the research moves from a laboratory setting to a real-world setting, detailed labeled data are more difficult to obtain. 
As a result, using a small amount of labeled data and prior knowledge of the target activities to train a learning model that can classify activity of daily life (ADL) with acceptable accuracy is the core idea of MIL, unsupervised learning, transfer learning, semisupervised learning, and active learning. However, manual labeling of the initial data is required to some extent by these methods. Thus, we propose a framework to achieve automatic annotation of time series activity data in a laboratory setting, which is based on the prior knowledge of the acquired data and the WEE feature to automatically detect endpoints of human activities.

Proposed ALF
The fact that it is difficult to interpret time series data generated by wearable sensors, such as IMUs, makes it necessary to refer to video recordings of the data acquisition process when the manual annotation is conducted so as to guarantee labeling accuracy. Video recording methods are still widely used in data annotation in a laboratory setting when supervised learning methods are adopted to train an activity classifier because it can be seen as a kind of prior knowledge of the acquired data, which can help annotators interpret time series data during labeling. However, when the number of subjects taking part in the data acquisition increases or the duration of data acquisition per subject becomes longer, labeling efforts would be intensive and time-consuming. Automatic methods to address this issue should take into account the prior knowledge of acquired data. In this work, we consider the information contained in the acquired sequence of human activities as a type of prior knowledge of time series data, e.g., the order of activities, the longest/shortest duration of acquired activities, the longest/shortest duration of rest posture (RP), and the possible lowest acceleration during an activity.
Therefore, we proposed an ALF including HAS, feature extraction, and automatic labeling as shown in Fig. 1. On one hand, HAS is predesigned as a data acquisition scheme for acquiring time series activity data. On the other hand, HAS provides prior knowledge, i.e., constraints and the activity sequence for automatic labeling. Details of each step regarding this framework are demonstrated in the following sections.

HAS
HAS is a set of time-aligned human activities that are predesigned before conducting the data acquisition process. In the real world, the boundary between different human activities is not distinct, which makes it challenging to segment two different but consecutive human activities. (1) For instance, Refs. 12 and 18 collected consecutive human activities with no apparent pause state between every two human activities. There are mainly two different ways of labeling a time series data. One is to label every single activity that is performed by each subject while the other is to label specific activities of interest. Figure 2(a) is subject 1's acceleration and corresponding labels in Ref. 13. Each human activity (i.e., walking, ascending stairs, descending stairs, sitting, standing, and lying down) is labeled as a decimal number from one to six, respectively. Figure 2(b) shows the acceleration of subject 2's activity during data acquisition in Ref. 14. Only activities of interest (i.e., standing and ascending stairs) are labeled. When special locations are required to conduct data acquisition for different kinds of activities, irrelevant activities were created and labeled as number zero between two different activities of interest, e.g., the subject needs to move from the laboratory to a building with stairs so as to ascend the stairs. Lee and Xu (31) and Amft et al. (32) introduced a predefined rest position between two different hand gestures in order to create distinct boundaries for different types of hand gestures. In this work, we inserted RP between each target activity performed by each subject during data acquisition in order to make the boundaries of human activities clearer for labeling. An RP is defined as the stationary status of a subject during data acquisition. For example, standing still, sitting, and lying down are three typical RPs a subject can simulate.
Therefore, we designed the HAS as shown in Fig. 3. It groups the activities of normal people in daily life into three different types, namely, stationary (STACT), quasi-periodic (QPACT), and sporadic (SPACT) activities. STACTs indicate that the subject stays in a static posture, i.e., standing still (STND), sitting (SIT), lying down (LD), and RP. QPACTs show a pattern of recurrence with similar movements during a subject's activities such as walking (WLK), ascending (AS) and descending stairs (DS), jumping up and down (JUD), and squatting slowly (SS) and quickly (SQ). SPACTs are activities that occur sparsely or accidentally in daily life with uncertainty, such as falling, which is rare but of high risk. In particular, we simulated two kinds of falls, i.e., falls with spontaneous protection (SP) and those without SP. A simulated fall with SP means that the subject spontaneously triggers self-protective actions, e.g., bending the knees or stretching the arms straight to mitigate the impact between the subject's body and the foam, to protect himself or herself from an impending and inevitable fall. In contrast, a simulated fall with no spontaneous protection (NSP) simulates a state in which one is unconscious (e.g., fainting or falling asleep) or in a pathological state (e.g., having a syncope or hemiplegia) and fails to spontaneously trigger self-protective actions during a fall. Each type of fall is further categorized into four specific falls on the basis of the direction of falling, i.e., fall forward (FF), fall backwards (FB), fall to the left (FL), and fall to the right (FR). In total, time series data of 17 target activities are collected and denoted as decimal numbers from 1 to 17.
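As a minimal sketch of how the HAS grouping could be held in code, the snippet below lists the target activities by type and checks that they add up to 17. The dictionary layout, the names `HAS_GROUPS`, `REST_LABEL`, and `activity_type`, and the fall identifiers such as `FF-SP` are illustrative assumptions rather than part of the original framework. RP is left out of the groups because it is labeled zero rather than as a target activity, and no label numbers are assigned here.

```python
# Illustrative grouping of the 17 target activities (all names are assumptions).
FALL_DIRECTIONS = ("FF", "FB", "FL", "FR")  # forward, backwards, left, right
FALL_KINDS = ("SP", "NSP")                  # with / without spontaneous protection

HAS_GROUPS = {
    "STACT": ["STND", "SIT", "LD"],
    "QPACT": ["WLK", "AS", "DS", "JUD", "SS", "SQ"],
    "SPACT": [f"{d}-{k}" for k in FALL_KINDS for d in FALL_DIRECTIONS],
}
REST_LABEL = 0  # RP (and transition movements) are labeled zero


def activity_type(name):
    """Return the HAS group (STACT, QPACT, or SPACT) of a target activity."""
    for group, members in HAS_GROUPS.items():
        if name in members:
            return group
    raise KeyError(name)


n_targets = sum(len(members) for members in HAS_GROUPS.values())
print(n_targets)  # 3 + 6 + 8 = 17
```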
To date, a variety of hardware and sensors have been adopted to acquire human motion data. These devices can be categorized into two types, namely, commercial products (Samsung Galaxy S2, HTC Magic, SHIMMER, and Xsens) and research prototypes (Wocket, 3dNX, GENEA, and e-AR). (33)(34)(35)(36)(37)(38)(39)(40) Commercial products such as a smartphone or a smartwatch may be uncomfortable when attached to places other than the wearer's wrists, while products such as SHIMMER and Xsens are expensive. Moreover, research prototypes developed by other researchers may not be for sale or otherwise obtainable. To obtain low-cost miniature hardware that acquires motion data of human activities, we developed IMU modules, each of which is based on a microcontroller unit (MCU), STM32F103R, and a six-axis MEMS motion tracking device, MPU6050, which consists of a tri-axial accelerometer and a tri-axial gyroscope, as shown in Fig. 4(a). Nine IMUs were attached to each subject at nine different locations using hook-and-loop fasteners as shown in Fig. 4. Six subjects (five males and one female) were chosen from students aged between 26 and 28. Each subject was asked to follow the experimenter's instructions to start or terminate a specific activity and to perform each activity thoroughly and completely according to his or her understanding of it. Also, slow and steady moves were recommended to subjects during transition activities. In particular, all simulated falls were self-initiated by each subject, and all subjects were protected by a foam after a simulated fall. No further instruction on how to perform each activity was provided. During data acquisition, each subject was asked to start by performing activity 1 at location A and to end with activity 17 at location E, sequentially according to the HAS. Moreover, each subject changed location as the target activities were performed, as shown in Fig. 5, i.e., starting with sitting at A → standing at B → lying down at E → walking steadily (B ↔ C) → ascending stairs (C → D) → descending stairs (D → C) → jumping up and down at C → … Each IMU module was designed to collect a stream of data with a duration of 10 min at a constant sampling rate of 100 Hz. As a result, a stream of activity data lasting 600 s was collected for each subject, and the durations of STACT, QPACT, and SPACT are approximately 105, 235, and 260 s, respectively. During data acquisition, some transition movements are inevitably created owing to the necessity of moving from one location to another so as to complete the whole process. Therefore, RP and transition movements should both be labeled as zero in the following automatic labeling process. Figures 6(a) and 6(b) show two sets of FW acceleration obtained from two subjects, the target activities of which have clearer boundaries than those in Fig. 2. Note that the presented HAS is merely a scheme and cannot be seen as a direct source of ground-truth labels for the collected data, because neither a timekeeper nor a subject can time or perform the target activities exactly as planned in the HAS. That is, leading or lagging in timing or performing a target activity is inevitable and shifts the schedule of each subsequent activity in the collected data.

Feature extraction
Various studies have been conducted on endpoint detection for speech signals. (41)(42)(43) The main task of endpoint detection is to detect the start and end points of a speech signal. The collected acceleration data in Figs. 6(a) and 6(b) are similar in morphology to a speech signal. That is, the stationary part of the acceleration can be treated as the silent part of a speech signal, while the oscillating part of the acceleration can be treated as the part containing speech. In this work, we therefore process the collected data from a speech processing perspective.
Many algorithms have been proposed to tackle the endpoint detection of speech signals, most of which extract different features from the original signal, e.g., spectral entropy, cepstrum distance, and dual thresholds. Among various endpoint detection techniques, energy-based methods are the most widely applied solutions to this problem. (41) These algorithms are mostly based on the short-time Fourier transform (STFT). However, the STFT has a fixed resolution, which might lead to poor time/frequency resolution in the analysis of nonstationary signals; both a speech signal and the time series of acceleration collected here are nonstationary. Moreover, the energies of different activities differ from each other. To locate the boundaries of activity data with better accuracy, we applied multilevel one-dimensional (1-D) wavelet decomposition to the original activity data and extracted the corresponding detail coefficients of each level. In this way, the wavelet energy distribution along the decomposition levels is obtained. The continuous wavelet transform (CWT) is defined as Eq. (1),

$$W_f(a, b) = \langle f, \psi_{a,b} \rangle = \frac{1}{\sqrt{|a|}} \int_{-\infty}^{+\infty} f(x)\, \psi^{*}\!\left(\frac{x - b}{a}\right) dx, \tag{1}$$

where the CWT of a given function f(x) is equal to the inner product between f(x) and the wavelet basis function ψ_{a,b}(x) with scale a and translation b. In practice, the discrete wavelet transform (DWT) is mostly adopted and its parameters usually take dyadic values, i.e., a = 1/2^j, b = k/2^j (j, k ∈ Z). Thus, the DWT of f(x) is denoted as Eq. (2), which is a dyadic orthogonal wavelet transform that creates wavelet basis functions for multiresolution analysis (MRA).

$$W_f(j, k) = 2^{j/2} \int_{-\infty}^{+\infty} f(x)\, \psi^{*}(2^{j} x - k)\, dx. \tag{2}$$
To apply MRA to f(x) is to perform multilevel wavelet decomposition on f(x) and to reconstruct f(x) as Eq. (3),

$$f(x) = A_J(x) + \sum_{i=1}^{J} D_i(x), \tag{3}$$

where A_J(x) is the approximation at level J and D_i(x) is the detail at level i. In this work, segments with oscillating acceleration contain high-frequency components while those with steady acceleration contain low-frequency components. Thus, the magnitude of the detail coefficients indicates the energy of the signal segment, and the wavelet energy of a J-level 1-D wavelet decomposition is defined as Eq. (4),

$$E_t = EA_J + \sum_{i=1}^{J} ED_i, \tag{4}$$
where E_t is the total wavelet energy, i indexes the level of the wavelet decomposition (i = 1, 2, ..., J), ED_i denotes the detail energy of the i-th level wavelet decomposition, and EA_J denotes the approximation energy of the J-th level wavelet decomposition. In combination with Shannon entropy, WEE indicates the distribution of human activities along all levels of the wavelet decomposition and can be defined as Eq. (5), where p_i is the probability density of the i-th level wavelet energy, p_i = ED_i/E_t (i = 1, 2, ..., J), and Enpy_WE denotes WEE. Let the collected time series activity data be f(x), a vector of dimension M × 1. The sliding window technique is adopted to extract the WEE feature from f(x): a sliding window with a fixed length L is moved along f(x) with a constant step size d. By centering the current data point in the middle of the sliding window, M frames of segments are created. In each frame, a 10-level 1-D wavelet decomposition is performed with Daubechies-4 wavelets. As a result, a transformed version of f(x) is created. Taking subject 1's FW resultant acceleration as an example, a segment of the original unlabeled FW resultant acceleration from 0 to 300 s is shown in Fig. 7(a). After extracting the WEE feature, it is transformed to the form in Fig. 7(b), in which zero crossings of the value of WEE occur. This makes it difficult to select the optimal threshold for endpoint detection. Thus, the absolute value of WEE is calculated as shown in Fig. 7(c), in which the zero-crossing rate of |WEE| is qualitatively lower than that in Fig. 7(b). However, from 200 to 240 s, a large group of data points with values close to zero still exists. To guarantee the quality of the selected threshold for endpoint detection, average filtering was applied to smooth |WEE|. Eventually, a smoothed transformed version of f(x) is obtained as shown in Fig. 7(d), which is taken as the base signal for automatic labeling in Sect. 3.3.
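The feature extraction described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the MATLAB implementation used in this work. It substitutes an orthonormal Haar wavelet for Daubechies-4 and uses 3 levels on a 32-sample window instead of 10 levels so that it needs no external library. It pads the edges by repeating the end samples. It also takes WEE as the Shannon entropy of p_i = ED_i/E_t with the natural logarithm; with that form the values are already nonnegative, so the absolute-value step is kept only to mirror the description. All function names and default parameters are hypothetical.

```python
import math


def haar_dwt(signal):
    """One level of an orthonormal Haar DWT: returns (approximation, detail)."""
    s = 1.0 / math.sqrt(2.0)
    pairs = range(0, len(signal) - 1, 2)
    approx = [(signal[k] + signal[k + 1]) * s for k in pairs]
    detail = [(signal[k] - signal[k + 1]) * s for k in pairs]
    return approx, detail


def wavelet_energy_entropy(frame, levels):
    """Shannon entropy of the detail-energy distribution p_i = ED_i / E_t."""
    approx = list(frame)
    detail_energy = []
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        detail_energy.append(sum(c * c for c in detail))
    total = sum(detail_energy) + sum(c * c for c in approx)  # E_t = EA_J + sum(ED_i)
    if total == 0.0:
        return 0.0
    entropy = 0.0
    for energy in detail_energy:
        p = energy / total
        if p > 0.0:
            entropy -= p * math.log(p)
    return entropy


def sliding_wee(signal, window, step, levels):
    """WEE of frames centered on each data point (edge samples are repeated)."""
    half = window // 2
    padded = [signal[0]] * half + list(signal) + [signal[-1]] * half
    return [wavelet_energy_entropy(padded[i:i + window], levels)
            for i in range(0, len(signal), step)]


def moving_average(values, width):
    """Average filter with a window that shrinks at the edges."""
    half = width // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def sa_wee(signal, window=32, step=1, levels=3, smooth=9):
    """Smoothed absolute WEE, the base signal for threshold-based labeling."""
    wee = sliding_wee(signal, window, step, levels)
    return moving_average([abs(v) for v in wee], smooth)


# Toy signal: a stationary segment (1 g) followed by an oscillating segment.
rest = [1.0] * 128
activity = [1.0 + 0.5 * math.sin(2 * math.pi * 0.2 * n) for n in range(128)]
feature = sa_wee(rest + activity)
```

On this toy signal the feature is zero well inside the stationary segment and positive inside the oscillating one, which is the property that the threshold-based segmentation relies on.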

Automatic labeling
The original signal f(x) is transformed into a smoothed absolute WEE of f(x), which is denoted as SA_WEE(x). In comparison with Fig. 7(a), the fact that SA_WEE(x) is nonnegative and usually resembles a bell curve within activity intervals enables threshold-based endpoint detection. In this work, automatic labeling is divided into three main steps: preliminary segmentation, endpoint detection, and label assignment.
In the preliminary segmentation stage, the primary goal is to find a suitable threshold that can distinguish activity intervals from stationary intervals in SA_WEE(x). Activity intervals contain data points whose values of SA_WEE(x) are larger than the selected threshold and indicate QPACT, SPACT, and some transition activities, i.e., SIT to STND, STND to LD, LD to STND, turning around before DS, WLK from location C to B as shown in Fig. 5 after finishing JUD, and LD to STND after each simulated fall. In contrast, stationary intervals contain data points whose values of SA_WEE(x) are no larger than the selected threshold and indicate STACT and RP. As observed from the collected activity data, e.g., as shown in Fig. 6, activity intervals and stationary intervals each account for approximately 50% of the total sampling time. Denote the sampling rate of each IMU as R_f, the total sampling time as T, the total number of data points in all activity intervals as N_a, and the total number of data points in all stationary intervals as N_s. Define the activity rate as the proportion of N_a over the total number of data points in the total sampling time, i.e., R_a = N_a / (R_f × T). Set the estimated activity rate ER_a to 50%. The method for finding the initial segmenting threshold in the preliminary segmentation stage is summarized in Algorithm 1. Figure 8(a) shows the result of the preliminary segmentation of subject 1's FW acceleration, which contains incorrectly segmented activity intervals with various durations. This issue is dealt with in the endpoint detection step.
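Algorithm 1 is not reproduced in this section, so the following is only a sketch of one way to obtain an initial threshold that is consistent with the definitions above. It takes the threshold as the empirical quantile of SA_WEE(x) that leaves a fraction ER_a of the data points above it. The quantile rule and the function names are assumptions; the actual Algorithm 1 may search for the threshold differently.

```python
def initial_threshold(sa_wee_values, estimated_activity_rate=0.5):
    """Threshold that leaves about ER_a of the data points strictly above it."""
    ordered = sorted(sa_wee_values)
    n_stationary = int(round(len(ordered) * (1.0 - estimated_activity_rate)))
    n_stationary = min(max(n_stationary, 1), len(ordered))
    return ordered[n_stationary - 1]


def activity_rate(sa_wee_values, threshold):
    """R_a = N_a / (R_f * T); R_f * T equals the total number of data points."""
    n_active = sum(1 for v in sa_wee_values if v > threshold)
    return n_active / len(sa_wee_values)


def binary_labels(sa_wee_values, threshold):
    """One for activity intervals, zero for stationary intervals."""
    return [1 if v > threshold else 0 for v in sa_wee_values]


# Toy SA_WEE values (hypothetical), with half of the points in activity.
values = [0.0, 0.1, 0.0, 0.9, 1.2, 0.8, 0.05, 0.0, 1.1, 0.7]
threshold = initial_threshold(values, estimated_activity_rate=0.5)
labels = binary_labels(values, threshold)
rate = activity_rate(values, threshold)
```

Ties at the threshold value can make the achieved rate deviate from ER_a, which is one reason why a later correction step is still needed.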
The endpoint detection step determines the boundaries of activity intervals. Segmentation results from the preliminary segmentation stage are assigned binary labels in which the activity intervals are labeled with one and the stationary intervals are labeled with zero. However, this binary series contains many incorrectly labeled intervals since the initial segmenting threshold is sensitive to local turbulence and transition activities. The strategy in this work for eliminating false segments is to extract constraints and information from HAS, which form the basis of an error correction of the binary series. According to HAS, the time stamps separating STACT from QPACT and QPACT from SPACT can be obtained as T_1 = 100 s and T_2 = 340 s, respectively. Define the upper bound of the shortest duration of an activity interval as δ_sdur = 2.5 s. A batch error correction is first carried out to set the label of any interval with a very short duration to zero. Transition activities create prominent acceleration turbulence that may affect the correct labeling in the preliminary segmentation and correspond to time-aligned bell curves on the curve of SA_WEE(x). For instance, intervals containing transition activities between 0 and 100 s may be labeled as one, whereas zero is the label we expect for transition activities. Thus, label reversion needs to be done before the right boundary of the third transition activity (i.e., LD to STND) between 0 and 100 s. Denote the duration of a transition activity as δ_tran. As for QPACT between 100 and 340 s, transition activities occur between AS and DS, and after JUD. For the former transition activity, δ_tran ∈ (2.5 s, 10 s]. For the latter transition activity, δ_tran ∈ (10 s, 30 s] since steady and slow moves are required during all transition activities.
As for SPACT between 340 and 600 s, the upper bound of the duration of a complete fall is defined as δ_f = 6 s, and the lower bound of the maximum acceleration during a complete fall is defined as δ_facc = 1.8 g. On the basis of the abovementioned constraints and information, the algorithm for the endpoint detection stage is summarized in Algorithm 2. Figure 8(b) shows the result of the endpoint detection of subject 1's FW acceleration. In contrast to Fig. 8(a), all incorrectly segmented activity intervals are corrected.
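Two of the constraints above lend themselves to a short illustration: the batch correction that zeroes activity intervals no longer than δ_sdur, and a plausibility check of a fall interval against δ_f and δ_facc. The sketch below is not Algorithm 2, which is not reproduced in this section; it omits label reversion and the transition-duration rules, and the function names and the 10 Hz toy rate are assumptions.

```python
def runs(labels):
    """Run-length encode a label series as (value, start, end_exclusive)."""
    out = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            out.append((labels[start], start, i))
            start = i
    return out


def drop_short_activity(labels, rate_hz, min_duration_s=2.5):
    """Set activity intervals lasting no longer than delta_sdur to zero."""
    cleaned = list(labels)
    for value, start, end in runs(labels):
        if value == 1 and (end - start) / rate_hz <= min_duration_s:
            cleaned[start:end] = [0] * (end - start)
    return cleaned


def is_fall_candidate(accel_g, start, end, rate_hz,
                      max_duration_s=6.0, min_peak_g=1.8):
    """Check an interval against delta_f (duration) and delta_facc (peak)."""
    duration = (end - start) / rate_hz
    return duration <= max_duration_s and max(accel_g[start:end]) >= min_peak_g


# Toy binary series at 10 Hz: a 1 s spike of turbulence and a 4 s activity.
rate_hz = 10
raw = [0] * 20 + [1] * 10 + [0] * 20 + [1] * 40 + [0] * 10
cleaned = drop_short_activity(raw, rate_hz)

# Toy resultant acceleration in g with one impact peak inside the activity.
accel = [1.0] * 100
accel[60] = 2.3
fall_like = is_fall_candidate(accel, 50, 90, rate_hz)
```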
The final step is to label the result exported by the endpoint detection stage according to the order of target activities predefined by HAS, which is to assign decimal numbers from 1 to 17 to all activity intervals in sequence. Figure 9 shows the automatic labeling result of subject 1's FW acceleration in which all target activity intervals are correctly assigned with the corresponding labels from a qualitative perspective.
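The label assignment step can be sketched as follows, assuming the endpoint detection stage returns a binary series in which every remaining activity interval is a target activity. The function name and the consistency check that raises an error are illustrative assumptions; the ordered label list stands for the order of target activities given by the HAS.

```python
def assign_labels(binary, ordered_labels):
    """Give the k-th activity interval the k-th label of the HAS order."""
    out = [0] * len(binary)
    interval = -1
    previous = 0
    for i, value in enumerate(binary):
        if value == 1:
            if previous == 0:  # rising edge: a new activity interval starts
                interval += 1
                if interval >= len(ordered_labels):
                    raise ValueError("more activity intervals than target activities")
            out[i] = ordered_labels[interval]
        previous = value
    if interval + 1 != len(ordered_labels):
        raise ValueError("fewer activity intervals than target activities")
    return out


# Toy example with three target activities instead of 17.
binary = [0, 1, 1, 0, 0, 1, 0, 1, 1, 1]
labeled = assign_labels(binary, [1, 2, 3])
```

Raising an error when the interval count does not match the sequence length makes a failed endpoint detection visible instead of silently shifting every later label.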

ALF performance indicators
The purpose of this study is to reduce manual efforts in labeling raw time series data for body-worn sensor-based HAR studies that conduct data acquisition in a laboratory setting. Hence, two indicators, overall labeling accuracy (OLA) and average labeling time (ALT), are proposed to measure the performance of the proposed ALF. To obtain OLA, ground-truth labels should be given as a benchmark. During data acquisition, each subject's activities were recorded by a video camera, which helped annotators interpret the time series data and perform post hoc labeling. Moreover, to obtain a better insight into OLA, precision, recall, and F-measure are also adopted as subindicators of OLA. To reduce the manual annotation error, four annotators were asked to conduct data labeling, and the average of their activity endpoints was taken as the ground-truth labels of the collected activity data. To obtain ALT, the labeling time, including the label checking and correction time, taken by each annotator during the whole manual labeling process was measured and then averaged. Note that all IMU modules were synchronized before data acquisition so that all IMU modules collected human activity data simultaneously. Thus, once the raw data from one module are labeled, the raw data from all IMU modules are labeled. We selected data from the FW IMU module as our target of automatic labeling. In addition, the proposed ALF was also examined on a modified version of the dataset presented in Ref. 13. The performance of the proposed ALF was evaluated and verified using MATLAB R2016b, running on a Windows 10 x64 operating system with a Core™ i7-3612QM 2.10 GHz CPU and 8 GB of memory.

OLA
The OLA is defined as the average ratio per subject of the number of correctly labeled data points over the total number of data points, as shown in Eq. (6) for n subjects,

$$R_{OLA} = \frac{1}{n} \sum_{i=1}^{n} \frac{N_c^{(i)}}{N_{all}^{(i)}}, \tag{6}$$
where N_c^{(i)} is the number of correctly labeled data points and N_all^{(i)} is the total number of data points of subject i. The labels created by the proposed ALF were compared with the ground truth to obtain R_OLA. The labeling accuracies of each subject's activity data are 96.2, 94.5, 95.6, 96.2, 96.4, and 96.1% for subjects 1 to 6, respectively. Thus, R_OLA is 95.8%. If the automatic labeling result is taken as the response of ALF from a supervised learning perspective, F-measure can be taken as a performance measurement. Precision, recall, and F-measure are usually used in a binary classification setting and are computed from three counts: true positive (TP), false positive (FP), and false negative (FN). (26) In our work, nonzero labels (i.e., 17 target activities) are treated as positive, and zero labels (i.e., transition activities and RP) are treated as negative. In this manner, the multilabel labeling issue in this work is turned into a binary labeling issue. The precision, recall, F-measure, and accuracy of the proposed ALF tested using the collected activity data are presented in Table 1, in which 'M' denotes a male subject and 'F' denotes a female subject in the row of subject number. The average precision, recall, and F-measure are 91.0, 99.5, and 95.0%, respectively. Of the intervals automatically assigned target activity labels, 91.0% are correctly assigned, and 99.5% of the true target activities are correctly assigned the corresponding labels. In Table 1, subject 2's precision is lower than 90.0%. Further comparison between the automatic labeling results and the ground truth shows that the durations of the longest mislabeled segments of each subject's data are 1.15, 6.19, 2.86, 1.45, 1.16, and 1.46 s. Figures 10(a) and 10(b) show the automatically labeled and mislabeled intervals of subject 2's and 3's FW accelerations.
Mislabeled intervals are assigned a negative label of −5. In both cases, transition activities (marked with purple dashed circles) that occurred at the beginning or end of a target activity are incorrectly assigned the label of the next or preceding target activity. Even though all subjects were instructed in advance to remain stationary (i.e., to stay in RP) before and after performing a target activity, some degree of inaccuracy in performing each target activity can still occur. Figures 10(c)-10(f) show the automatically labeled and mislabeled intervals of the other subjects' FW accelerations.
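The evaluation described above can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `overall_labeling_accuracy` and `binary_scores` are hypothetical, labels are assumed to be per-sample integers with 0 standing for transition activities and RP, and a sample is assumed to count as positive whenever its label is nonzero.

```python
def overall_labeling_accuracy(pred_by_subject, true_by_subject):
    """Average, over subjects, of the fraction of correctly labeled samples."""
    ratios = []
    for pred, true in zip(pred_by_subject, true_by_subject):
        if len(pred) != len(true):
            raise ValueError("prediction and ground truth differ in length")
        correct = sum(1 for p, t in zip(pred, true) if p == t)
        ratios.append(correct / len(true))
    return sum(ratios) / len(ratios)


def binary_scores(pred, true):
    """Precision, recall, and F-measure after mapping nonzero labels to positive.

    Assumption: only the positive/negative status of a sample is compared,
    not which target activity the nonzero label refers to.
    """
    tp = sum(1 for p, t in zip(pred, true) if p != 0 and t != 0)
    fp = sum(1 for p, t in zip(pred, true) if p != 0 and t == 0)
    fn = sum(1 for p, t in zip(pred, true) if p == 0 and t != 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Toy example: the detected activity intervals are one sample wider than the truth.
true_labels = [0, 0, 1, 1, 1, 0, 2, 2, 0, 0]
pred_labels = [0, 1, 1, 1, 1, 1, 2, 2, 0, 0]
precision, recall, f_measure = binary_scores(pred_labels, true_labels)
ola = overall_labeling_accuracy([pred_labels], [true_labels])
```

In this toy example, the widened intervals produce two FPs and no FN, so recall is 1.0 while precision drops to 5/7 and the pointwise accuracy is 0.8. This mirrors the pattern of high recall and lower precision reported in Table 1.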
In Ref. 13, a smartphone was used to collect data on six activities, namely, walking, walking upstairs, walking downstairs, sitting, standing, and laying, and the data were contributed as the UCI HAR Dataset (UHD). However, the RP status proposed in this work was not incorporated into the data acquisition process in Ref. 13. Thus, the UCI HAR Dataset was slightly modified so that the proposed ALF could also be evaluated on a public dataset. During data acquisition in Ref. 13, subjects were instructed to perform each activity freely at least twice in a predefined sequence, i.e., standing (30 s) → sitting (30 s) → lying down (30 s) → walking (30 s) → walking downstairs (36 s) → walking upstairs (36 s), and each activity sequence was performed twice. Accordingly, RPs were inserted between consecutive target activities, and a HAS for the modified UCI HAR Dataset (mUHD-HAS) was created, as shown in Fig. 11. To maintain the structure of the original dataset, RP data were created by sampling stationary activities from the same time series: the RP inserted after STND is sampled from the last 10 s of the previous STND; the RP inserted after SIT is sampled from the last 10 s of the previous SIT; the RP inserted after LD is sampled from the last 10 s of the previous STND, because one needs to get up before walking; and the RPs inserted between WLK and DS and between DS and AS are both sampled from the first 10 s of the previous STND, because one would be standing during such an RP. In this way, the data from the 30 volunteers in the UCI HAR Dataset were modified as shown in Fig. 11 and tested with the proposed ALF. Figures 12(a) and 12(b) show subject 2's resultant acceleration from the UCI HAR Dataset before and after modification, together with the corresponding original labels.
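The RP insertion rules above can be sketched as follows. This is an illustrative sketch only: the names `insert_rest_positions` and `RP_SOURCE` are hypothetical, the default sampling rate `fs=50` is an assumption to be replaced by the actual rate of the data, and no RP is inserted after AS because the text does not describe one.

```python
RP = "RP"

# For each target activity, the stationary segment its following RP is sampled
# from, and which end of that segment is used (per the rules stated in the text).
RP_SOURCE = {
    "STND": ("STND", "last"),
    "SIT": ("SIT", "last"),
    "LD": ("STND", "last"),    # one needs to get up before walking
    "WLK": ("STND", "first"),  # standing serves as the RP
    "DS": ("STND", "first"),
}


def insert_rest_positions(segments, fs=50, rp_seconds=10):
    """Insert RP segments between target activities.

    segments: list of (label, samples) tuples in acquisition order.
    Returns a new list of (label, samples) tuples with RP segments inserted.
    """
    n = int(fs * rp_seconds)
    out = []
    latest = {}  # most recent samples seen for each label
    for idx, (label, samples) in enumerate(segments):
        samples = list(samples)
        out.append((label, samples))
        latest[label] = samples
        if idx == len(segments) - 1 or label not in RP_SOURCE:
            continue
        src_label, end = RP_SOURCE[label]
        src = latest.get(src_label)
        if src is None:
            continue  # no earlier stationary segment to sample from
        rp = src[-n:] if end == "last" else src[:n]
        out.append((RP, list(rp)))
    return out
```

Because the RP samples are copied from segments of the same recording, the modified series keeps each subject's own sensor characteristics, which is the stated reason for sampling rather than synthesizing RP data.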
Before applying ALF to the modified UCI HAR Dataset, constraints and information were extracted from mUHD-HAS. The estimated activity rate $ER_a$ was set to 40%. Assume that the [...] Particularly for stationary activities between 0 s and $T_1$ and between $T_2$ and $T_3$, the strategy of assigning labels to each stationary activity was changed: both intervals $[0, T_1]$ and $[T_2, T_3]$ are first divided into three intervals of identical duration, and the corresponding labels are then assigned to each interval according to mUHD-HAS.

On the basis of the abovementioned constraints and information, precision, recall, F-measure, and accuracy were adopted to evaluate the performance of the proposed ALF on 18 volunteers' activity data from mUHD. The remaining 12 volunteers' activity data (i.e., subjects 1, 9, 17, 18, 19, 21, 22, 23, 24, 26, 28, and 30) are excluded from the test because none of the sequences in their collected data follows the mUHD-HAS. Intervals with a zero label are treated as negative samples, whereas intervals with nonzero labels are treated as positive samples. Results are presented in Table 2. The average accuracy is 82.1% and the average precision is 81.9%. These relatively low values may be caused by inaccuracy in choosing the separating timestamps, which is a consequence of the uncertainty in the durations of the subjects' data. The average recall is 97.3%, which indicates that the majority of target activities are correctly labeled, and the corresponding average F-measure is 88.9%.

One limitation of this work is that errors are propagated after preliminary segmentation. The preliminary segmentation threshold is chosen according to the estimated activity rate, which depends on the design of the HAS and on observation of the collected data, and this makes the aforementioned batch error correction after preliminary segmentation necessary. The results of the proposed ALF on the collected dataset [Figs. 10(a)-10(f)] and the mUHD show that activity intervals detected by ALF are generally wider than those defined by the ground truth labels, which leads to a certain number of FPs/FNs for each target activity. As the results show, for SPACT, activity intervals defined by the ground truth labels are enclosed by those detected by ALF, which means that there are only FPs and no FNs. For STACT and QPACT, there are both FPs and FNs. Since a fall could be fatal to seniors, the risk of having more FNs than FPs after adopting ALF is higher for SPACT than for STACT and QPACT. A possible strategy to lower this risk is to select a smaller initial threshold, namely, a larger $ER_a$, for preliminary segmentation, which expands the activity intervals but increases the error rate of preliminary segmentation.
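The relation between the estimated activity rate and the preliminary segmentation threshold can be illustrated with a small sketch. The rule used here is an assumption made only for illustration, since the exact threshold rule is not given in this section: the threshold is taken as the value of the transformed signal above which roughly a fraction `activity_rate` of the samples lie. The function names `threshold_for_activity_rate` and `active_mask` are hypothetical.

```python
def threshold_for_activity_rate(values, activity_rate):
    """Pick a threshold so that roughly `activity_rate` of the samples exceed it.

    Assumption: a simple quantile rule stands in for the paper's actual
    threshold selection. Ties in `values` can make the realized fraction
    differ from the requested one.
    """
    if not 0.0 < activity_rate < 1.0:
        raise ValueError("activity_rate must lie strictly between 0 and 1")
    if len(values) < 2:
        raise ValueError("at least two samples are required")
    ordered = sorted(values)
    n_active = round(activity_rate * len(ordered))
    n_active = max(1, min(len(ordered) - 1, n_active))
    # The threshold is the largest value in the presumed stationary portion.
    return ordered[len(ordered) - n_active - 1]


def active_mask(values, threshold):
    """Mark samples strictly above the threshold as belonging to an activity."""
    return [v > threshold for v in values]


# Toy signal: low values for stationary samples, high values for activity.
signal = [0.1, 0.2, 0.1, 0.9, 1.0, 0.8, 0.1, 0.2, 0.7, 0.1]
thr_40 = threshold_for_activity_rate(signal, 0.4)
thr_60 = threshold_for_activity_rate(signal, 0.6)
n_active_40 = sum(active_mask(signal, thr_40))
n_active_60 = sum(active_mask(signal, thr_60))
```

Under this assumed rule, raising the activity rate from 40% to 60% lowers the threshold and marks more samples as activity. This reproduces the tradeoff noted above: wider activity intervals reduce the chance of FNs, at the cost of more stationary samples being absorbed into activity intervals.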

ALT
Each annotator labeled the activity data collected from the six subjects all at once. The labeling time of each annotator is presented in Table 3. The ALT is 76.8 min. In contrast, the automatic labeling time of the proposed ALF is 1116.71 s, i.e., approximately 18.6 min in total. Therefore, the time spent on automatic labeling using the proposed ALF is 75.8% less than the average time spent on manual labeling. In addition, the automatic labeling time of the proposed ALF on the modified UCI HAR Dataset (18 subjects) is 2160.296 s, i.e., approximately 36.0 min in total. Since Ref. 13 provides no information about the time spent on labeling the whole dataset, and complete video recordings of the UCI HAR Dataset are not available, the manual and automatic labeling times on the UCI HAR Dataset are not compared.
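The reported time saving follows directly from the two figures above, as the short check below shows. The function name `relative_time_saving` is hypothetical and is used only for this illustration.

```python
def relative_time_saving(manual_minutes, automatic_seconds):
    """Fraction of the manual labeling time saved by automatic labeling."""
    automatic_minutes = automatic_seconds / 60.0
    return (manual_minutes - automatic_minutes) / manual_minutes


# ALT of 76.8 min versus 1116.71 s of automatic labeling on the collected dataset.
saving = relative_time_saving(manual_minutes=76.8, automatic_seconds=1116.71)
print(f"{saving:.1%}")  # prints 75.8%
```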

Conclusions
HAR has become an active research area over the last few years, since related research outputs play an increasingly important role in our aging society. In this study, we aimed to reduce the effort of labeling time series data collected from diverse individuals using multiple body-worn IMUs in a laboratory setting. Instead of using a small amount of labeled data to obtain a robust classifier, this work focuses on developing an ALF that directly assigns accurate labels to unlabeled raw data. In the proposed ALF, a HAS serves as the information center: it is predefined as a sequence of different target activities concatenated by RPs and provides constraints that improve the robustness of the automatic labeling algorithm. This algorithm, as the execution part, first transforms the original activity data into the corresponding trend of absolute WEE and then segments it into activity intervals and stationary intervals on the basis of constraints and information extracted from the HAS. When the proposed ALF was tested on our collected data, an OLA of 95.8% was obtained, with an average precision, average recall, and average F-measure of 91.0, 99.5, and 95.0%, respectively. The total labeling time of the proposed framework is approximately 18.6 min, which shortens the manual labeling time (average of 76.8 min) by 75.8%. A public dataset, the UCI HAR Dataset, was modified so that the proposed ALF could be applied to it. On this dataset, we obtained an average precision of 81.9%, an average accuracy of 82.1%, an average recall of 97.3%, and an average F-measure of 88.9%. The total time for automatically labeling the modified UCI HAR Dataset (18 subjects) is approximately 36.0 min. Both experimental results show that the proposed ALF can significantly reduce the labeling effort while maintaining labeling accuracy, and that the framework can be adopted as a rapid and reliable way of generating labeled datasets.