Identifying Staying Places with Global Positioning System Movement Data Using 3D Density-based Spatial Clustering of Applications with Noise

In this study, we visualize and analyze global positioning system (GPS) data to identify the spatiotemporal characteristics of moving and staying patterns. As a case study, we collect and process GPS data generated by students participating in inquiry-based fieldwork. Space-time path (STP) analysis is applied to visualize movement, while density-based spatial clustering of applications with noise (DBSCAN) is used to identify spatial clusters or staying places (sites where people spend time, such as homes and workplaces). We find that some clusters derived by DBSCAN are not actual clusters, and the times spent in some clusters are overestimated when we investigate the time spent in each cluster. To resolve this, 3D DBSCAN is used to find precise clusters. The results show that the 3D DBSCAN method is effective in finding clusters of spatiotemporal data. The 3D DBSCAN methodology proposed in this study can be applied effectively in movement data analysis, such as tourist travel patterns through SNS, trajectories of cars, vessels, or wildlife, and the movement of visitors in parks.


Introduction
Owing to recent increases in mobile data and web usage, it has become easier to use the log data recorded during daily life, and many studies have analyzed these data to find significant information. Log data created from various sensors contain spatiotemporal information; however, the characteristics of these data are not easy to understand because the data volume is large and the structure is very complex. Visualizing spatiotemporal data plays an important role in discovering data characteristics. A typical method of visualizing 3D spatiotemporal data is space-time path (STP) analysis. (1-3) Zhao et al. used STP to visualize the movement behavior of tourists visiting New Zealand and analyzed which activities (e.g., sightseeing, vacation, and dining) tourists were performing at staying places. (2) Zhao et al. visualized Halifax census tract data according to individual daily movement patterns. (1) Ostermann showed that it was possible to trace the movement paths of park visitors and identify places with a high density of activities by using global positioning system (GPS) location information. (4) In addition, several studies have used GPS data to analyze users' positions and extract staying places such as users' homes and workplaces (5) or points of interest (POIs) (6,7) by considering movement and time information. Other studies have analyzed user movement paths to understand preferred routes or find similar trajectories, (8,9) proposed systems for identifying major places for each user and recommending activities that can be done at these places, (10) and proposed a methodology for predicting a user's next destination using their GPS data. (11,12)
Clustering algorithms are commonly used in these studies to identify the staying places of groups. (6,10,11,13-16) The density-based spatial clustering of applications with noise (DBSCAN) method, which can distinguish areas with high density from those with low density within a constant spatial range, is widely applied. (17,18) However, DBSCAN is limited when analyzing spatially dense regions because it does not consider the time dimension. In this study, we propose a 3D DBSCAN methodology to identify clusters of spatiotemporal data. We apply this methodology to the GPS data collected from inquiry-based fieldwork and analyze how the 3D DBSCAN results differ from regular DBSCAN results.

Log data collection and processing
In this study, we analyze log data collected during an inquiry-based fieldwork program, "sustainable development of Yangdong Village," a location registered as a World Heritage Site for its traditional Korean Hanok houses. Although many buildings have been preserved, some have been converted from residential to commercial use and from traditional to modern style. Student activities at Yangdong Village are of three types: making basic observations of whether an individual house has been changed or not, performing an activity requiring a judgment on whether a house should be allowed to change or must be preserved, and conducting interviews with residents. The house-based investigation takes little time, but the activities requiring students' judgment or interviews with residents take significantly longer. The students use Collector for ArcGIS on mobile devices to record their answers to inquiry questions, and the results are saved in the cloud-based ArcGIS Online in real time.
The Collector for ArcGIS application does not track and store students' movement logs. (19) Therefore, a multi-camcorder equipped with a GPS device is used for this purpose. Table 1 shows the environment for collecting movement logs as well as the equipment and tools used to collect log data. A movement log contains the time, latitude, longitude, altitude, speed, and slope at which field activities are performed, together with error information. This information is saved in the .nmea file format. Figure 1 shows an example of 2D GPS data.
The movement log, once collected, is processed in three steps, as shown in Fig. 2. First, the .nmea file, which is the log data collected from the multi-camcorder, is converted into a .txt file by the GPS Visualizer program, which in turn is converted into a .csv file. Second, the .csv file is preprocessed by removing data that lie beyond the spatial range or contain errors. Third, the .csv file is converted into a .shp file, in which a spatial coordinate system is defined and the time fields of the attribute data are converted into a date format (yyyy-mm-dd hh:mm:ss) and then a numeric format, which can be recognized by a geodatabase (GDB) and R. Finally, data analysis is performed in three steps, namely, visualizing the log data in 3D with STP, locating staying places using DBSCAN, and checking the clusters with 3D DBSCAN considering the time dimension.
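The second processing step (removing records outside the spatial range or with errors) can be illustrated with a minimal Python sketch. This is not the authors' workflow, which uses GPS Visualizer, shapefiles, and R; the column names, the bounding box, and the sample rows below are hypothetical values chosen only for illustration.

```python
import csv
import io
from datetime import datetime

# Hypothetical bounding box (lon_min, lat_min, lon_max, lat_max); not taken from the paper.
BBOX = (129.24, 36.00, 129.26, 36.01)


def preprocess(csv_text, bbox=BBOX):
    """Drop rows that cannot be parsed or that fall outside bbox; sort the rest by time."""
    lon_min, lat_min, lon_max, lat_max = bbox
    cleaned = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            lon = float(row["longitude"])
            lat = float(row["latitude"])
            t = datetime.strptime(row["time"], "%Y-%m-%d %H:%M:%S")
        except (KeyError, TypeError, ValueError):
            continue  # treat missing or malformed fields as erroneous records
        if lon_min <= lon <= lon_max and lat_min <= lat <= lat_max:
            cleaned.append({"time": t, "longitude": lon, "latitude": lat})
    cleaned.sort(key=lambda r: r["time"])
    return cleaned


# Made-up sample: two valid rows, one out-of-range row, one malformed row.
SAMPLE = """time,longitude,latitude
2018-05-12 10:00:01,129.2501,36.0051
2018-05-12 10:00:00,129.2500,36.0050
2018-05-12 10:00:02,0.0,0.0
2018-05-12 10:00:03,not_a_number,36.0052
"""

rows = preprocess(SAMPLE)
print(len(rows), "records kept")
```

In this sketch a record counts as erroneous only if a field is missing or unparsable; a real log would likely also need checks on the error fields recorded by the GPS device.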

Spatiotemporal visualization of movement logs using STP
To visualize the log data, the STP technique is applied. This technique is based on the concept that time can be measured in the same way as distance. (20,21) A person's movement path can be shown on a 2D plane; in the STP concept, a time axis is added, and the position of the movement path at each time interval is shown in a 3D space to create an STP. (22) When convergence occurs at a particular point on the STP, that point is called a station, which is a place where activities are focused over a fixed time period, such as a school, a house, or a restaurant. The open-source program R 3.5.0 is used to create STPs because R has the scatterplot3d and rgl packages, which can create STPs in addition to simple 3D point mapping. Longitude, latitude, and time data are entered as the x, y, and z coordinates, respectively. After the scatterplot3d() and plot3d() functions are run, the points are drawn as a continuous plot in STP form. The source code used in this study is shown in Table 2.
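The way longitude, latitude, and time are arranged into an STP, and how a station appears as a segment running parallel to the time axis, can be sketched without any plotting library. The Python sketch below is an illustration only and is not the R code of Table 2; the function names and the speed threshold `max_speed` are hypothetical.

```python
import math


def space_time_path(records):
    """Order (x, y, t) records by time so that consecutive vertices form the STP."""
    return sorted(records, key=lambda r: r[2])


def near_vertical_segments(path, max_speed=1e-6):
    """Flag each STP segment whose horizontal speed is at most max_speed.

    Such segments run (almost) parallel to the time axis, i.e. the position
    barely changes while time advances. max_speed is a hypothetical threshold
    in coordinate units per time unit.
    """
    flags = []
    for (x1, y1, t1), (x2, y2, t2) in zip(path, path[1:]):
        dt = t2 - t1
        horizontal = math.hypot(x2 - x1, y2 - y1)
        flags.append(dt > 0 and horizontal / dt <= max_speed)
    return flags


# Made-up records: (longitude, latitude, seconds since start), deliberately unordered.
records = [
    (129.2510, 36.0050, 120.0),
    (129.2500, 36.0050, 0.0),
    (129.2500, 36.0050, 60.0),
]
path = space_time_path(records)
flags = near_vertical_segments(path)
print(flags)
```

The first segment stays at the same position for 60 s and is flagged as near-vertical; the second segment moves in longitude and is not.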

Methodology for identifying staying places
To locate staying places, it is necessary to distinguish movement from stopping in the GPS point data. When movement-stopping patterns are visualized as STPs, parts parallel to the time axis are assigned as "stopping," and spatial clusters derived by the cluster analysis of the point data are classified as staying places. To distinguish movement from stopping and to find staying places, many previous studies have used clustering methods. Typical algorithms for finding staying places include the k-means algorithm, (11) the fuzzy clustering algorithm, (10) and grid-based clustering methodologies. (10,13,14) The k-means and fuzzy clustering algorithms have been used many times, but they have the drawback of requiring the number of clusters to be set beforehand, following the assumption that each point must belong to a specific cluster. Grid-based clustering methodologies are suitable for finding clusters considering the spatial dimension, but they are not suitable for detecting POIs and movement paths in finer detail. In this study, we use DBSCAN.
Table 2. 3D visualization source code.

[Definition 2] Point classes:
A point $p \in D$ (a domain) is classified as follows, where $N_\varepsilon(p)$ denotes the $\varepsilon$-neighborhood of $p$: (a) a core point if $N_\varepsilon(p)$ has high density, that is, $|N_\varepsilon(p)| \ge \mathit{minPts}$, where $\mathit{minPts} \in \mathbb{Z}^+$ is a user-specified density threshold; (b) a border point if $p$ is not itself a core point but is in the neighborhood of a core point $q \in D$, i.e., $p \in N_\varepsilon(q)$; or (c) a noise point otherwise.
[Definition 3] Directly density-reachable points:
A point $q \in D$ is directly density-reachable from a point $p \in D$ with respect to $\varepsilon$ and $\mathit{minPts}$ if and only if $|N_\varepsilon(p)| \ge \mathit{minPts}$ and $q \in N_\varepsilon(p)$.
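The point classes of Definition 2 can be made concrete with a short Python sketch. It assumes Euclidean distance and counts a point as a member of its own ε-neighborhood; the function names and the sample coordinates are illustrative and are not taken from the paper's Table 3.

```python
import math


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], the point itself included."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def classify_points(points, eps, min_pts):
    """Label every point as 'core', 'border', or 'noise' (Definition 2)."""
    neighbors = [region_query(points, i, eps) for i in range(len(points))]
    core = {i for i, nb in enumerate(neighbors) if len(nb) >= min_pts}
    labels = []
    for i, nb in enumerate(neighbors):
        if i in core:
            labels.append("core")      # (a) dense neighborhood
        elif any(j in core for j in nb):
            labels.append("border")    # (b) not core, but near a core point
        else:
            labels.append("noise")     # (c) everything else
    return labels


# Four points on a unit square, one point just outside it, and one far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2.2, 1), (10, 10)]
labels = classify_points(points, eps=1.5, min_pts=4)
print(labels)
```

Under Definition 3, every point in the ε-neighborhood of a core point is directly density-reachable from it; in this example the border point (2.2, 1) is directly density-reachable from the core point (1, 1) but not the other way round.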
DBSCAN's basic functions are shown in Table 3. Euclidean distance is generally used for the neighborhood calculations in these functions. To apply DBSCAN, it is necessary to determine the neighborhood radius ε and the density threshold minPts (i.e., the minimum number of points). An appropriate ε is found heuristically: the points' kNN distances (i.e., the distance of each point to its kth nearest neighbor) are plotted in descending order, and the knee of the plot, where the curve bends sharply, is taken as the candidate ε.
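The kNN-distance heuristic can be sketched as follows in Python. The sorted curve is what one would normally inspect by eye; the `knee_index` function, which simply picks the largest drop between consecutive values, is a crude stand-in for that visual judgment and is an assumption of this sketch, not part of DBSCAN or of the paper.

```python
import math


def knn_distances(points, k):
    """Distance from each point to its k-th nearest neighbor (itself excluded)."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return result


def sorted_k_dist(points, k):
    """k-distances in descending order, i.e. the curve that is plotted."""
    return sorted(knn_distances(points, k), reverse=True)


def knee_index(curve):
    """Hypothetical knee estimate: position of the largest drop in the curve."""
    drops = [curve[i] - curve[i + 1] for i in range(len(curve) - 1)]
    return max(range(len(drops)), key=drops.__getitem__)


# Five tightly grouped points and two isolated ones (made-up coordinates).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10), (20, 0)]
curve = sorted_k_dist(points, k=2)
i = knee_index(curve)
eps_candidate = curve[i + 1]  # first value after the largest drop
print(curve, eps_candidate)
```

Here the two isolated points produce the large values at the start of the curve, and the value just after the drop (1.0) is a plausible ε for the dense group. On real logs the curve is usually smoother, so the candidate should be checked against the plot.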
A staying place in this study is simply a place where students spend a long, continuous period of time during the fieldwork. Staying places may or may not relate to observation or inquiry activities. If the staying place is an observation or inquiry activity location, the amount of time spent there can be interpreted as indicating the level of problem difficulty. Otherwise, it can be interpreted as a place where some element hinders inquiry activity. Therefore, it is necessary to analyze the time spent at staying places.
In this study, the first GPS log among the multiple GPS points in a staying cluster is defined as $g_s$ and the last log is defined as $g_e$. The time difference between $g_s$ and $g_e$ is the staying time. Time information at staying places is illustrated in Fig. 3. We find that the shortest stays during observation activities last 2-3 min and the longest last 5-8 min.
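The staying-time definition (time of the last log minus time of the first log in a cluster) can be written as a short Python sketch. The cluster labels are assumed to come from a prior clustering step, with 0 denoting noise; the timestamps and labels below are made up.

```python
from datetime import datetime


def staying_times(timestamps, labels, noise_label=0):
    """Staying time per cluster: time of last log (g_e) minus time of first log (g_s)."""
    first, last = {}, {}
    for t, c in zip(timestamps, labels):
        if c == noise_label:
            continue
        first[c] = min(first.get(c, t), t)
        last[c] = max(last.get(c, t), t)
    return {c: last[c] - first[c] for c in first}


def ts(text):
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


# Cluster 1 is one continuous stay; cluster 2 is the same place seen at two separate times.
timestamps = [ts("2018-05-12 10:00:00"), ts("2018-05-12 10:03:00"),
              ts("2018-05-12 10:04:00"),
              ts("2018-05-12 10:05:00"), ts("2018-05-12 10:20:00")]
labels = [1, 1, 0, 2, 2]

result = staying_times(timestamps, labels)
for cluster, duration in sorted(result.items()):
    print(cluster, duration)
```

Cluster 2 is reported as a 15 min stay even though only two widely separated logs fall in it, which is the kind of overestimation that arises when a place is visited more than once.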
However, there are also cases where more than 15 min, or 10% of the overall investigation time, is measured as the staying time. For example, when one place is visited several times with movement and staying occurring repeatedly, the temporal range of the cluster increases. As shown in Fig. 4, this place is a junction point on a path through which students moved to other places rather than a place where they stayed continuously. As the time interval between passes through this point becomes larger, the staying time becomes overestimated. In cases like this, it is necessary to apply the DBSCAN method while also considering the time dimension.
To solve the above problem, we propose a 3D DBSCAN methodology, in which the distance d between points p and q is calculated in three dimensions as shown in Eq. (5). Note that the x, y, and z values represent longitude, latitude, and time, respectively. The x, y, and z values are incorporated in combination with the plot3d() function to identify clusters while considering time (Table 4). In this study, the x, y, and z data are converted and analyzed as follows. The x and y data are converted into meters by transforming the geographic coordinate system of latitude and longitude to a projected coordinate system (EPSG: 3857, WGS_1984_Web_Mercator_Auxiliary_Sphere). The z data are converted from date type (yyyy-mm-dd hh:mm:ss) to numeric type (5 digits and decimal places). The conversion principle is that the numeric value increases by 1 as a day passes.
3D DBSCAN: distance between two points $p(x_1, y_1, z_1)$ and $q(x_2, y_2, z_2)$:

$$d(p, q) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2} \tag{5}$$
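The coordinate and time conversions, followed by a three-dimensional Euclidean distance, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code of Table 4: the projection uses the standard spherical Web Mercator formulas, and the epoch for the day number (1899-12-30) is assumed here only because it yields five-digit values for recent dates; the source does not state which epoch was used.

```python
import math
from datetime import datetime

EARTH_RADIUS_M = 6378137.0        # sphere radius used by Web Mercator (EPSG:3857)
EPOCH = datetime(1899, 12, 30)    # assumed epoch for the day number; not given in the source


def to_web_mercator(lon_deg, lat_deg):
    """Project longitude/latitude in degrees to Web Mercator x, y in meters."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


def to_day_number(text):
    """Convert 'yyyy-mm-dd hh:mm:ss' to a number that grows by 1 per day."""
    t = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    return (t - EPOCH).total_seconds() / 86400.0


def distance_3d(p, q):
    """Euclidean distance between p = (x1, y1, z1) and q = (x2, y2, z2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Made-up log record converted to an (x, y, z) point.
x, y = to_web_mercator(129.25, 36.005)
z = to_day_number("2018-05-12 10:00:00")
print(x, y, z)
print(distance_3d((0.0, 0.0, 0.0), (3.0, 4.0, 12.0)))
```

Because x and y are in meters and z is in days, the influence of time on d depends on these units. The source does not describe any additional scaling of the time axis, so none is applied in this sketch.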

Movement log visualization
Log data were created during approximately 150 min of fieldwork by eight teams with two or three students each. The STP visualization shows not only the movement distance, range, and direction, but also the staying places by creating straight lines in the z or time direction when the staying time at a given point is sufficient (Fig. 5). After observation or inquiry activities, students were asked to enter their answers, so most of the places they stayed match the locations where they entered their answers. That said, staying places with and without answers are distinguished. Staying places with answers can be defined as places of observation. Staying places without answers are places where other activities occur, and additional analysis is needed to identify why students stayed at those places. Figure 6 shows a generalization of the movement characteristics discovered through the visualization.
Table 4. Source code for 3D DBSCAN.

Finding staying places with DBSCAN
The most important element in using the DBSCAN method is setting ε (the search radius) and minPts (the minimum number of points within that radius) so as to identify clusters. In this study, we extract 20% of all data at random for use as a test data set. The optimum values are found by cross-testing the ε and minPts values on the test data and using the kNNdistplot() function. We first set an arbitrary minPts value and then apply kNNdistplot(). On average, about 50 points of log data are collected per minute; about 70 points per minute are collected at moving intervals, and about 40 to 60 points per minute are collected at staying intervals. Therefore, after setting minPts in the range of 40 to 60, we use kNN distances to find appropriate ε values. Based on cross-testing and kNNdistplot() on the test data, the critical value is 0.00005 (about a 10 m radius) for all of these minPts values. Table 5 shows the number of clusters based on different combinations of the two parameters. The parameter conditions of ε = 0.00005 and minPts = 40 identify the largest number of clusters that are closest to actual staying places (Fig. 8). Figure 9 shows the movement of each team and the clusters that represent staying places. The staying places for each team are shown in Table 6. In total, 100 clusters are identified, and the number of clusters for each team is at least 4 and at most 27. There are also 9652 noise points.
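The parameter search described above (draw a 20% random sample, then count clusters for each combination of ε and minPts) can be sketched in Python. The DBSCAN implementation below is a plain textbook version written for this illustration, not the R function used in the study, and the synthetic points, seed, and parameter values are made up; noise is labeled 0.

```python
import math
import random


def dbscan(points, eps, min_pts):
    """Plain DBSCAN; returns one label per point (0 = noise, clusters from 1)."""
    n = len(points)
    neighbors = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [0] * n
    cluster = 0
    for i in range(n):
        if labels[i] != 0 or len(neighbors[i]) < min_pts:
            continue  # already assigned, or not a core point
        cluster += 1
        labels[i] = cluster
        stack = list(neighbors[i])
        while stack:
            j = stack.pop()
            if labels[j] != 0:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:  # only core points expand the cluster
                stack.extend(neighbors[j])
    return labels


def count_clusters(labels):
    return len(set(labels) - {0})


def parameter_grid(points, eps_values, min_pts_values):
    """Number of clusters for every (eps, min_pts) combination."""
    return {(eps, m): count_clusters(dbscan(points, eps, m))
            for eps in eps_values for m in min_pts_values}


# Synthetic stand-in for a log: two dense groups and a sparse path between them.
blob_a = [(0.1 * i, 0.1 * j) for i in range(5) for j in range(10)]              # 50 points
blob_b = [(100 + 0.1 * i, 100 + 0.1 * j) for i in range(5) for j in range(6)]   # 30 points
path = [(10.0 + 4 * i, 10.0 + 4 * i) for i in range(20)]                        # 20 points
points = blob_a + blob_b + path

sample = random.Random(42).sample(points, len(points) // 5)  # 20% test set
grid = parameter_grid(sample, eps_values=[0.5, 1.0], min_pts_values=[4, 8, 12])
for (eps, m), n_clusters in sorted(grid.items()):
    print(f"eps={eps} minPts={m} clusters={n_clusters}")
```

Note that sampling reduces the point density, so a minPts value tuned on a 20% sample is not directly comparable with one applied to the full log; the source does not say how this was handled.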

Analysis of staying places using 3D DBSCAN
GPS data were analyzed using the 3D DBSCAN method to determine whether or not the clusters generated using 2D information remain the same clusters when time is considered. Table 7 shows the numbers of clusters determined by the DBSCAN and 3D DBSCAN methods. The clusters obtained using 3D DBSCAN are counted in two ways: as viewed in a 2D plan and as viewed in a 3D cube. We found that the numbers of clusters extracted are smaller using the 3D DBSCAN method than using the DBSCAN method. Some clusters derived in two dimensions are no longer clusters, or else are subdivided, when the time dimension is accounted for using 3D DBSCAN.

| | T1 | T2 | T3 | T4 | T5 | T6 | T7 | T8 | Total |
|---|---|---|---|---|---|---|---|---|---|
| GPS points | 6525 | 3549 | 1784 | 1735 | 6900 | 2948 | 5976 | 1390 | 30807 |
| Number of clusters | 27 | 9 | 8 | 9 | 19 | 10 | 14 | 4 | 100 |
| Noise points | 1636 | 1216 | 360 | 350 | 1617 | 2948 | 1127 | 398 | 9652 |

In the case of Team 2, nine clusters are extracted by DBSCAN. However, when the clusters are analyzed using the 3D DBSCAN method, only four clusters remain in the 2D plan view and the other five clusters have disappeared. Moreover, the remaining four clusters are further subdivided into nine clusters when the time dimension is taken into consideration. Table 8 shows the cluster analysis results for Team 2 in detail. Five of the nine clusters extracted by DBSCAN are removed by 3D DBSCAN, namely, the 1st, 3rd, 4th, 6th, and 9th clusters. The 2nd and 8th clusters remain the same whether applying the DBSCAN or 3D DBSCAN method. However, the 5th and 7th clusters are subdivided into several clusters. For the 5th cluster, the 18 min spent as found by DBSCAN is subdivided into 10, 2, and 2 min spent in separate clusters, reducing the total staying time in the 5th cluster from 18 to 14 min when the 3D DBSCAN method is applied. For the 7th cluster, the 31 min spent according to DBSCAN is subdivided into 4, 10, 2, and 10 min, for a total in-cluster time of 26 min with 3D DBSCAN.
Figure 10 shows the clusters of Team 2 in (a) 2D and (b) 3D spaces. Figure 10(a) shows nine clusters according to DBSCAN results, while Fig. 10(b) shows four clusters subdivided into nine clusters according to 3D DBSCAN results. These results indicate that the 3D DBSCAN method is more accurate than the DBSCAN method in the cluster analysis of spatiotemporal data.
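The effect reported for Team 2 (a 2D cluster splitting into several clusters once time is included) can be reproduced on a toy track with a short Python sketch. The DBSCAN function is a plain textbook version written for this illustration, and the track, ε, and minPts values are made up; the coordinates and the time values are assumed to be in comparable, arbitrary units.

```python
import math


def dbscan(points, eps, min_pts):
    """Plain DBSCAN on points of any dimension (0 = noise, clusters from 1)."""
    n = len(points)
    neighbors = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [0] * n
    cluster = 0
    for i in range(n):
        if labels[i] != 0 or len(neighbors[i]) < min_pts:
            continue
        cluster += 1
        labels[i] = cluster
        stack = list(neighbors[i])
        while stack:
            j = stack.pop()
            if labels[j] != 0:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                stack.extend(neighbors[j])
    return labels


def time_spans(times, labels):
    """Per-cluster time span (last time minus first time), noise excluded."""
    spans = {}
    for t, c in zip(times, labels):
        if c != 0:
            lo, hi = spans.get(c, (t, t))
            spans[c] = (min(lo, t), max(hi, t))
    return {c: hi - lo for c, (lo, hi) in spans.items()}


# Toy track (x, y, t): the same junction at (0, 0) is visited twice.
visit_1 = [(0.0, 0.0, 0.1 * i) for i in range(10)]
moving = [(10.0, 0.0, 10.0), (20.0, 0.0, 20.0), (10.0, 0.0, 40.0)]
visit_2 = [(0.0, 0.0, 50.0 + 0.1 * i) for i in range(10)]
track = visit_1 + moving + visit_2
times = [t for _, _, t in track]

labels_2d = dbscan([(x, y) for x, y, _ in track], eps=1.0, min_pts=5)
labels_3d = dbscan(track, eps=1.0, min_pts=5)

print("2D:", time_spans(times, labels_2d))
print("3D:", time_spans(times, labels_3d))
```

With 2D coordinates only, both visits merge into one cluster whose time span covers the whole interval between them. With the time coordinate included, the two visits form separate clusters with short spans, and the moving points remain noise in both cases.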

Conclusion
In this study, we visualized and analyzed the characteristics of moving and staying patterns with spatiotemporal log data. As a case study, we collected and processed GPS log data generated by students who participated in inquiry-based fieldwork. The results of this study are as follows. First, we identified the spatiotemporal characteristics of the movement logs and staying patterns through STP visualization. Second, we used DBSCAN to investigate the spatial ranges of the field activities and the time spent on them, and thereby identified staying places, namely, clusters. Third, when we investigated the time spent in each cluster more closely, we found that some clusters derived by DBSCAN were not true clusters and that the time spent in other clusters was overestimated. To resolve this, 3D DBSCAN was applied to find precise clusters.
The results of this study showed that the 3D DBSCAN method is more effective than the DBSCAN method in the staying place analysis of spatiotemporal data. However, one limitation is that data analysis was carried out in the limited context of inquiry-based fieldwork activities over a 150 min period. In the future, the 3D DBSCAN methodology proposed in this study can be applied in broader contexts of movement data analysis, such as tourist travel patterns through SNS, trajectories of cars, vessels, or wildlife, and the movement of visitors in parks.