Decision-tree-based Mapping of Erosion-prone Areas in Hilly Regions of Kangwon Province, North Korea

After the division of the Korean peninsula, North Korea overexploited their natural resources especially the forest. It lost about 23% of the total forest from 1990 to 2011, which continues today. However, the country is inaccessible to monitor such changes. Hence, in this study, we aim to use Landsat 8 imagery with the aid of Google Earth to map erosion-prone areas in a subset area of Kangwon Province, North Korea. Pruned Decision Tree (DT) modeling was used in selecting the optimum ratio/index and threshold based on ground truth points extracted for Landsat scenes from May, October, and both months combined. Pruned DT resulted in applying the normalized green, near-infrared (NIR), green ratio vegetation index (GRVI), red-green ratio index (RGRI), infrared percentage vegetation index (IPVI), and slope with the optimum threshold for the segmentation of the study area with reasonable accuracy. The result shows that combining the ground truths from different seasons resulted in rules giving higher overall accuracy (OA) and kappa coefficient than the individual rule results. However, interchanging ground truths of different months is not effective. On average, out of the total land, high and medium erosion-prone areas are 15 and 20%, respectively. The remaining 65% is covered by forest. The result can be useful for estimating loss and restoring resources such as forest and land in the future.


Introduction
Following the division of the Korean peninsula soon after World War II, North Korea remained isolated to the rest of the world. Natural resources, such as forests, have been exploited extensively in many parts of North Korea owing to a limited supply of energy. According to the United Nations Environment Programme, it was reported that in 2003, forest covered about 73% of the total area of North Korea, 70% of which is on slopes greater than 20 degrees. (1) Between 1990 and 2000, it lost 127 thousand hectares of forest with an annual average change rate of −1.7%. Moreover, between 2000 and 2010, it again lost 127 thousand hectares of forest with an annual average change rate of −2%. (2) In total, between 1990 and 2011, North Korea lost 23.12% of its forest cover, which is the highest among countries in East Asia. (2,3) The accelerated deforestation was due to the economic crisis in the 1990s; people turned to the forest to provide firewood and food. Also, the government control over croplands has made people convert cultivable lands in remote hilly forest areas secretly. These activities have put a lot of pressure on hilly areas, thus leading to soil depletion and erosion. As the nation cannot be freely accessed, field work to investigate the position, type, and activity of erosion to develop an inventory and continuous monitoring system is impossible. However, satellitebased remote sensing technology can make this possible. In this study, we assessed the impact of deforestation on hillslope erosion using remote sensing technologies.
Remote sensing is a powerful and cost-effective technology that allows us to collect data and assess the spatial and temporal dynamics of Earth's surface processes and hazards. (4)(5)(6)(7)(8) Remotely sensed data from legacy satellites such as Landsat have been continuously obtained for the past forty years. With such a large collected redundant database, it is possible to monitor deforestation and thus erosion-prone areas in remote and inaccessible areas. For mid-resolution satellite imagery, two approaches have been widely used to map the spatial distribution and characteristics of vegetation cover. The first is by thresholding remote-sensing-based vegetation indexes, and the second is the classification of multiband satellite imagery either by using clustering techniques (i.e., ISODATA unsupervised classification) or using training data (i.e., supervised classification). Although classification techniques have been widely used, they are unable to account for the image variation in land cover, cloud cover in different seasons, and weather conditions. Moreover, the collection of data for each time and condition is costly and time-consuming.
To address these drawbacks, new sets of input variables were derived for robust mapping. Owing to their simplicity, remote-sensing-image-based indices, such as NDVI, have widely been used as an effective approach to map certain land cover types. (7,(9)(10)(11)(12) Despite being simple and fast, a major issue of this approach is the requirement of a threshold that could be used to segment target and nontarget classes. Depending on the ratio of target to nontarget classes in the study area, neither standard nor automatic thresholds such as Otsu's are always successful. Thus, choosing the exact optimal threshold is also a challenging task depending on the large amount of ground truth and expert knowledge. To assign an optimum threshold for a given set of attributes, the Decision Tree (DT) is better suited. DT establishes a rule for the determination of a target and exhibits a high accuracy across many environments, allocating more homogenous datasets based on binary splits. (13) Binary splitting nodes are based on conditions of explanatory variables that can be easily understood and implemented in Geographic Information System (GIS). (14) Despite few studies on land cover and forest monitoring in the past, (15)(16)(17) not much studies have efficacy in the assessment of erosion-prone areas using remote sensing and DT modeling approaches for highly rugged steep mountainous landscapes. Moreover, studies that consider the remote sensing index in North Korea are also rare. Hence, the aim of this study is to map erosion-prone areas over an area of Kangwon Province, North Korea, using Landsat Imagery and DT. Satellite-based digital elevation model (DEM) and imagery were selected, indices were derived along with slope, ground truth points were carefully chosen for optimum threshold and band using DT, and accuracy assessment was carried out. The overall flow of the study is shown in Fig. 1.

Study area
A rectangular area from Kangwon Province, North Korea, has been selected for this study (Fig. 2). The area is geographically bounded between 38°18'41.45"N to 38°57'59.38"N and 126°47'49.73"E to 128° 0'14.97"E and is 7520 sq. km. It has a rugged topography and a dendritic drainage pattern. The elevation ranges from 0 to 1510 m. Most of the area is forested with deforested patches, agricultural lands, and forest roads with few urban area. It also has dammed reservoirs and coastal areas covering ~2.45% of its total area. It was selected for two reasons; first, it is densely forested with extensively eroded landscapes that can be seen in Google Earth imagery, and it has very similar climate and topography to Gangwon Province of South Korea, so that a field observation of an area in South Korea adjacent to the study area can be carried out to better understand the geologic and geomorphic environments.

Data
Three types of data, namely, Shuttle Radar Topography Mission (SRTM) Global DEM, Landsat Operational Land Imager (OLI) imagery, and high-resolution images available from Google Earth (GE) Pro TM (Google Inc., Menlo Park, CA, USA), were used. The selection was carried out carefully considering cloud cover, phenology, and dryness of the ground that covers most parts of Kangwon Province, North Korea. Landsat data from early May and October were selected because the study area in these images was observed with the lowest vegetation cover, thereby providing a relatively unobstructed view of the surface. Both Landsat and DEM were resampled to a 30-m-resolution subset scene of 3541 rows and 2360 columns after preprocessing the data.
For the extraction of elevation for the scene, the SRTM Global DEM of 30 m (18) was obtained from the OpenTopography portal. (19) Its version 3.0 product is void-filled and the most complete high-resolution digital topographic database of Earth. Their evaluation showed good representation with other data in hilly (20) and coastal (21) regions of Korea. The data was used to provide the ground elevation and slope of the study area.
Two Landsat scenes from path 116 and row 33 were obtained from the United States Geological Survey (USGS) Global Visualization Viewer (GLOVIS) portal. (23) Table 1 shows the metadata of the Landsat scene used. The Level 1 terrain-corrected multiband image contained coastal blue, blue, green, red, near-infrared (NIR), shortwave infrared 1 (SWIR1), and shortwave infrared 2 (SWIR2) bands at a resolution of 30 m. The digitally numbered images were first converted to top-of-atmosphere radiance and then reflectance using the Fast Line-of-sight Atmospheric Analysis of Hypercubes (FLAASH) atmospheric correction tool in Environment for Visualizing Images (ENVI) version 5.3 (Exelis Visual Information Solutions, Boulder, CO, USA). The required coefficients and values were obtained from the Landsat response function, metadata file, and SRTM DEM. Google Earth has been widely used in various studies, (5,(24)(25)(26)(27)(28) even in North Korea, (17) as a reference and validation data. Because access to the area is limited, this study used Google Earth imagery for training and validation purposes in this study. A total of 350 points were randomly sampled in the study area for forest cover, hill, and open land surface (Fig. 2).

Remote sensing indices
Various vegetation indices based on remote sensing products or technologies were used to characterize vegetation cover (Table 2). All these indices were taken from the Harris Geospatial Docs Centre website (41) where more details on them and their applications can be found.

DT
DT is a hierarchical model that computes the probability of occurrence in various situations based on known information (Fig. 3). It can also be a rule-based structure giving an answer for what happens if something in the model is modified. The main advantage of the method is that it is a graphical representation of the overall model that could be easily constructed and is an intuitive application of probability analysis. However, it does not allow multiple outputs and is limited to reducing the noise in data. (42) There are various available algorithms such as Classification and Regression Tree, (43) Chi-square Automatic Interaction Detector Decision Tree, (44) ID3, J48, (45) and C4.5. (46) The classification of a new item in the algorithm first requires a DT based on the attribute values of the available training data. On the basis of the available set of items in the training data, it identifies the attribute that classifies the various instances most clearly. The feature that tells us most about the data instances, which could lead to the best classification, is said to have the highest information gain. On the basis of the possible values of the feature, branches are terminated, and a target value is assigned. In other cases, the algorithm searches for other attributes that provide the highest information gain. The process continues until a clear decision regarding the combination of attributes that gives us a rule for the determination of a target value is achieved. With the help of this DT, all the respective attributes and their values undergo checking, thereby assigning or predicting the target values of all new instances. If DT becomes complicated, the tree structure and rules might be difficult to interpret and may result in a low accuracy. A simple technique of pruning is not to continue splitting if the nodes receive a very small or minimum number of instances per leaf. Building a full tree and pruning back are more reasonable than trying to forward pruning as a statistical test can also be applied to each stage. The test considers a confidence factor of 0.25 by default, where a small value incurs more pruning. It simplifies the results and gives manageable trees in addition to overfitting the training data without affecting the performance considerably. (47) In this study, pruned DT was used to generalize data from the root, i.e., the best attribute to split the first decision. We used a minimum object of 10% of ground truth as the pruning  (45) was used for this study.

Accuracy assessment
To prevent overfitting and biased assessment, a k-fold cross-validation method has widely been used. For given k folds, the method divides whole data into k parts, one piece, and trains the rest. However, there are many ways to divide the data into k-folds. The stratified n-fold cross-validation is preferred to ensure that each fold has the right portion of each class value, which reduces the variance in the estimate. In this study, we used the stratified 10-fold crossvalidation implanted in WEKA.
An accuracy assessment was carried out using a confusion matrix. (48) Table 3 shows a typical layout of the matrix, with the columns representing the classified data and the rows representing the ground truth values, although both can be interchanged. In the table, the diagonal elements are the pixels of agreements, whereas the off-diagonal elements are disagreements. From the confusion matrix, overall accuracy (OA) was calculated by dividing the sum of the entries along the diagonal by the total number of validation samples. OA represents the current predictions and ranges from 0 to 1, where 1 is perfect. However, it does not consider the agreements between datasets, which are due to chance alone. Thus, the kappa coefficient (kappa) of the agreement was also derived from the confusion matrix. Kappa is a measure of agreement based on the difference between the actual agreement in the error matrix and the chance agreement. (48) Usually, kappa can range from −1 to +1, where 0 represents the degree of agreement that can be expected from random chance and 1 represents a perfect agreement between the raters. In very rare cases, kappa can show a negative value, which signifies that there is no effective agreement between the two raters. (49) OA and kappa are calculated using Eqs. (1) and (2), respectively. Aside from OA, producer's accuracy (PA) and user's accuracy (UA) were also calculated on the basis of the confusion matrix.

Number of correct predictions Overall accuracy
Total number of predictions = Observed accuracy Chance agreement Kappa coefficient = 1 Chance agreement − − (2)

Erosion-prone area
In this study, three conditions were considered to identify erosion-prone areas. First, erosion occurs in open land; second, erosion is accelerated in hilly areas; third, erosion is controlled by forest/vegetation cover. Hence, three separate maps were prepared and overlaid as per their relationship to erosion. To better understand the erosion hazard, open space was given a value of 1, hill as 1, and nonforest areas as 1, such that their sum will give the hazard, as shown in Table  4. Pixels fulfilling all the conditions were high erosion-prone areas and with two conditions fulfilled as medium and only one or none as low.

Results and Discussion
After preprocessing the data, different band ratios and indices were derived from Landsat 8 images as per Table 2. A similar slope in degree was also derived from DEM. For a better analysis of the erosion-prone areas, water bodies were masked from the study area using NDWI. For the ground truth points, the corresponding bands, indices, and slopes were extracted for both 20160519 and 20161010 scenes. In addition to the two extracted data, an additional dataset combining both May and October extracts was also prepared to understand the generalized ability of derivative bands in all-year-round scenes. All these comma-separated values were converted to the attribute-relation file format (ARFF) for input in WEKA.
With the above-mentioned ARFF points, first, the ground truth data were evaluated for their baseline accuracy by stratified 10-fold cross-validation in WEKA. The baseline accuracy was tested using the ZeroR rule, which predicts the basic model for a class, i.e., if the data were all mode class, what could be the OA? This gives the idea of how the mode class is in the input data. From Table 5, it can be understood that out of the total ground truth data, 78.86% (276) are hills, 60.86% (213) are nonopen land and, 58% (253) are forest class.
For each class, open land, hill, and forest, first, J48 pruned trees with a minimum of 10% ground truth data for two objects were built for three conditions i.e., May, October, and the two months combined, as shown in Table 6. The table shows good OA of around 90% or more with relatively large numbers of leaves and trees. However, these trees are difficult to interpret. For example, Fig. 4 shows the J48 pruned DT for a forest in the 20161010 scene using the ground truth extracted from the scene itself. Hence, in this work, we focused on the use of binary left pruned DT for easy interpretation and application in mapping erosion-prone areas in the inaccessible area of North Korea. This helps in the regular short-interval monitoring of land covers such as surface water, open land, and forest in areas with a high rate of change. As a result, from a pruned tree for 10% ground truth data, Table 6 shows the application of one band for each class segmentation. On the basis of Table 6, three different maps were derived for both the 20160519 and 20161010 scenes. Figures 5 and 6 respectively show the results of open land, hill, and forest cover for 20160519 and 20161010 scenes derived from the optimum band and threshold in Table 6. Variations in optimum ratio/index and threshold for the different classes are seen in the table. For the 20160519 scene, RGRI above 0.7822 was found to be the optimum band and threshold to segment open land from the rest. It was interesting that NG was able to segment both the hill and forest classes with different thresholds. As the May Table 6 Pruned DT for open land, hill, and forest covers using ground truth data from May, October, and the two months combined.
May   Table 6.  Table 6. Only the newly grown leaves of the forest absorbed blue and red lights but reflected NIR and green lights. Thus, there is a sharp increase in reflectance between the red and NIR regions, which is known as the red edge and is used in plant stress detection. (50) This process verifies the adoption of RGRI and NG for the 20160519 scene, while in the case of the 20161010 scene, the forest loses its chlorophyll; thus, deciduous plants reflect more NIR than in an earlier stage. This verifies the use of IPVI in segmenting forest. The slope above 6.9263 degrees is selected as optimum to segment hill and flatlands. As the autumn leaves are colorful, GRVI, i.e., red/ green ratio, is optimum for open land segregation. Moreover, the GNDVI above 0.6897 was able to produce a similar result, which used NIR and green bands for the index.
To further understand the effect of these optimum indexes, we combined both ground truths for the selection of the optimum ratio/index for segregation. Using a minimum of 70 objects as pruning conditions, we found that NNIR alone is sufficient for segregating open land from forest cover, while the slope is optimum for hilly areas. An NNIR value below and above the threshold segregates open land and forest; in other words, they are complementary in terms of cover. The study areas can be considered complementary in terms of forest or open land as the area is rural with little build-up and some surface water.
For accuracy assessment, on the basis of the confusion matrix, UA, PA, OA, and kappa were derived for each threshold in Table 6 for both scenes. Table 7 shows the results of accuracy assessment based on ground truth points for open land, hill, and forest cover for 20160519 and 20161010 scenes using the stratified 10-fold cross-validation. For better understanding, we also plotted in Fig. 7 the densities of each target and nontarget based on the ground truth for the optimum band in Table 6.
In Table 7, most of the measures are around 90% except for hill. As the terrain is not known well and SRTM also has only a 30 m resolution, a misjudgment might occur while labeling the ground truth. In Fig. 7, it is clearly visible that hill and nonhill ground truths have relatively large overlapping values in terms of NG and slope. UA for the nonforest cover and PA for Hence, using a multiple-season ground truth could be a better option than selecting only one for land cover mapping. However, the improvement was clear, that is, without actual knowledge of the study area and field works, it is difficult to avoid the error, as shown in Fig. 7. Most of the target and nontarget classes were overlapping. This suggests that a careful selection of the ground truth should be carried out to ensure results with higher accuracy even for inaccessible areas.
Finally, erosion-prone area maps were derived from binary maps, i.e., Figs. 5 and 6 of open land, hill, and forest derived from Table 6. The erosion-prone areas were indexed as low, medium, and high hazards, as per the rule in Table 4. The resulting maps of the erosionprone areas are shown in Fig. 8 with the corresponding areal coverage being given in Table 8. Except for the 20161010 scene with the May ground truth, all others show consistent erosion-  Table 6.  Table 8 Areal distribution (%) of the erosion-prone area in 20150519 and 20161010 scenes for maps in Fig. 8 prone areas. Because we were unable to access the actual site, it is impossible to judge the accuracy of the results obtained; however, a decision can be made on the basis of the average areal distribution. As shown in Table 4, we used the average of all the results to approximate the erosion-prone area in the study area. Out of the total land area of 7338.61 km 2 (excluding surface water) in the study area, around 20 and 15% are moderately and highly erosion-prone areas, respectively. The remaining 65%, which does not include erosion-prone areas, is mostly forested.

Conclusions
In this study, we applied remote sensing technology to investigate erosion-prone areas over an inaccessible area. A subset area from Kangwon Province, North Korea, was used as a test study site and Landsat Imagery from May and October as data for mapping. Satellitebased DEM and imagery were selected, indices were derived along with slope, ground truth points were carefully chosen for optimum threshold and band using pruned DT, and accuracy assessment was carried out. The results of this study can be summarized as follows: (a) Using remotely sensed data from Landsat and high-resolution imagery in Google Earth, it is possible to map different land covers and thus derive erosion-prone maps with reasonable accuracy. (b) Pruned J48 DT helped in the segmentation of the imagery based on single-band data of ratio/ index without degrading accuracy. The optimum thresholds were much easier to understand than a complex tree with a minimum of 2 objects for binary classification. (c) Although the optimum band and threshold derived from the October ground truth were able to give acceptable results, they seemed ineffective for the interchange rule as demonstrated for the 20161010 scene using May rules. However, combining the ground truths from different seasons resulted in rules giving better OA and kappa than the individual rule results. (d) From the UA, PA, and density plots, we can understand that, despite a careful selection of ground truths from a high-resolution image, good knowledge of the study area and field investigations are necessary to select ground truths. (e) The average of erosion-prone maps, except for the 20161010 scene with the May ground truth, shows that 15 and 20% are high and medium erosion-prone areas out of the total land, respectively. The remaining 65% is due to forest cover.
On the basis of these results, we found that remotely sensed satellite imagery could be used to obtain land cover and erosion-prone area maps with reasonable accuracy. Although limited to one sensor image subset in a hilly area, satellite images could be very useful for monitoring changes in forest and open land covers over inaccessible areas.
Such studies can be very useful for calculating the total cost for planning the recovery of North Korean mountains, especially the bare land identified from satellite images. Future works will compare the green coverage and land cover pattern between Kangwon Province of North Korea and Gangwon Province of South Korea.