Application of Energy-efficient Data Gathering to Wireless Sensor Network by Exploiting Spatial Correlation

This paper studies energy-saving data gathering strategies based on spatial correlation in wireless sensor networks (WSNs). First, the factors influencing the effect of spatial correlation on distortion are discussed, and we show that representative nodes can be selected to reduce data transmission within a certain range of distortion. Second, the performance of the greedy corrected clustering (GCC) and k-means algorithms is analyzed. An energy-efficient gathering scheme is proposed by applying the GCC and k-means algorithms to a typical low-energy adaptive clustering hierarchy (LEACH) protocol; simulation results show that the scheme can save energy, reduce distortion, and prolong network lifetime.


Introduction
A wireless sensor network (WSN) is usually an event-driven system where several nodes try to transmit data when any physical phenomenon of interest is detected. The main objective of the WSN is to reliably estimate event features from the collective information provided by sensor nodes. It is a challenging problem to collect data continuously from a WSN with limited energy and bandwidth.
In a densely deployed WSN, sensing data are likely spatially correlated because one sensor's information can be inferred from its neighboring sensors. Therefore we can remove or reduce the redundancy in the data and reduce communication overhead and energy consumption in a network.
There have been several studies of energy-efficient protocols in WSNs. Some approaches seek to optimize communication protocols that spread congestion and energy consumption evenly throughout the network. (1) Many techniques, on the other hand, design a protocol by considering spatial correlation. Pradhan et al. investigated information-theoretic aspects of correlation in a WSN. (2) Intanagonwiwat proposed a method that exploits the spatial correlation inherent in sensor network data in combination with a traditional routing protocol. (3) Yang et al. applied compressed sensing (CS) theory to gather and reconstruct sparse signals in energy-constrained large-scale WSNs. (4) However, most of these studies did not allow for efficient data gathering that exploits correlations in WSNs. (5) In this paper, we propose an energy-efficient data gathering method that divides the network into clusters of spatially correlated sensors by a greedy corrected clustering (GCC) or k-means algorithm and suppresses data transmission within a certain distortion level and with minimum energy expenditure.
The remainder of the paper is organized as follows: The network model and assumptions are introduced, and the factors influencing spatial correlation are discussed in Sect. 2. The correlated clustering methods based on GCC and k-means are introduced and a possible way to improve the protocol is proposed in Sect. 3. The results of simulation and relevant analysis are given in Sect. 4. Finally, conclusions are summarized in Sect. 5.

Model of spatial correlation
The correlation model for information collection by N sensors in an event area is shown in Fig. 1. (6) The sink estimates the event source S from the observations of the sensor nodes $n_i$, assuming that the samples are temporally independent. Each observed sample $X_i$ of sensor $n_i$ is represented as
$$X_i = S_i + N_i, \tag{1}$$
where the subscript i denotes the spatial location of node $n_i$. The event $S_i$ and observation noise $N_i$ are modeled as zero-mean Gaussian random variables with variances $\sigma_S^2$ and $\sigma_N^2$, respectively. The correlation model is a power exponential model, expressed as (6)
$$\rho(i,j) = e^{-(d(i,j)/\theta)^{\alpha}}, \tag{2}$$
where $\rho(i,j)$ and $d(i,j)$ are the correlation coefficient and the distance between nodes $n_i$ and $n_j$, respectively. The parameters α and θ control the correlation range between sensors; here α = 1. Each node encodes its $X_i$ as $Y_i = f_i(X_i)$ under a power constraint $P_E$ and sends it to the sink through the WSN. The encoders and the decoders are labeled E and D in Fig. 1, respectively. The sink decodes each $Y_i$ using the minimum mean squared error (MMSE) estimator to obtain the estimate $\hat{S}$ of the event. (7) The distortion achieved by using M packets to estimate the event S is the mean squared error
$$D(M) = E\big[(S - \hat{S})^2\big]. \tag{5}$$
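As a worked illustration of how the distortion depends on the correlation structure, the mean squared error can be expanded term by term. This is a sketch under assumptions that the text does not confirm: the sink is assumed to average the M per-node MMSE estimates, the noise is assumed independent of the event and across nodes, and $\rho(s,i)$ (notation introduced here) denotes the correlation between the source S and $S_i$.

```latex
\begin{align*}
\hat{S} &= \frac{1}{M}\sum_{i=1}^{M} a X_i,
  \qquad a = \frac{\sigma_S^2}{\sigma_S^2+\sigma_N^2},\\
D(M) &= E\big[(S-\hat{S})^2\big]
  = \sigma_S^2 - \frac{2a}{M}\sum_{i} E[S X_i]
  + \frac{a^2}{M^2}\sum_{i}\sum_{j} E[X_i X_j],\\
E[S X_i] &= \sigma_S^2\,\rho(s,i), \qquad
E[X_i X_j] = \sigma_S^2\,\rho(i,j)\ (i \neq j), \qquad
E[X_i^2] = \sigma_S^2+\sigma_N^2,\\
D(M) &= \sigma_S^2
  - \frac{\sigma_S^4}{M(\sigma_S^2+\sigma_N^2)}\Big(2\sum_{i}\rho(s,i) - 1\Big)
  + \frac{\sigma_S^6}{M^2(\sigma_S^2+\sigma_N^2)^2}\sum_{i}\sum_{j\neq i}\rho(i,j).
\end{align*}
```

Under these assumptions, the second term rewards nodes that are strongly correlated with the source, while the third term penalizes nodes that are strongly correlated with each other.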

Factors impacting spatial correlation
In WSNs, sensor nodes are usually distributed in a zone and the related information is sent to the sink for centralized processing. When a certain condition (such as temperature or humidity) exists, the nodes under that condition are aware of this information. There is a strong correlation between nodes that are close to each other. Therefore, to satisfy the sensing precision, some nodes are chosen as representative nodes (RNs) to send their data, rather than having all nodes in the network send data. In this way, correlation clusters are formed by taking the representative node as the center; the correlation radius is $R_c$.
By exploiting the spatial correlation of sensing data, the energy consumed by data transmission and the collisions between sensor nodes can be reduced greatly. Selecting the minimum number of representative nodes M among the N nodes is crucial and can be expressed as
$$\min M \quad \text{subject to} \quad D(M) \le D_{\max}, \tag{6}$$
where $D_{\max}$ is the maximum allowable distortion.
In an area of 500 × 500 m², 50 nodes were distributed randomly, and some nodes were chosen as the representative nodes. Using the model in Eq. (2), θ was varied from 10 to 1000. For each value of θ, the sink calculated the distortion between the information collected from the representative nodes and the actual information from all nodes according to Eq. (5). The results are shown in Fig. 2.
From Fig. 2, we can see that as θ and the number of representative nodes increase, the observed event distortion decreases because of the highly redundant data sent by the sensor nodes that are close to each other.
Moreover, for a fixed number of representative nodes, the minimum distortion can be achieved by choosing the nodes which are located as close to the source as possible and as far apart from each other as possible.
Therefore, we can exploit the spatial correlations between sensing data by choosing appropriate representative nodes among all the nodes to reduce the data forwarded to the sink. This method can not only save energy without degrading the distortion achieved at the sink but can also reduce contention within the wireless medium.
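The effect of θ and of node placement on distortion can be checked numerically with a minimal sketch. It assumes the power exponential form exp(-(d/θ)^α) for the correlation and assumes that the sink averages the per-node MMSE estimates; the function names, coordinates, and default variances are hypothetical and are not taken from the paper's simulation.

```python
import math

def correlation(d, theta, alpha=1.0):
    """Power exponential correlation (assumed form): exp(-(d/theta)**alpha)."""
    return math.exp(-((d / theta) ** alpha))

def distortion(source, nodes, theta, var_s=1.0, var_n=0.0, alpha=1.0):
    """MSE of the averaged per-node MMSE estimate of the source (assumed estimator)."""
    m = len(nodes)
    a = var_s / (var_s + var_n)  # per-node MMSE gain
    # Cross term: correlation between the source and each representative node.
    cross = sum(correlation(math.dist(source, p), theta, alpha) for p in nodes)
    # Pairwise term: correlation among the representative nodes themselves.
    pair = sum(correlation(math.dist(p, q), theta, alpha)
               for i, p in enumerate(nodes)
               for j, q in enumerate(nodes) if i != j)
    return (var_s - 2 * a * var_s * cross / m
            + a * a * (m * (var_s + var_n) + var_s * pair) / (m * m))

# Hypothetical layout: source in the middle of a 500 x 500 m field.
source = (250.0, 250.0)
near = [(240.0, 250.0), (260.0, 250.0)]   # representatives close to the source
far = [(50.0, 50.0), (450.0, 450.0)]      # representatives far from the source

for theta in (10.0, 100.0, 1000.0):
    print(theta, distortion(source, near, theta), distortion(source, far, theta))
```

In this sketch the representatives near the source give a lower distortion than the distant ones, and a larger θ lowers the distortion for the nearby pair, which is in line with the trend described for Fig. 2.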

Greedy corrected clustering method
As discussed in Sect. 2.2, the correlation of data can be considered in the design of data gathering. The GCC algorithm is a clustering method suited to correlated data. (8) When the GCC algorithm is used, sets of correlated clusters are formed and all the nodes within a cluster are considered highly correlated. The information is observed by multiple sensor nodes in the event area, which creates redundant reports. Only a few nodes need to report the sensory data, and the remaining nodes can remain in a silent state to save energy. Suppose we are given n points and want to find k clusters. A cluster is a subset of the n points, called $C_j$. The GCC algorithm is shown in Table 1.
First, randomly select a node $n_i$ from all nodes and a node $n_j$ from the no-cluster nodes. Then calculate the distance d(i,j) between the two nodes. Put the nodes that meet the condition d(i,j) < ξ into the set $C_k(i,j)$, where k is the index of the cluster and ξ is the correlation threshold. If a node does not satisfy the condition, choose the two most relevant nodes from the cluster members and repeat the above steps until the nodes form cluster k.
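The threshold test at the heart of this procedure can be sketched as follows. This is a simplified illustration and not the GCC algorithm of Table 1: it takes the first unclustered node as the seed instead of a random one (for reproducibility), it omits the correction step, and the function name `greedy_threshold_clusters` is hypothetical.

```python
import math

def greedy_threshold_clusters(points, xi):
    """Group points so every member lies within distance xi of its cluster seed."""
    unclustered = list(range(len(points)))
    clusters = []
    while unclustered:
        seed = unclustered.pop(0)          # seed: first node with no cluster yet
        members = [seed]
        remaining = []
        for j in unclustered:
            # d(i, j) < xi: node j is treated as correlated with the seed
            if math.dist(points[seed], points[j]) < xi:
                members.append(j)
            else:
                remaining.append(j)
        unclustered = remaining            # nodes still waiting for a cluster
        clusters.append(members)
    return clusters

# Hypothetical node coordinates in metres.
nodes = [(0, 0), (1, 0), (10, 10), (11, 10), (50, 50)]
clusters = greedy_threshold_clusters(nodes, xi=5.0)
print(clusters)  # lists of node indices, one list per cluster
```

A smaller ξ yields more, tighter clusters; a larger ξ yields fewer clusters with looser correlation among their members.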

K-means method
K-means clustering is another algorithm used to classify or group objects into k groups based on their features. Criteria such as the Bayesian information criterion (BIC) or minimum description length (MDL) can be used to estimate k automatically.
Given a set X of n points in a d-dimensional space and an integer k, the task is to choose a set of k points $\{c_1, \ldots, c_k\}$ in the d-dimensional space to form clusters $\{C_1, C_2, \ldots, C_k\}$ such that Eq. (7) is minimized:
$$\sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - c_j \rVert^2. \tag{7}$$
The method is very simple to implement. The grouping is done by minimizing the sum of the squares of the distances between a data point and the corresponding cluster centroid. Therefore, the purpose of k-means clustering is to classify data relatively evenly. The process of the k-means algorithm is shown in Table 2.

Table 2 k-means algorithm.
Input: The coordinates of all nodes.
Output: Clusters $C_k(i, j)$.
Step 1. Randomly pick k cluster centers $\{c_1, \ldots, c_k\}$.
Step 2. For each i, set the cluster $C_i$ to be the set of points in X that are closer to $c_i$ than they are to $c_j$ for all j ≠ i.
Step 3. For each i, let $c_i$ be the center of cluster $C_i$ (the mean of the vectors in $C_i$).
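The three steps can be sketched in plain Python as below. Repeating Steps 2 and 3 until the assignment stops changing is the usual Lloyd iteration and is an assumption here, since the steps listed do not state a stopping rule; the `seed` and `max_iter` parameters are likewise illustrative.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's iteration: assign points to the nearest center, then recompute means."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)                 # Step 1: random initial centers
    assignment = None
    for _ in range(max_iter):
        # Step 2: each point joins the cluster of its nearest center.
        new_assignment = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                          for p in points]
        if new_assignment == assignment:            # no change: converged
            break
        assignment = new_assignment
        # Step 3: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                             # keep the old center if a cluster is empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, assignment

# Hypothetical node coordinates forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centers, labels = kmeans(points, k=2)
print(sorted(centers), labels)
```

Because the result depends on the initial centers, practical use often runs the procedure several times and keeps the solution with the smallest sum of squared distances.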

Table 1 GCC algorithm.
Input: The coordinates of all nodes.
Step 2. Choose a node $n_i$ from the no-cluster nodes. Calculate the distance d(i, j) between nodes $n_i$ and $n_j$.
Step 4. Put $n_i$ and $n_j$ into cluster $C_k = \{n_i, n_j\}$.
Step 5. If a no-cluster node $n_i$ exists, go to Step 2.

Proposed data gathering scheme
The data gathering model is a hierarchical model shown in Fig. 3. In the first level, the entire sensor field is divided into several correlated sub-regions and a subset of nodes is selected as representative of the regions using the GCC or k-means algorithm. In the second level, these RNs later execute a dynamic low-energy adaptive clustering hierarchy (LEACH) algorithm to gather data during each round. In each round, only the representative nodes collect data using the dynamic clustering protocol.
The operation of our scheme is divided into rounds. Each round consists of two phases: a set-up phase and a steady-state phase. During the set-up phase, cluster heads are determined and the clusters are organized. During the steady-state phase, data are transferred to the base station. Our scheme works with rounds in the same way as a typical LEACH protocol.
1) At the beginning of the network set-up phase, the sink broadcasts a packet containing the correlation radius $R_c$ to all nodes, correlated clusters are formed using the GCC or k-means algorithm, and RNs are selected from all nodes.
2) In each correlated cluster, the RN receives the raw sensing data from the other ordinary nodes (ONs) and calculates the accuracy of the information between the RN and each ON to determine whether it meets the distortion constraint. If the distortion of an ON is larger than the threshold, the node is labeled as a non-correlated node (NCN) and forms a new cluster independently.
3) The sink collects the average distortion of every cluster and the number of NCNs from the RNs and determines whether $R_c$ is appropriate. If it is not, an updated value of $R_c$ is broadcast.
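Steps 2) and 3) can be sketched as follows. The text does not specify the distortion metric or how $R_c$ is updated, so both are assumptions here: the per-node distortion is taken as the squared difference between the RN and ON readings, and `update_radius` with its `max_ncn_ratio` and `shrink` parameters is a hypothetical rule used only for illustration.

```python
def split_cluster(rn_value, on_values, d_max):
    """Return (correlated, ncn) node-id lists for one correlated cluster.

    Assumption: the per-node distortion is the squared difference between
    the RN reading and the ON reading; the source does not give the metric.
    """
    correlated, ncn = [], []
    for node_id, value in on_values.items():
        if (value - rn_value) ** 2 > d_max:
            ncn.append(node_id)          # labeled NCN: forms its own cluster
        else:
            correlated.append(node_id)   # stays in the cluster; the RN reports for it
    return correlated, ncn

def update_radius(r_c, ncn_count, node_count, max_ncn_ratio=0.1, shrink=0.8):
    """Hypothetical sink rule: shrink R_c when too many nodes are non-correlated."""
    if node_count and ncn_count / node_count > max_ncn_ratio:
        return r_c * shrink
    return r_c

# Hypothetical temperature readings of three ordinary nodes; the RN reads 20.0.
readings = {"n1": 20.1, "n2": 20.4, "n3": 25.0}
correlated, ncn = split_cluster(rn_value=20.0, on_values=readings, d_max=1.0)
new_r_c = update_radius(r_c=30.0, ncn_count=len(ncn), node_count=len(readings))
print(correlated, ncn, new_r_c)
```

Shrinking the radius when many NCNs appear trades a larger number of clusters for a lower distortion within each cluster.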

Performance of GCC and k-means
Using the scenario in Fig. 3, 200 nodes are randomly distributed in a region of 200 × 200 m², and the sink is placed at the center. The GCC algorithm and the k-means algorithm are used to cluster the nodes. Each member of a correlation cluster is connected to its cluster head, and the cluster head serves as the representative node to which the other members send their collected data. The correlation model is an exponential model, as shown in Eq. (2).
The clustering results of GCC and k-means are shown in Figs. 4 and 5, respectively. We can see from Figs. 4 and 5 that the nodes are more evenly distributed among clusters when the k-means algorithm is used.
The distortions of correlation clustering are shown in Fig. 6. As the number of nodes increases, the average distortion tends to decrease gradually for both GCC and k-means. This means that a higher cluster density indicates a stronger correlation among data points, which results in a better estimation.
These results show that when the number of clusters is small, the average distortion of the k-means algorithm is larger, but the distortion becomes smaller when the number of clusters increases. This occurs because the k-means algorithm ensures that the average distance of the sensor nodes to their corresponding centroid is the same, so that the final location of the centroid is a given distance from each sensor node, resulting in less distortion.

Energy efficiency of proposed scheme
We evaluated the effectiveness of our scheme with simulations. In a simulation, N sensor nodes are randomly distributed in a square region of 200 × 200 m² with a sink in the center of the region. The parameters used in the simulation are summarized in Table 3.
The results in Fig. 7 show the relationship between the number of sensor nodes that remained alive and the number of rounds. It can be seen from the figure that the life-span of the WSN using correlation is longer than that of a traditional LEACH.
In our scheme, after the node selection phase during each round, only the representative nodes remain active while non-representative nodes go to sleep. Consequently, the number of active nodes is much smaller than that of LEACH. Since only the RNs participate in the data dissemination, the number of data transmissions is greatly decreased. Therefore, the energy consumption is greatly reduced.
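The saving can be illustrated with the first-order radio model commonly used in LEACH evaluations. This is a minimal sketch: the constants below are typical values from the LEACH literature and are assumptions, not the Table 3 parameters; the node distances and packet size are hypothetical, and receive energy is ignored.

```python
E_ELEC = 50e-9     # J/bit, electronics energy (assumed; not taken from Table 3)
EPS_AMP = 100e-12  # J/bit/m^2, amplifier energy (assumed)

def tx_energy(bits, distance):
    """First-order radio model: energy to transmit `bits` over `distance` metres."""
    return E_ELEC * bits + EPS_AMP * bits * distance ** 2

def round_energy(distances, bits=2000):
    """Total transmit energy for one round, one packet per active node."""
    return sum(tx_energy(bits, d) for d in distances)

all_nodes = [20.0, 35.0, 50.0, 65.0, 80.0]   # hypothetical distances to the cluster head (m)
representatives = all_nodes[::2]             # only every other node stays active
saving = 1 - round_energy(representatives) / round_energy(all_nodes)
print(f"fraction of transmit energy saved per round: {saving:.2f}")
```

The saved fraction grows with the number of nodes that can stay silent, which is why suppressing correlated transmissions extends the network lifetime.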
It is also seen in Fig. 7 that the lifetime using k-means is longer than that using GCC, because k-means minimizes the sum of squared distances between each node and its nearest centroid, which yields more compact and more evenly sized clusters.

Conclusions
Spatial correlation among sensor nodes can be exploited not only to keep the distortion within a certain range, but also to avoid transmitting redundant data and consuming too much energy.
Our work shows that, by exploiting spatial correlation, the energy consumption of the nodes can be decreased and the lifetime of the system increased with an acceptable level of distortion in the data, and transmissions from redundant nodes can thereby be controlled.
Future work includes the study of an adjustable scheme to achieve an adaptive correlation radius, finding a dynamic optimal correlation radius, and determining the optimal number of correlated clusters. In addition, approaches to efficient medium access and reliable event transport that exploit spatial correlation in WSNs will also be considered.