NeuralIO: Indoor–Outdoor Detection via Multimodal Sensor Data Fusion on Smartphones

The indoor–outdoor (IO) status of mobile devices is fundamental information for various smart city applications. In this paper, we present NeuralIO, a neural-network-based method for dealing with the IO detection problem for smartphones. Multimodal data from various sensors on a smartphone are fused through neural network models to determine the IO status. A data set containing more than one million labeled samples is then constructed. We test the performance of an early fusion scheme in various settings. NeuralIO achieves an accuracy above 98% in 10-fold cross-validation and an accuracy above 90% in a real-world test.


Introduction
The past decade has witnessed the flourishing of the Internet of Things (IoT) and its applications in urban spaces. The widespread deployment of IoT devices and the rise of smart cities are giving birth to an increasing number of smart applications. (1)(2)(3)(4) Context status is critical and fundamental information for ubiquitous computing systems and context-aware IoT applications. (5,6) "Context" consists of a wide range of aspects such as location, time, surrounding environment, and so on. The rapid growth of smartphones is driving the increasing interest in context-aware applications. (7)(8)(9) One of the most fundamental items of contextual information is whether the device is in an indoor or outdoor environment, because it makes a significant difference if the user is standing in front of a shopping mall or inside a shopping mall. Furthermore, the availability and capabilities of different technologies vary considerably between these two environments. The knowledge about the indoor-outdoor (IO) status enables the use of appropriate technologies, which leads to a better user experience. For instance, a device can trigger a reminder, change the working mode, and switch between GPS-based navigation and indoor navigation schemes when the user enters or leaves an indoor environment. Furthermore, the device can save energy by turning off the GPS module in indoor environments such as a metro station. Existing IO detection approaches commonly use a GPS signal, (10)(11)(12)(13) a wireless signal (16,23,26,28) and other sensor data (15,17,20,21,27) to determine the IO status. Owing to the rich characteristics of natural phenomena, it is rare that a single modality provides comprehensive knowledge of the phenomenon of interest. (18) The increasing availability of multiple sensing modalities on smartphones offers us more freedom to recognize the context. 
The capability of neural network models has been proven superior in solving increasingly complex machine learning problems that often involve multiple data modalities. (22) In this paper, we propose NeuralIO to detect the IO status of smartphones through multimodal sensor data fusion using neural network models. We create a data set containing more than one million labeled samples involving nine users. Nine different sensing modalities, which are acceleration, GPS, light, magnetic field, proximity, cellular signal strength, sound level, temperature, and WiFi, are covered in the data set. We test the performance of an early fusion scheme in various settings.
To summarize, the contributions of this study are as follows.
1. We apply neural network models to the IO detection problem and perform a comprehensive analysis.
2. We implement an Android app for data collection and conduct experiments to collect data samples in various real daily scenarios. A data set containing more than one million labeled data samples is constructed.
3. We evaluate the performance of an early fusion scheme based on the data set through cross-validation and a real-world test. An accuracy above 98% is achieved in cross-validation and an accuracy above 90% is achieved in the real-world test.
The rest of the paper is organized as follows. Section 2 presents related works. Different fusion schemes are introduced in Sect. 3. The experiment and data collection are described in Sect. 4 and evaluation results are presented in Sect. 5. We conclude our work in Sect. 6.

GPS
GPS signals are highly dependent on the line-of-sight (LOS) paths between the device and GPS satellites. It is well known that GPS signals are poor in indoor environments as the LOS paths of GPS signals are blocked. On the other hand, the LOS paths are not blocked in most outdoor scenarios. On the basis of these facts, the localization accuracy of GPS or the availability of GPS signals has been exploited to determine whether a device is in an indoor or outdoor environment. (10)(11)(12)(13) Despite the intuitive nature and easy implementation of GPS-based methods, they suffer from several disadvantages. Firstly, they consume considerable power: Radu et al. identified the GPS chipset as the sensor with the highest power consumption among the evaluated sensors. (21) The battery capacity is still limited in state-of-the-art mobile phones and most users dislike applications that drain the battery. Secondly, the intuition behind these methods is not always reliable. For instance, GPS signals are reasonably strong if a device is in an indoor environment with large windows. In contrast, GPS signals can be blocked by surrounding mountains if a device is in a valley. Under these circumstances, GPS-based methods may give misleading results. Thirdly, it normally takes around one minute to launch the GPS module, making GPS-based methods unsuitable for real-time applications.

Wireless signals
Shtar et al. (23) presented a method of continuous IO environment detection on mobile devices that was based solely on WiFi fingerprints and assumed no prior knowledge of the environment. The model trained with the data collected for only a few hours on a single device was applicable to unknown locations and new devices. WifiBoost (16) used a machine learning meta-algorithm that combined an adequate ensemble of simple classifiers (so-called weak learners) to improve the overall performance. An average error rate of around 2.5% was achieved in the evaluation. However, a classifier had to be created for each building and its surrounding area through measurements and labeling of each measurement point, especially in those cases where there was no previous fingerprinting database. Building such a database is not a trivial task.
Wang et al. (26) applied a machine learning algorithm to classify the signal strengths of neighboring cellular base stations in different environments and identified the current context by signal pattern recognition. An accuracy of 100% was reported for the identification of open outdoors, semi-outdoors, light indoors, and deep indoors.
In Ref. 28, low-power iBeacon technology was leveraged to develop an accurate, fast-response, and energy-efficient scheme for IO detection. The transitions between outdoors and indoors were detected by comparing the received signal strengths of two predeployed Bluetooth beacons on two sides of each entrance.

Multiple sensors
Since a single sensor might not be able to tackle all application scenarios, data from multiple sensors such as accelerometers, proximity and light sensors, wireless receivers, and magnetometers were exploited for IO detection. (15,17,20,21,27) IODetector (27) combined data from three lightweight sensors (light, cell tower signal strength, and magnetic sensors) to develop an extensible IO detection framework that did not require a training phase. Although acceptable error rates were achieved, Radu et al. (21) criticized IODetector for its hard-coded thresholds that might not work with new devices and environments. As an alternative, they proposed a semi-supervised training method to improve IO detection accuracy across different devices and environments.

Other methods
In Ref. 19, the embedded digital camera on a mobile phone was utilized for IO detection. The developed gentle boosting classifier achieved error rates of 1.7% for indoor scenes and 10.8% for outdoor scenes. In addition, a feedforward neural network was trained with the gist feature of images to address the IO detection problem. (25) These methods help to generate semantic IO labels for images but are not suitable for tracking and other real-time applications.
Sung et al. (24) developed a sound-based IO detection method using a chirp signal. A simple classifier was developed with a static threshold. However, this work was rather simple and straightforward, and no comprehensive analysis was performed. Wang et al. conducted a comprehensive study on an audio-based IO detection method. The method was evaluated in various scenarios with different probing signals (MLS and chirp), noise levels, and device types.

Fusion Scheme
Neural networks offer the flexibility of implementing multimodal sensor fusion as early, late, or intermediate fusion. (22) As shown in Fig. 1(a), in the early fusion scheme, data from multiple sources are integrated into a single feature vector that serves as the input of one machine learning model. In contrast, the late fusion scheme aggregates decisions from multiple models that are each trained separately on their own modality, as shown in Fig. 1(b). This fusion architecture is often favored because errors from multiple classifiers tend to be uncorrelated and the method is feature-independent. (22) For traditional machine learning methods, it is typically necessary to manually extract features from each modality, which is not only time-consuming but also challenging. Neural networks are known for being able to learn features automatically. In this paper, we use the feedforward neural network (FNN) model to conduct early fusion for the IO detection problem.
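To make the distinction between the two schemes concrete, the following minimal Python sketch contrasts them on a single sample. The modality names, feature values, and stand-in classifiers are hypothetical and chosen only for illustration; they are not the models or features used in this paper.

```python
from typing import Callable, Dict, List

# Hypothetical per-modality feature vectors for one sample (illustrative values).
sample: Dict[str, List[float]] = {
    "light": [120.0],
    "wifi": [7.0, -48.0],       # number of networks, strongest signal strength
    "gps": [3.0, 35.0, 18.5],   # satellites, accuracy (m), signal-to-noise ratio
}

def early_fusion_input(features: Dict[str, List[float]]) -> List[float]:
    """Early fusion: concatenate all modalities into one feature vector
    that is fed to a single model."""
    fused: List[float] = []
    for name in sorted(features):  # fixed order keeps the vector layout stable
        fused.extend(features[name])
    return fused

def late_fusion_decision(
    features: Dict[str, List[float]],
    models: Dict[str, Callable[[List[float]], float]],
) -> float:
    """Late fusion: each modality has its own model; their indoor
    probabilities are aggregated (here, by a plain average)."""
    probs = [models[name](features[name]) for name in sorted(features)]
    return sum(probs) / len(probs)

# Stand-in per-modality classifiers returning P(indoor); assumptions only.
toy_models = {
    "light": lambda x: 0.9 if x[0] < 500.0 else 0.1,
    "wifi": lambda x: 0.8 if x[0] >= 5 else 0.3,
    "gps": lambda x: 0.7 if x[0] < 4 else 0.2,
}

fused_vector = early_fusion_input(sample)          # input for one joint model
p_indoor = late_fusion_decision(sample, toy_models)  # averaged per-modality votes
```

In the early scheme a single model sees all six values at once and can learn interactions between modalities, whereas in the late scheme only the per-modality decisions are combined.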

App design and implementation
We have developed an Android app for data collection. The Android app is implemented with Android Studio. The target version of the application is 27 with a minimum version of 19. This covers the Android smartphones of all participants in this study. The app needs to access multiple sensors on the smartphone and save the sensor readings to a database. The collected data comprise the battery temperature, luminance, magnetic flux density, proximity, cellular signal strength, cellular network bit error rate, an abstract level for the overall signal strength ranging from one to four, the number of WiFi networks around the user, the highest signal strength of the WiFi networks around the user, the number of GPS satellites, the GPS accuracy in meters, the GPS signal-to-noise ratio, and the ambient noise level. Additionally, some anonymous information about the device is recorded to distinguish different data traces. It is crucial that the user interface of the smartphone application allows users to modify data labels or remove collected data, since they may make mistakes when they log data. Also, the process of starting and stopping the data collection should be fast and simple for the user. Figure 2 shows a screenshot of the developed app. The user specifies whether he/she is indoors or outdoors and inputs the current weather condition. Then, he/she has the option to provide notes on the location and his/her name. The user starts the logging period for either 10 min, 30 min, or an unlimited amount of time. If, for example, the user walks indoors while logging data labeled outdoors, he/she has the option to invalidate the last 5, 15, or 30 min of the collected data. The user can stop the logging process at any time. The application collects the specified information every 200 ms as one JavaScript Object Notation (JSON) object. The data are then sent to an instance of the Firebase Realtime Database (DB). (14) This ensures that every user writes directly to the same database and no data are saved locally on the user's device. From there, the data can be downloaded for further processing. This process is displayed in Fig. 3.
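As an illustration of what one logged sample might look like, the sketch below serializes a labeled sensor snapshot as a JSON object. It is written in Python for brevity rather than in the app's own Android code, and every key name, the device identifier, and the example readings are assumptions; the actual schema of the app is not given in this paper. The upload to the database is deliberately left out.

```python
import json
import time

SAMPLING_INTERVAL_S = 0.2  # one sample every 200 ms, as described above

def build_sample(label: str, weather: str, readings: dict, device_id: str) -> str:
    """Serialize one labeled sensor snapshot as a JSON object.
    All key names are hypothetical, not the app's actual schema."""
    if label not in ("indoor", "outdoor"):
        raise ValueError("label must be 'indoor' or 'outdoor'")
    record = {
        "timestamp_ms": int(time.time() * 1000),
        "label": label,
        "weather": weather,
        "device_id": device_id,   # anonymous identifier for the data trace
        "readings": readings,     # unavailable sensors are stored as None (null)
    }
    return json.dumps(record)

example = build_sample(
    label="indoor",
    weather="cloudy",
    device_id="device-01",
    readings={
        "battery_temperature_c": 29.5,
        "luminance_lx": 180.0,
        "wifi_network_count": 6,
        "gps_satellite_count": None,  # e.g., no GPS fix available
    },
)
decoded = json.loads(example)  # round-trip to confirm the object is valid JSON
```

Storing unavailable readings as explicit nulls, as assumed here, would make incomplete samples easy to identify during later cleaning.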

Data collection
The smartphone application was handed out to multiple participants for data collection. The users were instructed about the application and how to use it. The data collection ran for four weeks and users were free to choose the time and environment for data logging. Nine users participated in the data collection campaign and various models of smartphones were used for data collection. The users collected the data in their daily life in both urban and rural areas. This ensures the diversity of the data set. Figure 4 shows typical data logging scenarios.

Overview of data set
The distribution before cleaning for different smartphones is illustrated in Fig. 5. Different smartphones also represent different users.
After removing the samples invalidated by the users themselves, 1019091 samples are left; this number of samples is equivalent to about 56.5 h of data. However, not every collected sample is complete, for various reasons. Figure 6 illustrates how many samples are missing for each sensing modality. After removing the incomplete samples, the resulting data set includes 623320 samples, which correspond to around 34.5 h of data. The ratio of indoor to outdoor samples is now 43.98% to 56.02%. Figures 7-9 show the data distribution regarding rural/urban environments, weather conditions, and time, respectively. We can see that the data set contains various data samples with a balanced distribution.
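The cleaning step and the conversion from sample counts to recording time can be sketched as follows. This is a minimal Python illustration with hypothetical field names and toy data, not the processing code used for the data set. At the nominal 200 ms sampling interval, the sample counts above correspond to roughly 56.6 h and 34.6 h, which is close to the reported durations; the actual logging interval on a device may deviate slightly from the nominal value.

```python
from typing import Dict, List, Optional

Sample = Dict[str, Optional[float]]

def drop_incomplete(samples: List[Sample], required: List[str]) -> List[Sample]:
    """Keep only samples that have a value for every required field."""
    return [s for s in samples if all(s.get(k) is not None for k in required)]

def indoor_share(samples: List[Sample]) -> float:
    """Fraction of samples labeled indoor (1.0 = indoor, 0.0 = outdoor)."""
    return sum(1 for s in samples if s["label"] == 1.0) / len(samples)

def recording_hours(n_samples: int, interval_s: float = 0.2) -> float:
    """Nominal recording time for n_samples at a fixed sampling interval."""
    return n_samples * interval_s / 3600.0

# Toy data with hypothetical field names.
raw = [
    {"label": 1.0, "light": 150.0, "wifi_count": 5.0},
    {"label": 0.0, "light": 9000.0, "wifi_count": None},   # incomplete
    {"label": 0.0, "light": 12000.0, "wifi_count": 1.0},
    {"label": 1.0, "light": None, "wifi_count": 8.0},       # incomplete
]
clean = drop_incomplete(raw, required=["light", "wifi_count"])
```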

Cross-validation
We used 10-fold cross-validation to evaluate the performance of the constructed model with various numbers of hidden units and layers. Finally, we obtained a good balance between performance and model complexity by using the architecture in Fig. 10. The input layer with 24 input nodes is omitted owing to the limited space. There are four hidden layers with 10, 5, 4, and 3 hidden units, respectively, each using the ReLU function as the activation function. The output unit uses the sigmoid function as the activation function. As shown in Table 2, the results of 10-fold cross-validation demonstrate that the model performs very well in nine out of 10 folds; in the fifth fold, the model only achieves an accuracy of 0.73. This is probably due to the training process becoming trapped at a local minimum of the loss function.
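The layer sizes described above can be written down as a small forward pass. The sketch below is a dependency-free Python illustration with random, untrained weights; the initialization range, the zero biases, and the reading of the output as P(indoor) are assumptions, and it does not reproduce the training procedure or framework used in this work.

```python
import math
import random
from typing import List

LAYER_SIZES = [24, 10, 5, 4, 3, 1]  # input, four hidden layers, output

def init_params(sizes: List[int], seed: int = 0):
    """Random weights and zero biases for each layer (untrained placeholder)."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        params.append((weights, biases))
    return params

def forward(params, x: List[float]) -> float:
    """Forward pass: ReLU on hidden layers, sigmoid on the single output."""
    activation = x
    for layer_index, (weights, biases) in enumerate(params):
        z = [sum(w * a for w, a in zip(row, activation)) + b
             for row, b in zip(weights, biases)]
        is_output = layer_index == len(params) - 1
        if is_output:
            activation = [1.0 / (1.0 + math.exp(-v)) for v in z]
        else:
            activation = [max(0.0, v) for v in z]
    return activation[0]

def count_parameters(sizes: List[int]) -> int:
    """Total number of weights and biases in the fully connected network."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))

params = init_params(LAYER_SIZES)
p = forward(params, [0.0] * 24)  # output in (0, 1), read here as P(indoor)
```

With these layer sizes the network has only 348 trainable parameters (250 + 55 + 24 + 15 + 4), which illustrates how compact the chosen architecture is.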

Real-world test
To verify the performance of the model in the real world, we tested the trained FNN model on the real-world data set recorded around two months later than the training data set. During the collection of the data set, the user walked through the city as depicted in Fig. 11. The trace covers indoor environments, such as campus buildings and shopping malls, and outdoor environments, such as streets.
The confusion matrix is shown in Fig. 12. Generally, the model performed well in the real-world test with an overall accuracy of 91%. Specifically, the model can recognize indoor cases with a precision of 96% with 4% falsely classified as outdoors. The model achieves a precision of 88% in outdoor cases with 12% of all outdoor cases falsely classified as indoors. The model shows good generalizability on the new data set. To investigate the cause of the misclassification of the model, we plot the labels of all data entries against the index in Fig. 13. As shown in the figure, there are some isolated misclassifications in both indoor and outdoor cases. Making the common-sense assumption that it is very rare for people to switch between indoor and outdoor states in a short time period (for instance, 2 s), we can use a majority voting strategy with a sliding window to filter out the isolated misclassification cases. The basic idea is that the IO state is not only determined by the input data, but also depends on the previous predicted labels in the sliding window. As shown in Fig. 14, there are fewer isolated misclassification cases after applying the majority voting strategy with a sliding window of 10. The confusion matrix in Fig. 14 also shows an increase in precision in both indoor and outdoor cases.
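The majority voting strategy can be sketched in a few lines of Python. This is one possible reading of the description above, not the exact implementation used in the test: the window is assumed to cover the current prediction and the nine preceding ones, the label encoding (1 = indoor, 0 = outdoor) is assumed, and ties are resolved by keeping the previous output, which the text does not specify.

```python
from collections import deque
from typing import Iterable, List

def majority_vote_filter(raw_labels: Iterable[int], window: int = 10) -> List[int]:
    """Smooth a stream of 0/1 predictions with a trailing majority vote.

    Each output is the majority label among the current and the previous
    (window - 1) raw predictions. Ties keep the previous output, which is
    an assumption made for this sketch.
    """
    recent = deque(maxlen=window)
    smoothed: List[int] = []
    for label in raw_labels:
        recent.append(label)
        ones = sum(recent)
        zeros = len(recent) - ones
        if ones > zeros:
            smoothed.append(1)
        elif zeros > ones:
            smoothed.append(0)
        else:
            smoothed.append(smoothed[-1] if smoothed else label)
    return smoothed

# 1 = indoor, 0 = outdoor (encoding assumed for illustration).
# One isolated outdoor prediction, followed later by a genuine transition.
raw = [1] * 12 + [0] + [1] * 7 + [0] * 20
filtered = majority_vote_filter(raw, window=10)
```

In this toy sequence the isolated prediction at index 12 is removed, while the genuine transition at index 20 is reported five samples late. At a sampling interval of 200 ms, a window of 10 spans 2 s, so the filter trades a delay of about 1 s at true transitions for the suppression of isolated errors.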

Conclusions
We developed NeuralIO, a neural-network-based multimodal fusion method for the IO detection problem on smartphones. A data set containing more than one million data samples was constructed. Nine different sensing modalities were covered in the data set. We built a feedforward neural network model for the early fusion of all available raw data. Cross-validation and a real-world test have shown the feasibility of our method for IO detection and its generalizability to a new data set.