Image-retrieval Method Using Gradient Dilation Images for Cloud-based Positioning System with 3D Wireframe Map

1 Department of Mechanical Engineering, Graduate School of Science and Engineering, Kagoshima University, 1-21-40, Korimoto, Kagoshima, Kagoshima 890-0065, Japan
2 Department of Integrated Information Technology, Graduate School of Science and Engineering, Aoyama Gakuin University, 5-10-1 Fuchinobe, Chuo, Sagamihara, Kanagawa 252-5258, Japan
3 Department of Architecture & Architectural Engineering, Graduate School of Science and Engineering, Kagoshima University, 1-21-40, Korimoto, Kagoshima, Kagoshima 890-0065, Japan


Introduction
The Internet of Things (IoT) is the network of physical objects embedded with sensors, actuators, software, and connectivity that enables these objects to be connected and to exchange data. The IoT is a highly versatile technology that is expected to pervade many situations and environments, such as homes and public spaces, by connecting resources and demands to supply efficient services. (1) Moreover, the location information of resources and demands, that is, the positions of the objects involved in a service, is essential for supplying physical services. (2) However, positioning technology suited to the IoT context has not been developed sufficiently. The requirements for IoT positioning are precision that is necessary and sufficient for the application, response speed suited to the application, long-term stability, low initial cost, low maintenance cost, low running cost, and the universality to cope with various sensors. No existing technology or study meets all of these requirements at the same time.
The problem of positioning a mobile object has long been studied in robotics. When a robot needs to localize itself without knowledge of its environment, it must create a map of the environment at the same time. Dissanayake et al. formulated this problem as simultaneous localization and mapping (SLAM), and Thrun et al. organized solutions for SLAM. (3,4) Since then, many SLAM-based localization solutions using a camera or a light-detection-and-ranging (LiDAR) sensor have been proposed and developed, and they have recently become a key technology of self-driving vehicles. (5,6) However, the large computational load and the weight of the sensors can be a bottleneck when applying these approaches to small objects in IoT applications.
An intelligent environment approach, in which an environment is customized by embedding physical equipment, has also long been in development. In automated factories and large warehouses, automated guided vehicles (AGVs) were introduced early. (7) At first, physical guides made from metal were adopted; these were subsequently replaced by wire, magnetic, and visual guides so as to reduce the cost of rerouting. As a visual marker, QR codes are easy to install and can provide precise positioning, and many researchers are therefore studying them for various applications in human living areas as well as at production sites. (8)(9)(10) Despite the easy installation, the maintenance cost cannot be cut, making it difficult for the QR code approach to contribute to the IoT scenario. Methods using the Wi-Fi signals surrounding a mobile object do not require the physical maintenance of devices. (11) However, the mapping process must be repeated periodically, and the accuracy is low. The maintenance and running costs and low accuracy are fundamental concerns when the intelligent environment approach is spread over human living areas.
To realize a positioning system with minimal maintenance and running costs as well as a low computational burden on mobile objects, we have proposed a cloud-based positioning infrastructure system named Universal Map (UMap). (12) The merits of the UMap are that it does not require any physical infrastructure in the environment and that the sensor and computational resources required on the client are minimal. The main research issues are how to prepare a map on the server and how to localize query data from a client. These problems have been studied in the field of computer vision. Visual localization in large-scale environments is often treated as an image retrieval problem. In urban environments, a geotagged image database (DB) is prepared beforehand, and the query image location is matched with the geotag information of the most similar image retrieved from the DB. (13)(14)(15) However, the main demerit of this method is that it predicts only an approximate location of the query, not an accurate 6-degree-of-freedom (DoF) pose. Another approach is to predict the 6-DoF camera pose with respect to a pre-constructed 3D map. The map usually consists of a 3D point cloud built via the Structure-from-Motion (SfM) method (16,17) or from data measured with a red-green-blue-depth (RGBD) sensor. (18) The query pose is predicted by feature matching and solving a Perspective-n-Point (PnP) problem. The demerit is that a map built at one time becomes unusable after a certain degree of environmental change has occurred.
In contrast to the above 3D maps, the UMap is composed of 3D wireframes and surfaces transformed from an architectural 3D CAD model. Once the 3D wireframe map is constructed, it remains usable permanently unless the building suffers damage. The query image is converted to a line segment image, and the camera position is predicted by retrieving the most similar line segment image in the DB generated from the 3D model. In this paper, we propose a new method that uses gradient dilation images for efficient retrieval. The blurred lines make the retrieval process robust to pixel gaps in the image caused by camera position gaps. Because of this effect, the prediction accuracy is expected to improve, and the grid interval between DB images can be widened so that a smaller DB can be organized.
Our contributions are threefold. First, we develop a generator of arbitrary perspective 2D line segment images to organize an image DB. Second, we develop a gradient dilation transform algorithm that produces blurred line segment images. Third, we develop a pixelwise-AND-based similarity-evaluation algorithm that runs on a graphic board for parallel computing. All methods are validated by detailed experiments.

Overview of cloud-based positioning system
The overview of the cloud-based positioning system named UMap is drawn in Fig. 1. The UMap consists of three subsystems: a central server that maintains a 3D wireframe map, clients that access the server to obtain their own positions, and agents that detect and report environmental changes to the server. The client uploads its newest sensing data, the server localizes the sensing data in the 3D wireframe map, and finally the localization result is downloaded to the client.
The UMap has been developed in multiple directions. Various types of data, such as a standard camera image, (12) an omnidirectional camera image, (19) and 3D line segment data from an RGBD sensor (20) and a LiDAR sensor, (21) have been confirmed to work as query data for the UMap. To improve performance, the restructuring of the DB (22) and an investigation of the allowable error between a map and an actual environment (23) have been performed. Moreover, the agent part has been developed, in which the structure edges from the 3D wireframe model and the color edges detected from the borders of a poster are integrated into a hybrid map to cope with environmental changes. (24) In this paper, we deal with the server-client part of the UMap. The workflow of the proposed method is drawn in Fig. 2. The sensing data, an image taken by the client's camera, are sent to the server. The server receives the image and detects line segments. Then, the line segment image is used as a query for the retrieval process. The image DB is created beforehand by a DB image generator (DBIG). The DBIG reads a 3D CAD model and generates arbitrary projection images. The details of each process are described in the following subsections.

Sensing and inquiry processes on the client side
The problem of posing a rigid body in 3D space generally has 6 DoFs, and the UMap can in principle cope with a 6-DoF positioning problem. On the other hand, geometrical constraints that depend on the application reduce the DoF. In this study, we assumed a client module set on a cart, as shown in Fig. 3(a). In this case, the UMap deals with a 3-DoF posing problem: predicting values for the x-axis, y-axis, and θ angle (horizontal angle) while the z-axis, vertical angle, and roll angle are constant.

Pre-processing to the sensing image for retrieval process on the server side
The server immediately executes a line segment detection process on the uploaded camera image. Any line detection algorithm can be used for this process, such as the Hough transform, the Canny method, or a line segment detector. On the basis of our pilot experiment, we found that the line segment detector was best suited to our system. Examples of sensing images and their line segment images are shown in Fig. 3(b). Each line segment image is resized to the DB image size before the retrieval process.
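For reference, the following is a minimal Python sketch of this pre-processing step; the function name, default image size, and the choice of OpenCV's LSD implementation (cv2.createLineSegmentDetector, available only in builds that include it) are illustrative assumptions, not the exact server implementation.

import cv2
import numpy as np

def to_line_segment_image(bgr_img, db_size=(320, 180)):
    # Detect line segments in a camera image, draw them on a blank canvas,
    # and resize the result to the DB image size (a sketch, not the actual
    # server code). Assumes an OpenCV build that provides the LSD detector.
    gray = cv2.cvtColor(bgr_img, cv2.COLOR_BGR2GRAY)
    lsd = cv2.createLineSegmentDetector()
    lines, _, _, _ = lsd.detect(gray)
    canvas = np.zeros_like(gray)
    if lines is not None:
        for x1, y1, x2, y2 in lines.reshape(-1, 4):
            cv2.line(canvas, (int(x1), int(y1)), (int(x2), int(y2)), 255, 1)
    return cv2.resize(canvas, db_size, interpolation=cv2.INTER_NEAREST)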

3D CAD models for prior map
We built 3D CAD models of actual buildings on the basis of their 2D design drawings and manual measurements. During the construction of the CAD models, we noted that there are unignorable differences between the 2D design drawings and the actual buildings. For example, a door in the actual building does not exist in the drawing, and the end of a wall is shorter than in the drawing. To correct these differences, a manual measurement process was used. The constructed 3D CAD models are shown in Fig. 4.

DB of 2D images with correct position information
For query image localization, the important features of the 3D CAD models are the structural boundaries of the building: between wall and wall, ceiling and wall, floor and wall, door and wall, and window and wall. These boundaries are projected and drawn as line segments in an image when the viewpoint and view direction are given together with several camera projection parameters. The viewpoint and view direction represent the 6-DoF position in the 3D-map coordinate system. Therefore, the problem of query image localization is converted into the problem of searching for a reasonable viewpoint and view direction for projecting a line segment image that is similar to the query image. Moreover, the problem of searching for a viewpoint and view direction is the same as retrieving the most similar image from an image DB consisting of many drawn images with various viewpoints and view directions. We developed a DBIG for the efficient drawing of line segment images with specified camera parameters. The DBIG is an application executed on Windows and based on the OpenGL library. Table 1 shows the camera parameters for drawing an image. When these parameters are given, the DBIG can draw an image as if a camera had taken a picture in the 3D wireframe map. Moreover, the DBIG accepts ranges of the axes and grid intervals to automatically generate images for the DB. The parameters required for this process are described in Table 2. For example, in the case of 0 ≤ x, y ≤ 10, 0 ≤ θ ≤ 360, dx = dy = 0.1, z = 1.2, dθ = 10, and ϕ = ψ = 0, 360000 images (100 × 100 grid positions × 36 angles) are generated. A DBIG scene that displays the camera positions for the DB and a sample picture are shown in Fig. 5.
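As an illustration of the batch-generation step, the Python sketch below enumerates camera poses on such a grid; the function and parameter names are hypothetical, the upper bounds are excluded to match the 360000-image count above, and the OpenGL rendering itself is omitted.

import itertools
import numpy as np

def enumerate_db_poses(x_range=(0.0, 10.0), y_range=(0.0, 10.0), dx=0.1, dy=0.1,
                       theta_range=(0.0, 360.0), dtheta=10.0, z=1.2, phi=0.0, psi=0.0):
    # Yield (x, y, z, theta, phi, psi) camera poses on a regular grid.
    xs = np.arange(x_range[0], x_range[1], dx)
    ys = np.arange(y_range[0], y_range[1], dy)
    thetas = np.arange(theta_range[0], theta_range[1], dtheta)
    for x, y, th in itertools.product(xs, ys, thetas):
        yield (float(x), float(y), z, float(th), phi, psi)

# 100 x-values x 100 y-values x 36 angles = 360000 poses
print(sum(1 for _ in enumerate_db_poses()))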

Gradient dilation transform for efficient similarity evaluation
The image retrieval process is conducted as a similarity evaluation. In our previous method, (12) the similarity between two line segment images was equal to the total number of surviving pixels after pixelwise logical conjunction. However, this method is too sensitive to position gaps. When the shooting position of the query image moves slightly (for example, by 1.0 cm), the line segments in the image move by more than 1 pixel. Thus, even if the camera movement is small, line segments that overlapped in the two images no longer overlap after the movement. This phenomenon can make the DB image whose shooting position is geometrically closest to the query image score a lower similarity than other DB images.
To solve the above problem of hypersensitivity to viewpoint shifts, a blur process is used. We first applied the distance transform as the blur process and confirmed that distance-transformed images are robust to viewpoint shifts. (25) However, the distance transform approach has difficulties in limiting the dilation width and in designing an arbitrary gradient. Therefore, we developed a new blur process named the gradient dilation transform. We assume a line segment image drawn in gray scale, where the intensity of a pixel on a line segment is 1 and that of a background pixel is 0. Let p_{i,j}^{DB} denote the intensity of a target pixel (i, j) and q the distance from the nearest pixel on a line segment. Let q_w denote the width of dilation and p_limit the lower limit of intensity. Then, the intensity of each pixel in the transformed image is given by Eq. (1). Examples of gradient dilation images are shown in Fig. 6.
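Since Eq. (1) is not reproduced here, the following Python sketch shows one plausible instantiation of the transform under the stated definitions: the intensity is assumed to fall off linearly from 1 to p_limit over the dilation width q_w and to be 0 beyond it (the exact gradient used in the paper may differ, and the default p_limit value is an assumption).

import numpy as np
from scipy.ndimage import distance_transform_edt

def gradient_dilation(line_img, q_w=10, p_limit=0.2):
    # line_img: 2D array with 1 on line segments and 0 on the background.
    # q is the distance of each pixel from the nearest line segment pixel.
    line_img = np.asarray(line_img)
    q = distance_transform_edt(line_img == 0)
    # Assumed linear falloff from 1 (on the line) to p_limit (at distance q_w).
    out = 1.0 - (1.0 - p_limit) * (q / q_w)
    out[q > q_w] = 0.0  # pixels farther than q_w remain background
    return out

With q_w = 10, a one-pixel-wide line becomes a roughly 21-pixel-wide band whose intensity decreases toward its edges, which is what allows slightly shifted lines to keep overlapping during matching.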

Pixelwise-AND-based similarity evaluation
The similarity evaluation between a query image and a DB image is conducted by a pixelwise AND calculation. Let p_{i,j}^{query}, p_{i,j}^{DBk}, and p_{i,j}^{AND} denote the intensities of a target pixel (i, j) in the query image, the k-th DB image, and the resulting AND image, respectively; then, the resulting pixels are given by Eq. (2). Figure 7(a) shows an overlay image of a query image, a k-th DB image, and the resulting AND image. The pixels in the overlay image can be classified into six classes, A, B, C, D, E, and F, as shown in Fig. 7(b). Let num(X) represent the total number of member pixels in class X and int(X) the sum of the intensities of member pixels in class X. Then, we define the similarity s_k between the query image and the k-th DB image by Eq. (3). The image in the DB that has the maximum similarity is regarded as the best-matched image, and its position is adopted as the prediction result of the positioning system.
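The class-based similarity of Eq. (3) is not reproduced here; as a simplified stand-in, the Python sketch below realizes the pixelwise AND as an elementwise minimum of the two gray-scale images and scores a DB image by the total intensity that survives, then retrieves the best match. This is an assumption for illustration, not the paper's exact definition.

import numpy as np

def pixelwise_and(query, db_img):
    # Elementwise minimum as a gray-scale analogue of the pixelwise AND.
    return np.minimum(query, db_img)

def similarity(query, db_img):
    # Simplified score: sum of intensities surviving the pixelwise AND.
    return float(pixelwise_and(query, db_img).sum())

def retrieve_best_match(query, db_images):
    # db_images: array of shape (K, H, W); returns the index k of the
    # best-matched DB image, whose stored pose is the prediction result.
    scores = np.minimum(db_images, query[None, :, :]).sum(axis=(1, 2))
    return int(np.argmax(scores))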

Experimental environment and conditions
We conducted two types of experiments. In experiment I, we investigated the maximum performance of our method assuming sufficient computational resources. The validity of the similarity index was evaluated in this experiment. In experiment II, we investigated the practical performance. Every experiment was conducted in the environment of the 5th floor of the O-building of Aoyama Gakuin University (AGU) [Fig. 4(a)]. We used a smartphone device (Lenovo Phab 2 Pro) as the client module in both experiments. The camera parameters are α = 74.6 degrees, W_img = 320 pixels, and H_img = 180 pixels.
The performance of the positioning method is basically measured by the error distance between the predicted position and the ground truth. Since our method can predict the view direction as well as the view position, we extended the error distance to an error norm. Let q and b_k denote the coordinate value vectors (x, y, z, θ, ϕ, ψ) of the query and the k-th DB image, respectively. Then, the error norm e_k is defined as

e_k = \lVert \mathbf{q} - \mathbf{b}_k \rVert,

where the unit of x, y, and z is meters and the unit of θ, ϕ, and ψ is radians. Note that we presume that an error of 0.1 rad (≈ 5.7 degrees) equals an error of 0.1 m.
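A small Python sketch of this error norm follows; the pose tuple ordering and the wrapping of angular differences to [-π, π] are assumptions for illustration.

import numpy as np

def error_norm(q_pose, b_pose):
    # q_pose, b_pose: (x, y, z, theta, phi, psi), positions in meters and
    # angles in radians, so 0.1 rad is weighted the same as 0.1 m.
    d = np.asarray(q_pose, dtype=float) - np.asarray(b_pose, dtype=float)
    d[3:] = (d[3:] + np.pi) % (2.0 * np.pi) - np.pi  # wrap angle differences
    return float(np.linalg.norm(d))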
(1) Experiment I: To investigate in detail the relationship among the DB parameters, the similarity index, and the error norm, we limited the target area to 4 m² and conducted experiments with scrupulous configurations. We prepared 200 types of DB in total with various grid intervals, angle intervals, and gradient dilation widths. We also randomly prepared 10 query images. These query images were virtual sensing images. Then, the position of each query image was predicted by our proposed positioning method. The positions of the virtual cameras for the virtual sensing images and several arrangements of virtual cameras for the DB images are shown in Fig. 8.
(2) Experiment II: We used a standard PC (Core i7, 3.7 GHz, 32 GB RAM) with a graphic board (GTX 1080 Ti, 11 GB VRAM) as a server machine covering the target area. To investigate the environmental characteristics, two areas were selected as experimental fields. Area A was a long corridor with an area of 70 m² that includes highly symmetric and repetitive elements. Area B was a T-junction of the corridor that appears asymmetric. With the size of a DB image assumed to be 320 × 180, the maximum number of DB images that can be loaded onto the graphic board at one time was 100000. This limitation was due to the VRAM size of 11 GB and our implementation method. We prepared 6 types of DB for each area. The details are described in Table 3. For this experiment, 250 query images were prepared randomly for each area. The location of each area and the virtual camera positions for the virtual query images are shown in Fig. 9.

Results and discussion
(1) Experiment I: Figure 10 shows (a) a typical overlay image and (b) a graph of the similarity and the error norm over all DB images when the position of a query image is predicted. It is confirmed that the predicted DB image, k = 595, with the maximum similarity is very close to the correct answer, k = 585, whose error is minimum. Although the predicted image is not the optimum, the similarity curve appears smooth and unimodal, and it is expected that the method can predict a value close to the optimum. In the other cases as well, the similarity graph tends to be unimodal. This finding supports the validity of the similarity index defined in Eq. (3). Table 4 shows the aggregate results chosen from all prediction experiments with every prepared query and DB. In the most accurate case, where dx = dy = 0.1, dθ = 1.0, and q_w = 10 pixels, the average error norm is 0.075 m.
(2) Experiment II: Figures 11(a) and 11(b) show graphs of the cumulative frequency of the prediction results. In the case of area A, as expected, the prediction accuracy is not very high. Because of the symmetric appearance and many repetitive elements, the similarity of a DB image at an incorrect position occasionally becomes higher than that of the correct DB image.
Although this problem is difficult to solve using only visual information, the consistency of movement or another sensing modality, such as Wi-Fi signals, could lead to a solution. In the case of area B, it is confirmed that the parameter settings of the DB have a considerable effect on the prediction accuracy. To improve the accuracy, the grid and angle intervals should be small, and the gradient dilation width should be set to 10 rather than 0. In the best case, where dx = 0.4, dθ = 2, and q_w = 10, the rate of images predicted with an error under 0.5 m is around 80%. One of the reasons why some images are matched with the wrong DB image is the mis-shooting of query images. Since the virtual sensing images are generated randomly, some images are not suited to positioning. For example, if the image is taken very close to the front of a wall, the image tends to become all white without any line segments. Indeed, the query image dataset includes some all-white or comparable images. In practice, multiple sensing can alleviate this problem. The round-trip time from when the client uploads an image to when the client receives the predicted position information was below 1.0 s in all experiments. We confirmed that the system can be used in practical applications, owing to the use of the graphic board.

Conclusions
We proposed a cloud-based positioning system using a 3D wireframe model as a map. To localize the query image taken by the client in the 3D map, we developed a 2D image generator. This generator reads the 3D wireframe model and efficiently outputs arbitrary viewpoint images consisting of line segments to organize an image DB. The positioning problem is converted into an image-retrieval problem, that is, finding the most similar image to the query image in the DB. To enhance the image similarity evaluation process, we developed a new image blur method named the gradient dilation transform, which blurs line segments while allowing detailed tuning of the dilation width and gradient. We also developed a method of evaluating the similarity between two line segment images on the basis of the pixelwise AND. This process can be implemented on a graphics processing unit for parallel computation.
We conducted two types of experiments and confirmed that the smallest average error is 0.075 m in an ideal setting. In the case of an asymmetric T-junction of the corridor, 80% of the query images were predicted with an error of less than 0.50 m, and the round-trip time was below 1.0 s in all experiments.
One of the topics for future study is evaluation with real sensing queries. We confirmed in a pilot experiment that real sensing data can be predicted correctly, like the virtual sensing queries. We will conduct this experiment as soon as the real sensing dataset is ready.