Household Goods Recognition Using Hierarchical Multi-object Segmentation

1 College of Mathematics and Information Engineering, Longyan University, Fujian 364012, China
2 Department of Computer Science and Information Engineering, National Ilan University, Ilan County 260, Taiwan
3 Department of Electrical and Computer Engineering, Tamkang University, New Taipei City 251, Taiwan
4 Department of Chemical and Materials Engineering, National University of Kaohsiung, Kaohsiung 811, Taiwan
5 Department of Aeronautical Engineering, Chaoyang University of Technology, Taichung 413, Taiwan


Introduction
With the increasing maturity of robot technology, robots can be applied in a wider range of fields. The first generation of robots were industrial robots, the second generation had sensing technology, and the third generation of smart robots are smaller and are integrated with computers. Smart robots that can carry out services (service-type robots) are key development projects in developed countries, and robots are being developed for home care, security, environment cleaning, interactive learning, medical care, and fire prevention and rescue. Recently, the requirements and applications of service-type robots have become increasingly important. The key technologies being investigated and developed for service-type robots include mechanical drive technology, environmental perception technology, smart control software, robot vision processing, and embedded systems. For robots to be used in a home environment, the main research areas include improving the ability of robots to interact, communicate, and cooperate with humans so that robots can adapt to home environments, understand the tasks that people want them to do, and complete those tasks rapidly and safely.
Home robots are no longer limited to futuristic TV shows or movies: robot companions, personalized assistants, and home management aids have been steadily improving since Roomba first hit store shelves in 2002. Liu et al. reviewed the evolution of robotic research and development over the past 50 years, and they defined home service robots in terms of three major categories: robot manipulators, mobile robots, and biologically inspired robots. (1) Zachiotis et al. provided a thorough overview of state-of-the-art (SOA) solutions available in home service robotics, and their detailed analysis of consumer-oriented robots suggested future demand for robots in the areas of entertainment, education, social purposes, gaming, and households. They also found that research-oriented robots were focused on the purposes of entertainment, development platforms, security, and household/rehabilitation. (2) Because of the rapidly growing population of elderly people, the need for healthcare is on the rise. Ramoly et al. investigated a framework for service robots in smart homes. This framework combined robotics and smart environments, and it provided a promising solution for monitoring in which a robot interacts with and provides companionship to users. They found that sensor data are not perfect in real scenarios because the environment changes over time, and they tackled these problems to improve the autonomy and efficiency of robots in smart environments. (3) Owing to the rapid progress of the robot industry and Taiwan's aging society, there has been increased interest in employing robots to perform some healthcare and domestic tasks in the home. Many household tasks can be performed by robots, which makes machine vision very important: it allows a robot to analyze input images and make judgments about its surroundings. In the home environment, the recognition of household goods is therefore very important.
In this study, we used the home environment as the main axis of technological development, which could help housekeeping robots to identify items correctly in the home. Nowadays, most object identification algorithms depend on a constructed database or on training and learning processes using many objects. When household items are not contained in the database, robots need off-line algorithms and manually constructed models of these items, which makes it more difficult for robots to identify objects in the environment.
The theory of graph cuts was first used as an optimization method in the computer vision field, and GrabCut is an object segmentation algorithm based on graph cuts. GrabCut starts with a user-specified bounding box around the objects to be segmented, then estimates the color distributions of the target objects and the background using Gaussian mixture modeling (GMM). Basavaprasad and Ravindra investigated an improved GrabCut algorithm for object segmentation that combined the technologies of statistics and graph cuts, and their algorithm accomplished detailed object segmentation with a suitable input. (4) Kang et al. proposed an object segmentation method based on an improved non-interactive GrabCut algorithm, in which they used bilateral filtering to preserve edges and reduce noise. (5) In this study, we propose an algorithm that combines a depth image, household goods segmentation, and model construction with GrabCut, and we use a hierarchical design for the segmentation of items. The proposed algorithm uses hierarchical multi-object segmentation technology to sense household goods and recognize them for further applications, and the recognition results can be used in different fields, for example, robot applications, virtual reality, automated tracking systems, and 3D movies. Thus, this work can also be applied to optical sensors and imagers. The depth image is used to find the approximate locations and sizes of multiple objects in the coarse layer, then GrabCut is used as the fine segmentation algorithm to extract the edges of the objects. This means that the proposed algorithm can be used to segment multiple objects and construct models. Our novel object recognition algorithm can automatically construct models for static household items against a non-stationary background.
Also, the information completeness of recognized objects is close to that obtained with manually built models, and when the database is upgraded, the images can be used to achieve an acceptable object recognition rate.

Stereo vision
The application fields of stereo vision are wide; for example, it can be used in robot applications, virtual reality, security monitoring systems, automated tracking systems, 3D human-computer interaction interfaces, and 3D movies. Before stereo vision is applied in these fields, target objects must be sensed and 3D information of the object and the environment must be acquired. A depth image gives us the "depth" or "z" information of objects in the real world, and the intensity values in the image represent the distances of the objects from a viewpoint. Generally speaking, when we acquire a depth image, we obtain the depth information at the same time, with the brightness of the image pixels expressing the parallax; higher pixel values indicate that objects are closer and lower pixel values indicate that objects are farther away, as shown in Fig. 1.
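The inverse relation between disparity (pixel brightness) and distance described above can be sketched in a few lines of Python; the focal length and baseline values here are illustrative assumptions, not parameters of our experimental setup:

```python
def disparity_to_depth(pixel_value, focal_length_px=500.0, baseline_m=0.1):
    """Map a disparity pixel to an approximate metric depth.

    Brighter pixels (larger disparity) correspond to closer objects:
    depth = focal_length * baseline / disparity.
    The focal length and baseline are hypothetical example values.
    """
    if pixel_value <= 0:
        return float("inf")  # zero disparity: no match, or object at infinity
    return focal_length_px * baseline_m / pixel_value
```

A bright pixel (e.g., 250) thus maps to a smaller depth than a dark pixel (e.g., 50), matching the convention shown in Fig. 1.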

Image segmentation
The purpose of object segmentation is to gather pixels of the same type into cluster regions, which represent different surfaces, objects, or parts of an object. There are many object segmentation technologies, including object recognition, mobile object detection, depth images, and template comparison. When an object is segmented, it can be subdivided into its constituent regions or objects, and the degree of subdivision depends on the problem to be solved. This means that once an object of interest has been segmented, the segmentation process should be stopped. An object segmentation algorithm is usually based on the intensity values of two basic characteristics: discontinuity and similarity. The first kind of algorithm uses sudden changes in the image gradient to segment images. The second kind of algorithm segments images into similar regions according to predefined criteria. The critical value method, seeding region growth method, seeding region segmentation method, and image-merging method are all examples of the second kind of algorithm. GrabCut is a 2D image segmentation algorithm used in general applications. (6)(7)(8) Users of GrabCut only need to drag a box over the selected input image to roughly divide it into the foreground and background, as shown in Fig. 2. The main steps of the GrabCut algorithm are listed below:
(1) Users input two or three conditions: the foreground and/or background and the unknown regions. Generally speaking, the image inside a box (region of interest) is marked as the unknown region and the image outside the box is marked as the background.
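As an illustration of the similarity-based family of algorithms mentioned above, the following is a minimal pure-Python sketch of seeded region growing on an intensity grid; the tolerance parameter is a hypothetical "predefined criterion", and real implementations would operate on full-size images:

```python
from collections import deque

def region_grow(image, seed, tol=10):
    """Similarity-based segmentation: grow a region from `seed`,
    absorbing 4-connected pixels whose intensity is within `tol`
    of the seed intensity."""
    h, w = len(image), len(image[0])
    target = image[seed[0]][seed[1]]
    region = {seed}
    queue = deque([seed])
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (0 <= ny < h and 0 <= nx < w and (ny, nx) not in region
                    and abs(image[ny][nx] - target) <= tol):
                region.add((ny, nx))
                queue.append((ny, nx))
    return region
```

On a toy 3 × 3 image with a dark object in the upper-left corner, growing from pixel (0, 0) collects exactly the connected dark pixels and stops at the intensity discontinuity.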

Image matching
Image matching is one of the key technologies used in many applications of computer vision, including object recognition, analysis of 3D internal modeling, stereo matching, and motion tracking. The scale-invariant feature transform (SIFT) can clearly describe the feature points of an image and describe images and objects in various situations. (9)(10)(11) The feature points are invariant to image scale, image rotation, local brightness changes, and changes in viewpoint. These feature points have good distributions in the spatial and frequency domains, and they decrease the probability of matching failure caused by masking and noise. An effective algorithm can extract a large number of feature points from an image. Because the feature points have a high level of uniqueness, an effective algorithm can provide a large volume of feature points to obtain the correct similarities between objects in the images in a database. The following are the main steps in generating the feature points:
(1) Detection of extrema in scale space.
(2) Confirmation (accurate localization) of the feature points.
(3) Description of the feature points.
By following these steps, the feature points used in SIFT can be obtained.
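Once descriptors exist, a common way to compare them is nearest-neighbor matching with a distance-ratio test. The sketch below is a simplified stand-in for a full SIFT matcher: it uses toy low-dimensional descriptors instead of SIFT's 128-dimensional ones, and the ratio threshold is an illustrative choice:

```python
def match_features(desc_a, desc_b, ratio=0.8):
    """Match two descriptor lists; accept a match only when the nearest
    neighbour is clearly better than the second nearest (ratio test),
    which suppresses ambiguous matches caused by noise or repetition."""
    def dist2(u, v):
        return sum((ui - vi) ** 2 for ui, vi in zip(u, v))

    matches = []
    for i, da in enumerate(desc_a):
        ranked = sorted(range(len(desc_b)), key=lambda j: dist2(da, desc_b[j]))
        best, second = ranked[0], ranked[1]
        # squared-distance ratio test (ratio**2 because dist2 is squared)
        if dist2(da, desc_b[best]) < (ratio ** 2) * dist2(da, desc_b[second]):
            matches.append((i, best))
    return matches
```

With two unambiguous descriptors on each side, both are matched; an ambiguous descriptor whose two nearest neighbours are nearly equidistant would be rejected.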

Hierarchical Multi-object Segmentation Algorithm
The main purpose of this study is to design an algorithm for home environments that can automatically detect and segment multiple home objects in an image at the same time. For this purpose, we will introduce the proposed hierarchical object segmentation algorithm to solve the problem of how to separate stationary home objects under a non-stationary background. In the proposed algorithm, the hierarchical model is divided into a coarse layer and a fine layer, and both are used to construct the image model and complete the segmentation automatically.

Modeling of coarse layer
The coarse layer uses the depth image as the base; it then removes the background and segments the independent objects via hierarchical statistics, as shown in Fig. 3. Through morphological compensation and conditional processing, we filter out unwanted objects or foreground regions and enlarge each object's size range, which benefits the back-end segmentation process by allowing the object boundary to converge inward. (12)(13)(14) The coarse-layer modeling process segments the objects in the home one by one and marks them. The segmentation results are not the real edges of the segmented objects, but they cover the objects' information. This image information is used in the proposed fine-layer segmentation for further edge convergence, in which the edges of objects converge to the real object edges. The segmentation results are close to those obtained manually.
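The small-region filtering and bounding-box enlargement steps of the coarse layer can be sketched as follows; the area threshold, margin, and pixel-set region representation are illustrative assumptions rather than the exact parameters of our implementation:

```python
def coarse_layer_boxes(regions, min_area=4, margin=1, bounds=(0, 0, 100, 100)):
    """Coarse-layer sketch: discard regions that are too small (treated as
    noise) and enlarge each surviving bounding box by `margin` pixels so the
    fine layer can converge inward to the true object edge.

    `regions` is a list of pixel-coordinate sets; `bounds` clamps boxes to
    the image. All parameter values are hypothetical examples.
    """
    y0b, x0b, y1b, x1b = bounds
    boxes = []
    for region in regions:
        if len(region) < min_area:
            continue  # filter out noise / unwanted foreground
        ys = [y for y, _ in region]
        xs = [x for _, x in region]
        boxes.append((max(min(ys) - margin, y0b), max(min(xs) - margin, x0b),
                      min(max(ys) + margin, y1b), min(max(xs) + margin, x1b)))
    return boxes
```

A 2 × 2 region survives and its box is padded outward by one pixel, while an isolated single-pixel region is dropped.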

Modeling of fine layer
In the fine layer, we use GrabCut and incorporate a suitable foreground and background to evaluate the performance of the segmentation process. The coarse layer obtains a suitable model of the foreground, although its contours are not complete, and GrabCut is used for edge convergence on the foreground of interest. Using GrabCut, the pixels of fixed images can be set as the foreground or background, and it is possible to set the object as the foreground. When we set different attributes for the pixels of the fixed image, the weights of the edges between the foreground/background models and each pixel are influenced, which may influence the segmentation results. If the accuracy of foreground and background modeling is improved, the segmentation results will more closely match the optimal manual segmentation results. Figure 4(a) shows an original image before segmentation, Fig. 4(b) shows the result of manual segmentation of selected household goods, and Fig. 4(c) shows the same household goods automatically segmented by the proposed algorithm. As shown in Fig. 4(c), the image obtained by automatic segmentation is similar to the manually segmented image, with an agreement of about 99%.
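Setting pixel attributes as sure background, probable foreground, or unknown before GrabCut runs amounts to building a trimap from the coarse-layer box. The constants and the core-shrinking heuristic below are hypothetical simplifications of the actual fine-layer setup:

```python
# Pixel attribute labels used by GrabCut-style optimizers (illustrative codes).
BG, FG, UNKNOWN = 0, 1, 2

def init_trimap(h, w, box, core_shrink=2):
    """Build an h-by-w trimap from a coarse-layer bounding box `box` =
    (y0, x0, y1, x1): pixels outside the box are sure background, a box
    core shrunk by `core_shrink` pixels is marked probable foreground,
    and the band between them is left unknown for edge convergence."""
    y0, x0, y1, x1 = box
    trimap = [[BG] * w for _ in range(h)]
    for y in range(y0, y1 + 1):
        for x in range(x0, x1 + 1):
            inside_core = (y0 + core_shrink <= y <= y1 - core_shrink and
                           x0 + core_shrink <= x <= x1 - core_shrink)
            trimap[y][x] = FG if inside_core else UNKNOWN
    return trimap
```

Improving this initial foreground/background labeling is exactly what lets the segmentation result approach the optimal manual one, as discussed above.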

Construction of image model
The purpose of this study is to investigate an algorithm for automatic segmentation of the target object and to apply this algorithm to image modeling. The most basic method of image modeling is to save the segmented image and make a comparison; an image-matching process is then used for further recognition and to update or delete duplicate image samples. There are many image-matching processes, including the template-matching method, (15) contour or shape comparison method, histogram comparison method, (16) and feature point matching method. (17,18) However, Mikolajczyk and Schmid have proven that in many object recognition algorithms, feature points constructed by a SIFT-based algorithm are the most stable in the cases of image interference, object rotation, and affine transformation. (19) In an experimental environment, an object is placed randomly, and the distances between different objects and the two cameras are not stable. To overcome this problem, the SIFT algorithm extracts the feature points of a segmented object image and updates them in the database, using the complete image or feature points to replace incomplete ones or adding the object image to the database. A characteristic of the SIFT algorithm is that it has reasonable robustness against changes in scale, rotation, vagueness, brightness, and affine transformation, and the extracted high-dimensionality feature points have improved robustness. Figure 5 shows the result of matching feature points using the SIFT algorithm.
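The update-or-replace policy for the image model database can be sketched as follows; keeping the entry with more feature points is a simplified stand-in for the completeness comparison described above, and the dictionary layout is an illustrative assumption:

```python
def update_model_database(db, name, descriptors):
    """Store the feature-descriptor set for an object model, replacing an
    existing entry only when the new segmentation yields more feature
    points (i.e., a more complete model); otherwise keep the old entry."""
    if name not in db or len(descriptors) > len(db[name]):
        db[name] = descriptors
    return db
```

Repeated observations of the same object thus monotonically improve the stored model rather than overwriting it with a less complete segmentation.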

Experimental Results
The equipment used in our experiment to evaluate the proposed algorithm included a desktop computer equipped with an Intel® Core™ 2 Q9550 (2.83 GHz, 2 GB RAM); the program was compiled with Visual Studio 2008, the operating system was Microsoft Windows XP SP3 (32-bit), the resolution of the image sequence was 640 × 480, and the images were stored in 24-bit RGB format. The proposed algorithm was evaluated in an indoor home environment. Five experimental scenes were constructed, as shown in Fig. 6, which were used to demonstrate the suitability of the proposed algorithm for identifying most household objects.

Correct coverage ratio
In this study, we use whether the foreground is correctly covered to define the correct coverage ratio of the coarse layer. If the foreground is correctly covered, we call this result a true positive (TP), and if the foreground is not correctly covered, we call this result a false negative (FN). Figure 7 shows a schematic diagram of the coverage conditions. From Table 1, we can see that there are several causes of an FN: one is a mismatch in the foreground of a non-object, which causes the recognized area or the amount of noise to be too large. As a result, the mechanism that filters out objects with overly small areas cannot remove the erroneous foreground or noise.
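The correct coverage ratio of the coarse layer follows directly from the TP and FN counts defined above; a minimal sketch:

```python
def correct_coverage_ratio(tp, fn):
    """Coarse-layer coverage: the fraction of foreground objects that are
    correctly covered (TP) out of all foreground objects (TP + FN)."""
    total = tp + fn
    return tp / total if total else 0.0
```

For example, 9 correctly covered objects out of 10 gives a coverage ratio of 0.9.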

Coverage accuracy
The above demonstration shows that an image processed by the coarse layer can achieve high coverage factors for most home objects. In this study, GrabCut is used in the fine layer to further process the image, for example, to delete incorrect image information and retain correct image information. Table 2 shows the definitions of TP, FN, false positive (FP), and true negative (TN) in this study. TP represents a pixel of a real foreground object that is output as a foreground pixel after segmentation. FN represents a pixel of a real foreground object that is mistakenly output as a background pixel after segmentation. FP represents a pixel of the real background that is mistakenly output as a foreground pixel after segmentation. TN represents a pixel of the real background that is output as a background pixel after segmentation. The accuracy rate is defined as

Accuracy rate = (TP + TN)/(TP + FN + FP + TN). (1)

In this study, we use a manually segmented image as the standard, and through observation, we find that the accuracy rate of automatically segmented images is up to 99% when the manually segmented image is considered the correct output. The accuracy rate in Eq. (1) is obtained by comparing a manually segmented image with an automatically segmented image. Figure 8 shows schematic diagrams of TP and FN + FP. Table 3 gives the average accuracy rate for the five scenes shown in Fig. 6, and the segmentation result of scene 1 is shown in Fig. 9. Comparing the results of automatic and manual segmentation, we found that the feature points generated by the SIFT algorithm confirmed that the segmented images contain sufficiently correct information and sufficient robustness for object recognition applications. The recognition in this study depends on the algorithm proposed by Silva. (14) When the data of a test object is input, the segmented object from the coarse layer is compared with the data saved in the database; more than three feature points (recognized as having a sufficiently close geometric relationship) must match for the object to be recognized as the same object. Otherwise, the segmented object is recognized as an object different from those in the database.
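The per-pixel accuracy comparison against a manually segmented reference, together with the more-than-three-matches recognition rule described above, can be sketched as follows (the binary-mask representation is an illustrative assumption):

```python
def pixel_accuracy(manual, auto):
    """Eq. (1): accuracy of the automatic mask against the manually
    segmented reference, (TP + TN) / (TP + TN + FP + FN), where masks
    are 2D grids of 0 (background) and 1 (foreground)."""
    tp = tn = fp = fn = 0
    for row_m, row_a in zip(manual, auto):
        for m, a in zip(row_m, row_a):
            if m and a:
                tp += 1          # foreground kept as foreground
            elif m and not a:
                fn += 1          # foreground lost to background
            elif not m and a:
                fp += 1          # background mistaken for foreground
            else:
                tn += 1          # background kept as background
    return (tp + tn) / (tp + tn + fp + fn)

def same_object(matched_points, threshold=3):
    """Recognition rule from the text: accept a database object when more
    than `threshold` geometrically consistent feature matches are found."""
    return matched_points > threshold
```

On a toy 2 × 2 mask pair with one FN pixel, the accuracy is 3/4; four consistent matches pass the recognition rule while three do not.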

Conclusion
In this study, we proposed a segmentation algorithm that is suitable for the recognition of most household objects and the construction of their models. In the proposed hierarchical model, the coarse layer roughly segments the foreground objects from the background, and the GrabCut algorithm is then used in the fine layer to drastically reduce the amount of segmentation information and segment the correct object. The proposed algorithm can segment objects in an environment with a suitable distance and appropriate size, and the segmented object information can be used in the modeling of back-end images. In experiments, the average accuracy rate of the proposed algorithm reached 93.05%. Compared with manual object segmentation combined with the SIFT algorithm, the proposed algorithm has a lower recognition accuracy rate, but it has the advantage of automated recognition, which greatly reduces the recognition time and increases the range of applications of robots.