A JOINT EFFORT OF SPEEDED-UP ROBUST FEATURES ALGORITHM AND A DISPARITY-BASED MODEL FOR 3D INDOOR MAPPING USING RGB-D DATA

: In this paper


Introduction
3D mapping of indoor environments is an important task in many engineering applications, such as, simultaneous localization and mapping systems (SLAM), surveillance and emergency managements, navigation, positioning, robotics, forensics, virtual tours, crisis management, modeling, infrastructure inspections, urban design and others.Basically, the most important existing 3D mapping of indoor environments method using RGB-D data is based on three main steps (Henry et al, 2012): 1) pairwise registration; 2) loop closure detection; 3) global optimization.The pairwise registration task aims to estimate the transformation parameters between pairs of point clouds.The most popular approach to registering point clouds is the iterative closest point (ICP) algorithm (Besl and McKay, 1992).Pairwise registration task is prone to error due to the random error of individual points, leading to incorrectness in the 3D map, as described by Khoshelham et al. (2013).To handle with registration errors the loop closure detection should be used (Du et al., 2011).Loop closure corresponds to the global adjustemnt for simultaneous refinement of all the transformation parameters in a sequence.Once transformation parameters from registration task give us some constraints (poses between point clouds, adjacents or not), we can represent these constraints using a graph structure.Basically, there are three main graph optimization approaches proposed in the literature: tree-based network optimizer (TORO), idealized by Grisetti et al. (2007); general framework for graph-based optimization (G2O), proposed by Kümmerle et al. (2011); and sparse bundle adjustment (SBA), proposed by Lourakis and Argyros (2009).In this paper, the graph optimization of the complete data sequence is performed using the graph-based optimization to minimize the registration errors.
Nowadays, RGB-D sensors are quite useful solution to build colored 3D indoor maps because it can exploit both the visual and the depth information.Its advantages compared with LASER scanning sensors are the lightweight, the low cost and it is much more flexible (Dos Santos et al., 2016).In this paper we propose a method for 3D indoor mapping using RGB-D data.The contribution of our proposed method is two-fold.First, we propose a joint effort of speed-up Bulletin of Geodetic Sciences, 24(3): 351-366, Jul-Sept, 2018 robust features (Bay et al., 2008) and a disparity-based model (Dos Santos et al., 2016) to include additional constraints in the graph describing the coarse-to-fine registration between RGB-D data.Second, we investigate the variance-covariance matrix that describes the uncertainty of the transformation parameters (3D rotation and 3D translation) to weight the graph-based optimization.
The paper proceeds with the related work in the Section 2. In Section 3, the proposed method for 3D indoor mapping is described.Experimental evaluation of the method is presented in Section 4. The paper concludes with a discussion in Section 5.

Related Work
Basically, 3D indoor mapping method is based on two key problems: pairwise registration and global registration with loop closure.This approach uses salient feature surfaces, such as: points; planes; lines or freeform surfaces.
The method requires an accurate initial transformation and high overlap area between the point clouds and in the most situations it is time-consuming.

Combined depth and intensity values
This method integrates both the depth and the intensity information in 3D registration.
Low resolution of intensity images delivered by both LASER scanning.Range devices provide inaccurate feature location and it is highly time-consuming.

Global registration with loop closure
Loop closure ensures that new constraints can be inserted in the graph optimization.The global registration minimizes the registration errors.
In most cases the uncertainty of the transformation parameters is not employed to weight the optimization process.Usually, the refinement of the pairwise registration not always held.
We will cover the main related work about the techniques in our study by grouping them into three main different categories: (1) point cloud registration: One of the most popular method is the ICP (iterative closest point) algorithm for registration of 3D shapes developed by Besl and McKay (1992).According Henry et al. (2012) "ICP has been shown to be effective when the two point clouds are already nearly aligned".(2) Combined depth and intensity values: There are many combined depth and intensity extraction algorithms proposed in recent years.Al-Manasir and Fraser (2006) presented a method that uses bundle adjustment of images taken by a digital camera to register the LASER point cloud data rather than ICP algorithm.Another interesting method was given in Barnea and Filin (2008) which suggested the integration of both the depth values and the intensity information in pairwise registration process.(3) Global registration with loop closure: Loop closure is a technique used to recognize when the sensor has returned to a previously visited location.It is useful to insert constraints into graph optimization task.Henry et al. (2012) used a subset of the registered frames to keep the graph relatively sparse.The loop closure is detected based on the number of visual features stablished.In other words, a loop closure is detected if sufficient amount of correspondences is correctly determined.This is done, by means of a joint effort between RGB-D data and the ICP algorithm.Once loop closure is detected, the tree-based network optimizer is activated.In Chow et al. (2014) the ICP algorithm is carried out in a model-to-scene fashion to handle the loop closure problem.The authors used the sparse bundle adjustment to obtain a consistent 3D indoor map.A summarized description of the advantages and disadvantages of the before mentioned methods is presented in Table 1.Note that our objective is both the pairwise registration and the global registration tasks.We first propose a coarse-to-fine registration task using a joint optimization algorithm combining visual features to carry out a coarse registration and a disparity-based model to realize the fine registration step, instead of only apply a coarse registration step as done by Henry et al. (2012).Second, we use the variance-covariance matrix that describes the uncertainty of the transformation parameters to weight the graph-based optimization task.In our method, the variance-covariance matrix is obtained by minimizing disparity-to-plane distances using iterative least-squares estimator, instead of accept it as identity matrix or using the Cholesky solver such as introduced by Kümmerle et al. (2011).

Method
This section explains the proposed method for 3D mapping of indoor environments using RGB-D data.In particular it combines a speed-up robust features algorithm with a disparity-to-plane model to find refined transformation parameters.Then, new constraints are inserted using the loop closure procedure and the registration errors are minimized using a graph-based optimizer.The result of this framework is a globaly consistent 3D indoor map.The Figure 1 presents the generic structure of our proposed method.Figure 1 shows the proposed method for 3D mapping of indoor environments using RGB-D data.Our method is divided in three main parts: 1) pairwise registration; 2) loop closure detection; 3) global registration.Given RGB-D data is obtained with Kinect sensor.The before mentioned tasks are described, as follows.

Pairwise registration
In this work, pairwise registration problem aims to estimate the transformation parameters between pairs of RGB-D frames.As before mentioned, a coarse-to-fine registration task using a joint optimization algorithm combining visual features to compute initial approximation of the transformation parameters (3D rotation and 3D translation) and a disparity-based model to realize the fine registration step is proposed.Herein, the main idea is performing the pairwise registration using RGB-D data.This task is separated in six steps, as shows Figure 1.First, keypoints (visual features) should be detected and their correspondences obtained.The scale invariant feature transform (SIFT) algorithm, developed by Lowe (2004) is one of the most used feature point detection and feature descriptor algorithm.However, is relatively time consuming in terms of reliability and robustness (Wu et al., 2013).It is divided in three stages: a) keypoints extraction and detection; b) description; and 3) matching.The general framework of SURF algorithm can be found in Bay et al. (2008).Basically, the focus is producing a feature descriptor that allows quick and highly discriminatory assessments with other features.Second, the outliers are detected using the fundamental matrix () and RANSAC algorithm.The RANSAC computes  using the selected the normalized eight-point algorithm (Hartley and Zisserman, 2003).Then, the fitness for all points putatively matched is computed.If the fitness is better than initial , replace  with fitness and the number of random trials () is updated for every iteration in the RANSAC algorithm (Fischler and Bolles, 1981).The results it is a set of matched keypoints.Third, in order to associate the keypoints with their corresponding depth values we used the method proposed by Dos Santos et al. (2016).Firstly, depth values are computed using the formulation introduced in Khoshelham and Elberink (2012), as follows: (1) intercept of the line, respectively and  the denormalized disparity value.After, the correct associated 3D point is search using the epipolar geometry concept.Assuming that the shift of the depth values is sufficient to align the depth image with the RGB frame, as described in Henry et al. (2012),   and   coordinates using RGB-D data are computed as follows (Khoshelham and Elberink, 2012): where   ,   denotes coordinates for each point on surface,  0 ,  0 represents the principal point coordinates,   ,   denotes the lens distortion,    ,    are the 2D coordinates for each pixel in current infrared image (IR image) and  denotes the calibrated focal distance of infrared camera (IR camera).The before mentioned task should be realized for each pair of RGB-D frame.The result is a set of corresponding 3D points.
Fourth, to estimate the initial approximation of the transformation parameters for the fine registration task we propose a coarse point-based alignment procedure.Then, since exists a set of three or more corresponding 3D points, a 3D rigid transformation is used to carried out the coarse registration task.The single-cost function that represents the 3D rigid transformation, which minimize the point-to-point error () is expressed as: where  , ,  , denotes the 3D coordinates of point  in reference RGB-D frame () and a search RGB-D frame (), respectively,  denotes the number of corresponding 3D points used for each pair of RGB-D frame,  denotes the 3D rotation matrix and  is the 3D translation vector.The least-squares solution for  and  can be obtained:  ̂= (  ) −1  T , where A represents the Jacobian of the equation with respect to the observations, w is obtained by the observation and the approximate values of the unknowns.This arrangement gives to the system an initial approximation for the fine registration task.In the fifth step of the proposed pairwise registration method, we segmented pairs of RGB images to associate the centroid of the segmented regions to the depth values extracted from Depth images.Figure 2 illustrates the main steps to associate the centroid of the segmented regions in the RGB images to the depth values extracted from Depth images.According Khoshelham et al. (2013) the Kinect sensor captures Depth and colour (or RGB) images at a rate of 20~30 frames per second, which can be combined into a coloured point cloud, also referred to as RGB-D data (or frame).In Figure 2, the regions of interest (ROI) are segmented in a pair of RGB images using the simple linear iterative clustering (SLIC) algorithm proposed by Achanta et al. (2012).Basically, SLIC performs a local clustering of pixels in 5-D space defined by values of the nonlinear transformation of RGB frame.It samples  regularly spaced cluster centers and moves them to seed locations corresponding to the lowest gradient position in a 3×3 neighbourhood.Each pixel in the image is associated with the nearest cluster centroid whose search area overlaps this pixel.For all the pixels associated with the nearest cluster centroid, a new centroid is computed as the average vector of all the pixels belonging to the cluster.At the end of this process, a few stray labels may remain.For each pair of RGB images the centroids ′(  ,   )  and ′(  ,   )  are computed by means of the average position of all the point in the segmented ROI for matching process.Thus, the correspondences are obtained using the approximate nearest neighbours (ANN) algorithm proposed by Muja and Lowe (2009).After, the 3D coordinates   = [      ]  for each segmented ROI is computed using Equations ( 1)-(3).Third, the segmented ROI from reference Depth frame () are fitted to obtain the unit normal vector ().Once the plane is represented as a 3D point   and  (normal vector), and the distance from a point   ∈ ROI to the plane is defined as   = (  −   ).The solution for  is given by analysing the eigenvalues and eigenvectors of the matrix  ∈ ℝ 33 of ROI, as follows: For  we determine its eigenvalues λ  ∈ ℝ and their corresponding eigenvectors   .For 0 ≤ λ 0 ≤ λ 1 ≤ λ 2 , the eigenvector  0 corresponding to the eigenvalue λ 0 that represents an approximation of + = {  ,   ,   } or -.Then, the centroid  = [         ] T is obtained by means of the average position of all the points in each corresponding segmented ROI from search Depth frame ().Finally,  is associated to the disparity value extracted from their corresponding depth value in the Depth search image.Note that the red surfaces denote segmented ROIs in the RGB images, green points denote the computed centroid from ROI in the RGB images and blue points represent the computed centroid from ROI in the Depth frames, which 3D points are determined using Equations ( 1)-(3).The depth values are extracted from Depth images and there is a centroid associated to them.Finally, we refine the transformation parameters (, ) using the disparity-to-plane model introduced by Dos Santos et al. (2016).Let   be the set of centroids from segmented ROI in the reference Depth frame and   the set of centroids from segmented ROI in the search Depth frame, the 3D rigid transformation () between them consists of a 3D rotation and a 3D translation (transformation parameters).Conveniently, these transformation parameters are combined in a transformation matrix  of homogenous coordinates: where  * and  * denotes the refined 3D rotation and 3D translation, respectively, and ′ = [         1]  is the homogeneous representation of the centroid in 3D space.
For a set of centroids that lie on a segmented ROI, the disparity-to-plane model used to refine  and  can be written as: where  = ( T , −) T is the homogeneous representation of the plane defined by a normal vector  unit length and its perpendicular distance  from origin.Substituting Equation ( 6) in ( 7) give us: where   ,   ,   denote the unit normal vector.
Substituting equations ( 1)-( 3) in ( 8), the disparity-to-plane model can be expressed as (Dos where  11 …  33 represent the elements of  * and   ,   ,   are the elements of  * .For a set of three or more non-parallel segmented ROI corresponding to the centroid ′ previously established, the solution for  * and  * can be obtained by means of least-squares criteria:  ̂= (   − ) −1  T  − , where  =   , A represents the Jacobian of the condition equation with respect to the unknown parameters, B represents the Jacobian of the condition equation Geodetic Sciences, 24(3): 351-366, Jul-Sept, 2018 with respect to the observations, w is obtained by evaluating the condition equation with the observation and the approximate values of the unknowns.Note that  and  are used as initial approximation to estimate  * and  * into fine registration task.Once the coarse-to-fine registration step accumulates errors, loop closure task should be actived.This task corresponds to the step seven of the proposed method, as shows Figure 1.

Loop closure detection
This task is formalized with the help of a graph-based optimizer.A graph is an ordered pair (, ) comprising a set  of nodes together with a set  of edges.In this work, nodes are parameterized by 3D rotation and 3D translation components, which corresponds to the refined transformation parameters   (  * ,   * ) , for all  ∈ [1, ] , and edges represents constraints between the nodes.Loop closure ensures that new constraints can be inserted in the graph optimization.As described, loop closure corresponds to the global adjustemnt for simultaneous refinement of all the transformation parameters in a sequence.In this work, we used the loop closure method proposed by Hogman (2012).According Hogman ( 2012), the loop closure aims to detect revisited locations checking if the key frames matches with some of the previous one.Then, a transformation can be estimated between the current frame and key frame.Consequently, a new constraint can be inserted from it.Finally, the cumulated error of (, ) can be minimized.However, if the time required for check the similarities between the key frame and current frame is not fast, the loop closure detection task can be highly time consuming.In order to improve the computational loop closure detection cost it was used a slide windows strategy proposed by Hogman (2012).In Figure 3a, the detection of the loop closure is done by comparing the most recent frames ( 5 ) with the previous four frames ( 0 , … ,  3 ) belonging to the slide window, where the last frame ( 4 ) is ignored to improve the computational cost.The slide window is defined with a fixed size of four frames in the past.In other words, for the current frame  5 , the frames  0 −  3 are checked.Then, the loop closure is inserted between  5 and the old frame with greater similarity.The similarity value is determined by a probability density function, as described in Hogman (2012).In this case, an edge () should be added to the graph.Thus, the graph is optimized using G2O graph-based method.For the next frame  6 , the window slides one step forward and the frames  1 −  4 are checked, as illustrated in Figure 3b.These tasks are successively executed until no more key frames exist, as depicted in Figure 3c.Then, whole pose graph can be optimized using a pose graph network optimizer as proposed by Grisseti et al. (2007).The graph to be optimized is written as: where    represents the nodes formed by consecutive transformations and  −1  denotes the edges created by relative transformations between the nodes.

Graph optimization
In this work, G2O graph-based optimizer is used for global adjustemnt of all the transformation parameters in a sequence.In G2O method, the directed edges are added in  based on number of correct corresponding points and   (  * ,   * ), such as described in Grisetti et al. (2007).To verify the consistency of the optimization problem an analyse of error function (  ,   ) should be executed, once an edge is characterized for both (  ,   ) and its variance-covariance matrix.According Khoshelham and Elberink (2012) the accuracy of depth image is not stable and also depends on the structure of the environment.For example, on some materials the reflection affects the measurement depth.Thus, is essential includes a metric to ensure that the constraint to be comprised in the loop closure detection task represents the real uncertainty of the transformation parameters.In Kümmerle et al. (2011) the variance-covariance matrix that describes the uncertainty of the transformation parameters to weight the graph-based optimization task is used as identity matrix or, usually, computed by Cholesky solver.The Cholesky solver only carried out the inverse of   , being  the Jacobian of the condition equation with respect to the unknown parameters.Note that Cholesky solver ignores the posteriori variance factor ( ̂2 ) and the number of redundant observations in the system (r).To compensate this deficiency, we introduced the variance-covariance matrix  ̂ of the refined transformation parameters (  * ,   * ) in the graph optimization process.In this work,  ̂ is obtained jointly with the fine registration task, as follows: where  ̂02 =     , A represents the Jacobian of the condition equation with respect to the unknown parameters,  =   with B denoting the Jacobian of the condition equation with respect to the observations and  represents the residual vector.Note that, we introduced  ̂02 computed with respect to the residual vector and the redundancy, as well as the Jacobian of the condition equation with respect to the observations.

Experiments and Results
To demonstrate and evaluate the effectiveness of our proposed method three experiments were conducted.Firstly, we calibrate the Kinect device using the Matlab camera calibration toolbox (Bouguet, 2004).The Bouguet's method improved the technique originally proposed by Zhang (2000).Basically, the well-known photogrammetric bundle adjustment with self-calibration method is used to estimate the interior orientation parameters (IOPs), the relative translation and rotation between the RGB and IR sensors (∆,∆, ∆, ∆, ∆, ∆).The lens distortions are modeled using the Brown model (Brown, 1971).Table 2 presents the estimated IOPs and the relative translation and rotation between the IR and RGB sensors.Second, a sequence of handled data was captured in order to test the suitability of the devised method for 3D indoor mapping applications.A person carried the Kinect sensor by hand leaving a certain place, turned around and then returning to the same position.A total of three indoor scenarios were captured and processed.Table 3 summarizes the trajectory of the sensor obtained after the graph optimization, the number of nodes (  ) corresponding to the refined transformation parameters (  * ,   * ) and the number of edges () representing the constraints added in the graph by loop closure detection task.
The initial estimative of 2,6 cm for the baseline distances between the IR camera and RGB camera are close to the initial approximations obtained manually with micrometer equipment.In Table 2, note that the magnitude of radial distortions is larger in the RGB images.    is computed.The   is defined as the absolute difference between a horizontal ground distance and the corresponding distance measured in 3D map after the graph optimization.The horizontal ground distance is measured with a tape and the corresponding distance is manually measured in the 3D indoor map obtained after graph optimization.The highest error was found for scenario A. Note that the error of the graph optimized corresponds to 1,5% over the distance travelled by the sensor in the experiments A and C. For example, an error of 11,97 cm in the graph corresponds to a positional imprecision around 1,5% at the distance of 9 m travelled by sensor.The 3D indoor map generated with the proposed method for scenarios A and B are displayed in Figure 5 and Figure 6.As can be observed in Figure 6a, a laptop in the down left corner indicated by a red square box and all the stuffs in the living room are not adequately modeled before graph optimization.After graph optimization the scenario B is consistently optimized and the before mentioned stuffs can be clearly visualized, as presented in Figure 6b.The evaluated scenarios A and B are mainly composed of doors, kitchen and living room stuffs.However, we evaluated the proposed method for a rather extreme case, such as, the scenario C that is essentially composed of flat walls, as depicted in Figure 7.Note that after the graph optimization the accumulated errors (see black narrow region in Figure 7a) are practically attenuated achieving a consistent 3D indoor map, as visually verified in Figure 7b and Figure 7c.

Figure 1 :
Figure 1: Generic structure of the proposed method.

Figure 2 :
Figure 2: Description of the tasks to associate depth values with the centroid of the segmented region in the pair of RGB images.

Figure 3 :
Figure 3: Slide window strategy proposed by Hogman (2012).(a) Loop closure detection between current frame and the old four frames.(b) Window slides step forward representation to check frames.(c) Loop closure detection process until no more key frames exist.
The graph optimization aims to minimize the coarse-to-fine registration errors.To verify the consistency of the optimization problem the error function (  ,   ) is analysed.According Grisseti et al. (2007)  is a function that computes the difference between the relative position of the two nodes (  and   ), that represents   seen in the frame of   , and the ground observation (  ) manually measured using a tape.In practice, (  ,   ) is computed through the difference between the obtained trajectory of the sensor and   .Figure4shows (  ,   ) for scenarios A-C.In Figure4, (  ,   ) values closest to zero mean that the nodes   and   correspond to the ground observation.Note that, the largest magnitude of (  ,   ) was 15,9 cm found in the scenario B and the smallest magnitude of (  ,   ) was 2,7 cm for the scenario C. Once an edge is characterized for both (  ,   ) and the proposed  ̂ matrix, when a loop closure is detected a constraint represented by  ̂ between the nodes   and   is added to weight the graph optimization task.As expected, when a joint effort of SURF algorithm and disparity-based model is executed and the loop closure is detected small (  ,   ) values are achieved, as showed the scenario B. Usually, with textureless regions (scenarios A and C) the disparity-based model does not work appropriately and the  ̂ cannot be computed.Then, the algorithm automatically attribute to  ̂ the identity matrix.Consequently, are achieved high (  ,   ) values, as show the peaks in Figure4.

Figure 5 :
Figure 5: 3D indoor map for scenario A. (a) Scenario A before graph optimization, (b) Scenario A after graph optimization, (c) top view of the Scenario A before graph optimization and (d) top view of the Scenario A after graph optimization.

Figure 6 :
Figure 6: 3D indoor map for scenario B. (a) Scenario B before graph optimization, (b) Scenario B after graph optimization.

Figure 7 :
Figure 7: 3D indoor environment for scenario C. (a) top view of 3D scenario before graph optimization, (b) top view of 3D scenario after graph optimization and (c) a perspective view of 3D indoor map.

Table 1 :
Summarized description of the advantages and disadvantages of both the pairwise registration and graph optimization methods.

Table 2 :
Interior orientation parameters, boresight and lever arm misalignment.

Table 3 :
The number of nodes, edges and the trajectory of the sensor measured by the proposed method.
Third, in order to evaluate the global error of the graph optimization the root mean square Bulletin ofGeodetic Sciences, 24(3): 351-366, Jul-Sept, 2018