Automatic 3D reconstruction of buildings using high-resolution video sequences

Undertaken by:

  • Berlin University of Technology
    Institute of Computer Vision & Remote Sensing


  • Prof. Dr.-Ing. Olaf Hellwich    CV (PDF)
     

    Dr.-Ing. Volker Rodehorst    CV (PDF)
     

Summary

We propose a fully automated prototype system to refine building models of an urban 3D GIS using terrestrial image data. Digital video sequences hold great potential for such photogrammetric applications, which is presently not fully exploited. Projective geometry provides an effective mathematical framework to obtain geometrically precise information from partially calibrated cameras with varying parameters. It is currently being combined with object reconstruction methods to improve its performance in terms of automation.

The main novelty is an intensive integration of spatial and temporal image matching as well as geometric and reflectance information. The photogrammetric model is reconstructed partly from the neighboring images of a trifocal camera system and partly from the preceding and following images of the video sequence. The observed buildings consist of planes, polyhedrons and freely formed surfaces. We use generic models of surface assemblies together with particular scene constraints to extract, align and classify the polygonal primitives. Additionally, the knowledge-driven object extraction is used to support the image orientation and possibly to replace control information.

The approach is designed for off-line processing of recorded digital video, because the computational requirements of hand-held markerless video streams exceed the capabilities of today's real-time systems. We want to develop a mathematically homogeneous framework which incorporates photogrammetric model generation, temporal tracking, and reflectance analysis as well as object extraction and classification. Such systematic model generation from several trifocal views gives a digital video system its full potential.

Goals

Some interesting applications of urban 3D GIS require a level of detail that is currently not available from airborne data. Therefore, an automatic approach to acquire and/or refine buildings from terrestrial image data is proposed.

Digital video cameras provide dense sequences of images. These hold great potential for photogrammetric applications, which is presently not fully exploited. Projective geometry provides an effective mathematical framework to obtain geometrically precise information from partially calibrated cameras with varying parameters. It is currently being combined with object reconstruction methods to improve its performance in terms of automation and accuracy.

The main novelty is an intensive integration of feature extraction, image matching and orientation for video sequences, as well as modeling of surfaces with their reflectance characteristics. A mathematical optimization framework, aiming at correct output parameters as well as high computational speed, will combine partially calibrated and relatively oriented trifocal image geometry with temporal tracking in video sequences and estimation of surface geometry and reflectance to generate a photogrammetric model. In this scenario, the photogrammetric model is reconstructed partly from the neighboring images of the triple and partly from the preceding and following images of the sequence.

The trifocal video system allows a reliable photogrammetric model to be generated for each image triple with little computational effort, for two reasons. Firstly, each candidate triple of homologous points is likely to be correct, as the matching between the images of the triple is stabilized by tracking points in each of the video sequences. Secondly, the trifocal tensor allows each triple to be checked against the relative orientation of the cameras. Thus, the system basically acquires a three-dimensional image which is used in a tracking procedure. The system should be comparable to laser scanning but much faster, provided that the correspondence problem is solved reliably.
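As a rough illustration of the second point, the following Python sketch checks a candidate triple by transferring a matched point pair from the first two views into the third with the trifocal tensor. It assumes the canonical form with the first camera P1 = [I | 0]; the function names, the synthetic cameras and the tolerance are hypothetical and not part of the proposed system.

```python
import math

def trifocal_tensor(P2, P3):
    """Trifocal tensor T[i][j][k] for cameras P1 = [I | 0], P2, P3 (3x4 lists)."""
    return [[[P2[j][i] * P3[k][3] - P2[j][3] * P3[k][i]
              for k in range(3)]
             for j in range(3)]
            for i in range(3)]

def transfer_point(T, x1, x2):
    """Predict the homogeneous point in view 3 from a match (x1, x2)."""
    # Vertical line through x2; it must not coincide with the epipolar line.
    line2 = [x2[2], 0.0, -x2[0]]
    return [sum(x1[i] * line2[j] * T[i][j][k]
                for i in range(3) for j in range(3))
            for k in range(3)]

def triple_is_consistent(T, x1, x2, x3, tol):
    """True if the transferred point lies within tol of the observed x3."""
    p = transfer_point(T, x1, x2)
    if abs(p[2]) < 1e-12:  # degenerate transfer, cannot decide
        return False
    du = p[0] / p[2] - x3[0] / x3[2]
    dv = p[1] / p[2] - x3[1] / x3[2]
    return math.hypot(du, dv) <= tol

# Synthetic example (assumed values): cameras 2 and 3 are shifted along x and y.
P2 = [[1.0, 0.0, 0.0, -1.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
P3 = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, -1.0], [0.0, 0.0, 1.0, 0.0]]
T = trifocal_tensor(P2, P3)

x1 = [0.5, 0.2, 4.0]    # projection of X = (0.5, 0.2, 4) in view 1
x2 = [-0.5, 0.2, 4.0]   # same point seen by camera 2
x3 = [0.5, -0.8, 4.0]   # same point seen by camera 3
print(triple_is_consistent(T, x1, x2, x3, tol=1e-6))               # correct triple
print(triple_is_consistent(T, x1, x2, [0.5, 0.8, 4.0], tol=1e-6))  # wrong match
```

The vertical line through the point in view 2 is an arbitrary choice; a practical implementation would pick a line well away from the epipolar line to keep the transfer numerically stable.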

The resulting photogrammetric models are combined into a metric model of the scene. According to a generic object model, buildings consist of planes, polyhedrons and freely formed surfaces. Here, planes and polyhedrons are extracted from the data as generic features which can be combined into objects using knowledge-driven image analysis approaches. They are extracted with a fast approach previously used for image segmentation, here applied to the data in three-dimensional photogrammetric model space, which results in a classification of generic surface patches including the types and classes mentioned above. On the one hand, this allows image segmentation to be used in the tracking procedure, helping to select views of buildings with no or only few occlusions. On the other hand, a mathematically homogeneous framework incorporates photogrammetric model generation and temporal tracking as well as object extraction and classification.

Such systematic generation of a photogrammetric model from several trifocal views gives a digital video system its full potential. An important question with respect to this is how much control information can be replaced by knowledge-driven object extraction.

We develop a working, fully automated prototype system to recover several buildings. A software package allowing the automatic generation of a 3D model including the specific features mentioned above will be made publicly available. The results will be evaluated with respect to varying building types and the geometric accuracies achieved.

Practically useful results can only be derived when the accuracy achieved fulfills photogrammetric standards. At the same time, only automatic processing can make use of the full potential of video sequences. However, the computational requirements of hand-held markerless video streams exceed the capabilities of real-time systems. Therefore, the proposed approach is designed for off-line processing of recorded digital video. In the second phase of the project, preliminary real-time processing and display of acquired surfaces on a head-mounted display for completeness control may be considered.

The proposed approach has more generic potential than the acquisition of building models for GIS alone. Fulfilling photogrammetric accuracy and being able to model surfaces with various reflectance properties is generic from a methodological point of view and allows various applications. In this respect, the project contributes to object reconstruction from imagery in general.

Work schedule

Here we subdivide our approach into several parts. Ultimately, however, we show that these tasks have to be solved simultaneously in order to obtain optimum results.

Part A: Sensor Preparation and Image Acquisition

The department's trifocal video camera system will be used for the investigations (cf. section 2.2). The image sequences should be acquired from positions extending completely around one or more buildings. For mobile outdoor use in spatially extended, complex environments with uncontrolled lighting conditions, some modifications and extensions are required. From a practical point of view, the portable notebooks do not provide a transfer rate sufficient to save all the video data to a hard disk in real time, so we have to use multiple notebooks (one for each camera). In contrast to our stationary solution, we now have to solve a synchronization problem and replace the power supplies with batteries. Depending on new insights and techniques, continuous improvements of the prototype system are planned.

Summary of the tasks:

  • Sensor preparation for mobile outdoor use
  • Acquisition of dense image sequences using the trifocal video camera system

Part B: Integration of Stereo, Motion and Reflectance

After the acquisition of suitable data, we extract robust feature points with a modified version of the FÖRSTNER operator [ROD04] to search for corresponding candidates. In stereo vision, the matching task is most challenging for wide-baseline constellations. Here, it is replaced by a computationally efficient integration of spatial stereo using relatively oriented image triples and tracking in image sequences using temporal motion. Both methods take advantage of narrow baselines, which may be extended by concatenating the matched features in the sequence over time.

Points in successive images are tracked with a 3D motion model. Assuming that the scene is stationary (rigid body constraint) and the motion is smooth (compared to the video rate), the disparity distributions may be estimated with a KALMAN filter, which is expected to be robust to occlusions. All feature locations, camera positions, viewing directions, velocities and accelerations (rotational and translational) can be predicted by an extended KALMAN filter using the estimated 3D locations of the points from two successive stereo results. The algorithm is iterative, using the 3D motion parameters from the KALMAN filter to re-estimate the stereo correspondences. The iteration procedure stops when a global cost function is minimized.
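The predict/update cycle can be sketched, under strong simplifications, as a linear constant-velocity KALMAN filter for a single image coordinate of one tracked feature. The proposed system would use an extended filter over 3D structure and camera motion; the noise values, the diagonal process noise and the function name below are assumptions chosen for illustration only.

```python
def kalman_step(state, cov, z, dt=1.0, q=1e-3, r=0.25):
    """One predict/update step for the state (position, velocity).

    state: (p, v); cov: 2x2 covariance as nested tuples;
    z: measured position, or None if the feature is occluded in this frame.
    q and r are assumed process and measurement noise variances.
    """
    p, v = state
    (a, b), (c, d) = cov

    # Predict with the constant-velocity model: P' = F P F^T + q * I.
    p_pred = p + dt * v
    v_pred = v
    a_p = a + dt * (b + c) + dt * dt * d + q
    b_p = b + dt * d
    c_p = c + dt * d
    d_p = d + q

    if z is None:
        # No measurement: carry the prediction forward through the occlusion.
        return (p_pred, v_pred), ((a_p, b_p), (c_p, d_p))

    # Update with the position measurement (H = [1, 0]).
    s = a_p + r                  # innovation variance
    k0, k1 = a_p / s, c_p / s    # Kalman gain
    y = z - p_pred               # innovation
    new_state = (p_pred + k0 * y, v_pred + k1 * y)
    new_cov = (((1 - k0) * a_p, (1 - k0) * b_p),
               (c_p - k1 * a_p, d_p - k1 * b_p))
    return new_state, new_cov

# Feature moving at 2 pixels per frame, occluded in frames 8 to 10.
state, cov = (0.0, 0.0), ((10.0, 0.0), (0.0, 10.0))
for t in range(1, 21):
    z = None if 8 <= t <= 10 else 2.0 * t
    state, cov = kalman_step(state, cov, z)
print(state)  # estimated (position, velocity) after 20 frames
```

During the occluded frames the filter simply propagates the prediction, which is the sense in which such a tracker can bridge short occlusions.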

Man-made objects such as the buildings investigated in this project comparatively often show non-Lambertian reflectance characteristics, e.g. specular object parts. In our approach, we want to detect critical surface reflectance in order to avoid negative influences on the estimation of camera orientation and motion parameters.

Non-rigid objects may be detected from conflicts between the stereo and motion results. For the refinement of architectural models, these moving parts should be eliminated.

Summary of the tasks:

  • Fusion of the algorithms for spatial trinocular stereo matching, temporal motion tracking and the reflectance characteristics of surface patches
  • Detection and elimination of non-rigid objects
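
The conflict test between stereo and motion results mentioned in this part can be sketched as follows: points triangulated by stereo at two successive time steps are compared after applying the estimated rigid motion, and points with a large residual are flagged as non-rigid. The function name, the threshold and the sample coordinates are hypothetical; a real system would derive the threshold from the estimated point uncertainties.

```python
import math

def flag_non_rigid(points_t0, points_t1, R, t, threshold=0.05):
    """Flag points whose motion disagrees with the rigid motion (R, t).

    points_t0, points_t1: matched 3D points from stereo at two time steps.
    R: 3x3 rotation (nested lists), t: translation, both assumed to come
    from the motion estimation. threshold is in scene units (assumed).
    """
    flags = []
    for X0, X1 in zip(points_t0, points_t1):
        # Where the point should be if it belonged to the static scene.
        predicted = [sum(R[r][c] * X0[c] for c in range(3)) + t[r]
                     for r in range(3)]
        flags.append(math.dist(predicted, X1) > threshold)
    return flags

R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.1, 0.0, 0.0]
points_t0 = [[0.0, 0.0, 5.0], [1.0, 0.0, 5.0], [2.0, 1.0, 6.0]]
points_t1 = [[0.1, 0.0, 5.0], [1.1, 0.0, 5.0], [2.1, 1.4, 6.0]]  # last point moved on its own
print(flag_non_rigid(points_t0, points_t1, R, t))
```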

Part C: Image Orientation and Camera Calibration

This part addresses the remaining difficulties in the evaluation of long video sequences. The KALMAN filter will be used to estimate calibration and orientation parameters simultaneously. The image orientation results are in turn used to support the matching task.

There are many theoretical and practical challenges in using video sequences from non-metric cameras in the framework of the terrestrial bundle solution, i.e. for photogrammetric purposes. So far, the implemented methods are global and require that all feature points are seen in all views. View stitching should be realized with an incremental algorithm to handle long image sequences and to reduce the computational costs.

We selected robust versions of the geometric and numerical algorithms, but still have difficulties with ill-conditioned or degenerate data. Critical configurations for sets of six points may occur if the projection center of the camera remains on a certain quadratic surface [MAY98, HAR04]. We want to extend the estimation of the trifocal tensor to geometric situations where the linear algorithm is degenerate.

In sequences spanning distances over several buildings, errors accumulate over time. The expected drift can only be removed using control information. The automatic recognition of urban GIS contents (i.e. buildings) provides such additional information. This follows the traditional geodetic procedure of first establishing an outer reference framework and then densifying information inside the framework using large-scale data. Therefore, the project's results interface with the other projects of the bundle proposal.

In the first period of the project, the initial starting position has to be identified by a human operator. To make the system fully automated, the trifocal sensor may be extended in the second period with simple instrumentation for global positioning (GPS) and heading information (IMU), providing global control information.

Summary of the tasks:

  • Implementation of an incremental view stitching algorithm to handle long sequences
  • Extension of the structure-from-motion approach
  • Automatic matching of GIS control information with the reconstructed point clouds to prevent drift errors

Part D: Knowledge-Driven Surface Patch Detection, Extraction, and Classification

We assume that urban scenes exhibit some regular geometric structure. From the 3D points located on the surface of the object of interest (e.g. windows, doors, ledges and balconies), planes, polyhedrons and freely formed surfaces are generated. We use generic models together with particular scene constraints to fit the spatial data and align the geometric primitives. An efficient technique involving sweeping of polygonal primitives is presented in [WER02].

The question of whether an object is to be approximated by planes, polyhedrons or freely formed surfaces will be answered with the help of a minimum description length (MDL) classification criterion. The polyhedral object recognition may be improved with the statistical framework of uncertain projective geometry from HEUEL [HEU2004].
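To illustrate the idea behind an MDL criterion, the sketch below compares surface models by an approximate two-part code length: a data term that falls as the fit improves plus a model term that grows with the number of parameters. This BIC-like approximation under Gaussian noise is an assumption made here, not necessarily the criterion the project will use, and the residuals and parameter counts are invented sample values.

```python
import math

def description_length(rss, n_points, n_params):
    """Approximate two-part code length (in nats) for a fitted surface model."""
    rss = max(rss, 1e-12)  # guard against log(0) for perfect fits
    data_cost = 0.5 * n_points * math.log(rss / n_points)
    model_cost = 0.5 * n_params * math.log(n_points)
    return data_cost + model_cost

def select_model(candidates, n_points):
    """candidates maps a model name to (residual sum of squares, parameter count)."""
    return min(
        candidates,
        key=lambda name: description_length(
            candidates[name][0], n_points, candidates[name][1]
        ),
    )

# Hypothetical fits to 200 points: a freeform surface barely improves on a
# plane for a flat wall, but fits a curved roof far better.
flat_wall = {"plane": (0.52, 3), "freeform": (0.50, 10)}
curved_roof = {"plane": (20.0, 3), "freeform": (0.50, 10)}
print(select_model(flat_wall, 200))
print(select_model(curved_roof, 200))
```

The simpler model wins whenever the extra parameters of the richer one do not buy a sufficiently large reduction in residual.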

For building reconstruction, suitable parts of the point clouds in object space that are not influenced by occlusions have to be selected. This selection can be based on image segmentation as well as spatial reasoning in 3D space. This step is closely related to the preceding tracking [FOW04].

An important part of the investigation will be to answer the question of how closely sensor orientation, surface reconstruction and surface classification can be integrated into a simultaneous estimation procedure using geometric and radiometric information components and handling redundancy. As the image data is determined by object geometry and reflectance, the estimation of both components regularly leads to an ill-posed problem. As the large number of images acquired here usually allows object geometry to be reconstructed using multi-view stereo based on matched point features, surface reflectance characteristics can be estimated. The strict estimation of object surface reflectance would ultimately allow the systematic acquisition of bidirectional reflectance distribution function (BRDF) parameters of object surfaces. Here, we will develop an estimation scheme with the general potential to achieve this goal, yet we will not emphasize this topic by systematic data acquisition, i.e. reasonable assumptions will be made, such as Lambertian, specular or mixed reflectance characteristics.

Summary of the tasks:

  • Generation of planes and polyhedrons
  • Detection and fitting of freely formed surfaces
  • Classification of surfaces using geometric and radiometric information
  • Automatic selection of image triples suitable for reconstruction

Part E: Integration and Evaluation

If large parts of an image sequence are not suitable for reconstruction (e.g. because of blurring and occlusions), a wide-baseline approach may be used to span the gap. We are interested in evaluating different implementations, e.g. first results of WOLFGANG FÖRSTNER and HELMUT MAYER, in comparison to our own approach [ROD04].

The automated extraction focusing on buildings interacts with the project of WOLFGANG FÖRSTNER. The observed buildings consist of planes and polyhedrons. Our extracted polygonal primitives should be useful for the group of WOLFGANG FÖRSTNER.

The acquisition of 3D information from image sequences interacts with the project of HELMUT MAYER. Our classified results for vegetation surfaces may be useful for their group. On the other hand, the knowledge about trees helps us to avoid problems with occlusions. To support our knowledge-driven object extraction task, the surface models of the reconstructed trees should also be integrated.

The acquisition and exchange of Chinese data will be done together with the colleagues of ZUXUN ZHANG et al. They want to develop extensions in the form of orientation based on lines, which may be helpful for our extraction of planes and polyhedrons. Additionally, our developed techniques and results on extracted and classified surface assemblies should be useful for the group of ZUXUN ZHANG et al.

Finally, we will acquire some ground-truth data for evaluation with respect to varying building types and the geometric accuracies achieved.

Summary of the tasks:

  • Evaluation and integration of wide baseline matching
  • Integration of tree model information
  • Generation of ground-truth data and evaluation of the results