PERCEPTION

Deep learning for 3D vision

Yulan Guo, Hanyun Wang, Ronald Clark, Stefano Berretti, Mohammed Bennamoun

Year: 2022
Citations: 4

Abstract

With the rapid development of 3D imaging sensors, such as depth cameras and laser scanning systems, 3D data has become increasingly accessible. Meanwhile, advances in deep learning algorithms, such as convolutional neural networks and transformers, have further increased the usability of 3D vision systems. Driven by these factors, 3D vision has become a core component of numerous applications, such as autonomous driving, augmented reality, virtual reality and robotics. Although remarkable progress has been achieved in this area during the last few years, several challenges remain to be addressed, such as the noisy, sparse, and irregular nature of point clouds, the high cost of labelling 3D data and the need to integrate geometry-based and learning-based techniques. In addition, 3D data produced by different 3D imaging sensors (e.g. structured light, stereo, LiDAR and time-of-flight) can differ substantially. It is, therefore, necessary to investigate general algorithms that can mitigate the domain gap between different types of 3D data. This special issue aims to collect and present the latest research developments in learning-based 3D vision theories and their applications and to inspire future research in this area. In total, eight papers were accepted for publication in this special issue after careful peer review and revision. These accepted papers are broadly categorised into three topics, and a summary of each topic is given below.

Han et al., in their paper ‘DEMVSNet: Denoising and Depth Inference for Unstructured Multi-View Stereo on Noised Images’, proposed DEMVSNet to simultaneously address the depth estimation and image denoising problems for unstructured multi-view stereo. The multi-scale feature maps of each image are warped, through differentiable homography and Gaussian probability mapping, to construct cost volumes containing both depth and RGB information. A cost volume regularisation module is then adopted to predict the probability of depth and RGB. To avoid overfitting in multi-task learning, the gradient normalisation algorithm is used to dynamically tune the weights of the depth prediction task and the denoising task. To evaluate the performance of the proposed DEMVSNet, a noisy Technical University of Denmark (DTU) dataset is generated by adding Gaussian-Poisson noise to each image, and the experimental results demonstrate the superiority of DEMVSNet on both the denoising and multi-view stereo reconstruction tasks.

Lin et al., in their paper ‘EAGAN: Event-Based Attention Generative Adversarial Networks for Optical Flow and Depth Estimation’, proposed an event-based attention generative adversarial network named EAGAN to simultaneously estimate optical flow and depth from a monocular event camera. The generator of EAGAN is similar to U-Net except that a transformer structure is introduced between the encoder and decoder. The position-coding features learnt by the transformer are added to the features learnt by the encoding layer, which helps to capture correlations in the sequence information. The discriminator of EAGAN is based on a fully convolutional network and aims to determine whether a depth image or an optical flow image was produced by the generator. Experiments conducted on the Multi-Vehicle Stereo Event Camera (MVSEC) dataset demonstrate the effectiveness of EAGAN on both the depth and optical flow estimation tasks.

Gao et al., in their paper ‘Efficient 6D Object Pose Estimation based on Attentive Multi-Scale Contextual Information’, proposed an end-to-end 6D pose estimation network that utilises multi-scale contextual features learnt from two heterogeneous data sources. First, objects of interest are detected in an RGB-D image using an existing semantic segmentation method. Then, pixel-wise geometric and colour features are learnt from 3D point clouds and 2D images, respectively.
Next, three pixelwise feature attention mecha
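
The abstract mentions that the noisy DTU dataset was generated by adding Gaussian-Poisson noise to each image, without giving the noise parameters. As an illustration only, the sketch below shows one common way such mixed noise is simulated: signal-dependent Poisson (shot) noise followed by signal-independent Gaussian (read) noise. The function name `add_gaussian_poisson_noise` and the parameters `peak` and `read_sigma` are assumptions made for this example and are not taken from the DEMVSNet paper.

```python
import numpy as np


def add_gaussian_poisson_noise(image, peak=30.0, read_sigma=0.02, seed=None):
    """Corrupt an image with values in [0, 1] using mixed Poisson-Gaussian noise.

    `peak` (expected photon count at full intensity) and `read_sigma`
    (standard deviation of the Gaussian term) are illustrative values,
    not the settings used by the DEMVSNet authors.
    """
    rng = np.random.default_rng(seed)
    clean = np.clip(np.asarray(image, dtype=np.float64), 0.0, 1.0)
    # Signal-dependent part: treat scaled intensities as expected photon counts.
    shot = rng.poisson(clean * peak) / peak
    # Signal-independent part: zero-mean Gaussian noise added on top.
    noisy = shot + rng.normal(0.0, read_sigma, size=clean.shape)
    return np.clip(noisy, 0.0, 1.0)


if __name__ == "__main__":
    clean = np.full((64, 64, 3), 0.5)  # flat grey test image
    noisy = add_gaussian_poisson_noise(clean, seed=0)
    print(noisy.shape, float(noisy.mean()))
```

A lower `peak` gives stronger shot noise, and a larger `read_sigma` gives stronger Gaussian noise; fixing `seed` makes the corruption reproducible across runs.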

Keywords

Computer science, Artificial intelligence, Deep learning, Stereopsis, Robotics, Point cloud, Convolutional neural network, Lidar, Computer vision, Augmented reality
