
Research On Point Cloud Semantic Segmentation Technology Based On Cross-Modal Learning

Posted on: 2024-06-25
Degree: Master
Type: Thesis
Country: China
Candidate: S X Zhao
GTID: 2568307151953429
Subject: Computer Science and Technology
Abstract/Summary:
Point clouds, voxels, and depth maps are important representations of 3D spatial structure. Fusing these modalities lets each contribute its own strengths, enabling more accurate 3D scene analysis. Point cloud semantic segmentation is widely used in autonomous driving, intelligent robotics, and other fields, and is currently an active research direction. This thesis applies cross-modal learning techniques to the challenges of point cloud semantic segmentation. The main work and innovations are as follows:

(1) To address the difficulty of extracting key features from point clouds and the resulting low segmentation accuracy, FVT-Net (Fast Voxel Transformer), a point cloud semantic segmentation network that combines voxelization with a Transformer, is proposed. A 3DVoxel_Hash_Octree structure is designed to filter out empty voxels, build a spatial position index for the voxels, and map each index to the subset of points inside that voxel, which reduces the complexity of multi-scale neighborhood search and feature embedding. The network's voxel attention module and point-cloud-aware voxelized sampling module enable cross-modal learning between points and voxels. In addition, the voxel spatial position index serves as the position embedding of the voxel attention module, preserving the original spatial information of each voxel. FVT-Net can effectively handle point cloud classification and segmentation in complex scenes.

(2) To address the training difficulty caused by sparse point cloud datasets, a projection dataset of ModelNet40 is constructed, and 3D-Clip, a cross-modal contrastive learning network built on CLIP (Contrastive Language-Image Pre-training), a model pre-trained in the 2D image domain, is proposed. In this network, a voxel perspective projection module converts point cloud data into image data, enabling knowledge transfer among point clouds, images, and text, and zero-shot classification of point cloud data without any prior knowledge. To take full advantage of the complementarity between modalities, FVTClip-Net is designed; it integrates 3D-Clip with the FVT-Net backbone and improves point cloud classification performance by 4.5% over the baseline network PointNet.

(3) To remove the dependence of existing point cloud semantic segmentation methods on manually annotated datasets, FVTregular-Net, a self-supervised point cloud learning network based on cross-modal contrastive learning between points and images, is designed. The network uses a Siamese (twin) neural network architecture with a point cloud branch, an image branch, and a cross-modal branch. To prevent the feature collapse caused by improper selection of positive and negative samples in self-supervised learning, regularization constraints are imposed on the feature vectors output by these three branches. FVTregular-Net is pre-trained on ShapeNet and its rendered dataset, and the encoder of the point cloud branch is then retained for downstream tasks, enabling classification and segmentation of point cloud data with few samples.
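Contribution (1) relies on hashing only the occupied voxels, so empty space is never stored and neighborhood queries become constant-time lookups. The abstract does not describe the internals of 3DVoxel_Hash_Octree (in particular its octree levels), so the following is only a minimal sketch of the general sparse voxel hashing idea, assuming a single uniform voxel size; `build_voxel_hash` and `occupied_neighbors` are hypothetical names.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]
VoxelKey = Tuple[int, int, int]


def build_voxel_hash(points: Sequence[Point], voxel_size: float) -> Dict[VoxelKey, List[int]]:
    """Map each point to its integer voxel key (hypothetical sketch, not 3DVoxel_Hash_Octree).

    Only occupied voxels receive an entry, so empty voxels are never stored.
    Each key maps to the indices of the points inside that voxel, i.e. a
    voxel-index -> point-subset mapping.
    """
    table: Dict[VoxelKey, List[int]] = defaultdict(list)
    for idx, (x, y, z) in enumerate(points):
        key = (math.floor(x / voxel_size),
               math.floor(y / voxel_size),
               math.floor(z / voxel_size))
        table[key].append(idx)
    return dict(table)


def occupied_neighbors(table: Dict[VoxelKey, List[int]], key: VoxelKey,
                       radius: int = 1) -> List[VoxelKey]:
    """Return occupied voxels in the (2r+1)^3 cube around `key`, including `key` itself.

    Each candidate is an O(1) hash lookup, so the cost depends on the
    neighborhood size rather than on the total number of points.
    """
    cx, cy, cz = key
    found = []
    for dx in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            for dz in range(-radius, radius + 1):
                candidate = (cx + dx, cy + dy, cz + dz)
                if candidate in table:
                    found.append(candidate)
    return found


if __name__ == "__main__":
    pts = [(0.1, 0.1, 0.1), (0.2, 0.3, 0.1), (1.5, 0.1, 0.1), (5.0, 5.0, 5.0)]
    table = build_voxel_hash(pts, voxel_size=1.0)
    print(table)                                 # {(0, 0, 0): [0, 1], (1, 0, 0): [2], (5, 5, 5): [3]}
    print(occupied_neighbors(table, (0, 0, 0)))  # [(0, 0, 0), (1, 0, 0)]
```

A multi-scale search could be approximated by building one table per voxel size; whether FVT-Net does this is not stated in the abstract.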
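Contribution (2) converts point clouds into images with a voxel perspective projection module before passing them to CLIP. The module's exact design is not given in the abstract; as an assumption, the sketch below uses a plain pinhole camera with a z-buffer, projecting points already expressed in the camera frame onto a depth map. The function name, the centered principal point, and the `focal` parameter are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]
DepthMap = List[List[Optional[float]]]


def project_to_depth(points: Sequence[Point], width: int, height: int,
                     focal: float) -> DepthMap:
    """Project camera-frame points (camera looks along +z) onto a depth map.

    Pixel coordinates follow the pinhole model u = focal * x / z + cx,
    v = focal * y / z + cy, with the principal point (cx, cy) at the image
    center. If several points land on the same pixel, the nearest one
    (smallest z) is kept, as in a z-buffer. Pixels hit by no point stay None.
    """
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    depth: DepthMap = [[None] * width for _ in range(height)]
    for x, y, z in points:
        if z <= 0:
            continue  # point is behind the camera
        u = int(round(focal * x / z + cx))
        v = int(round(focal * y / z + cy))
        if 0 <= u < width and 0 <= v < height:
            current = depth[v][u]
            if current is None or z < current:
                depth[v][u] = z
    return depth


if __name__ == "__main__":
    cloud = [(0.0, 0.0, 2.0), (0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
    for row in project_to_depth(cloud, width=5, height=5, focal=1.0):
        print(row)
```

Rendering several views would amount to rotating the cloud before each call; how the resulting depth maps are turned into CLIP's three-channel image input is not specified in the abstract.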
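Zero-shot classification in contribution (2) follows the general CLIP recipe: embed the projected views with the image encoder, embed one text prompt per class with the text encoder, and pick the class whose text embedding is most similar. The sketch below assumes the embeddings have already been computed (no real CLIP model is called) and that per-view scores are averaged; `zero_shot_classify` and its `temperature` default are hypothetical choices, not values from the thesis.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def zero_shot_classify(view_embeddings: Sequence[Vector],
                       class_text_embeddings: Sequence[Vector],
                       temperature: float = 0.01) -> Tuple[int, List[float]]:
    """Score each class by its cosine similarity to the views, averaged over views.

    Returns the index of the best-scoring class and softmax probabilities
    over classes (logits = mean cosine similarity / temperature).
    """
    views = [l2_normalize(v) for v in view_embeddings]
    texts = [l2_normalize(t) for t in class_text_embeddings]
    scores = []
    for t in texts:
        sims = [sum(a * b for a, b in zip(v, t)) for v in views]
        scores.append(sum(sims) / len(sims))
    logits = [s / temperature for s in scores]
    m = max(logits)  # subtract the max before exp for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


if __name__ == "__main__":
    views = [[1.0, 0.0], [0.9, 0.1]]        # made-up image embeddings of two projected views
    class_texts = [[0.0, 1.0], [1.0, 0.0]]  # made-up text embeddings of two class prompts
    best, probs = zero_shot_classify(views, class_texts)
    print(best)  # 1
```

Averaging over views is one simple way to fuse multi-view evidence; other aggregation rules (for example, weighted or max pooling) are equally plausible and the abstract does not say which 3D-Clip uses.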
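Both 3D-Clip in contribution (2) and the point and image branches in contribution (3) are trained with cross-modal contrastive learning, but the abstract does not give the loss. A common choice for paired embeddings is the symmetric InfoNCE objective popularized by CLIP; the sketch below assumes that row i of the point batch and row i of the image batch form the positive pair, with all other rows acting as negatives. `info_nce` and the 0.07 temperature are illustrative, not taken from the thesis.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def _normalize(v: Vector) -> List[float]:
    # Assumes non-zero embeddings; a zero vector would divide by zero here.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _cross_entropy_diagonal(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def info_nce(point_emb: Sequence[Vector], image_emb: Sequence[Vector],
             temperature: float = 0.07) -> float:
    """Symmetric InfoNCE between two equally sized batches of paired embeddings."""
    p = [_normalize(v) for v in point_emb]
    q = [_normalize(v) for v in image_emb]
    n = len(p)
    # logits[i][j] = cosine(point_i, image_j) / temperature
    logits = [[sum(a * b for a, b in zip(p[i], q[j])) / temperature for j in range(n)]
              for i in range(n)]
    transposed = [[logits[j][i] for j in range(n)] for i in range(n)]
    # Average the point -> image and image -> point directions.
    return 0.5 * (_cross_entropy_diagonal(logits) + _cross_entropy_diagonal(transposed))


if __name__ == "__main__":
    points = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]   # each image matches its paired point cloud
    swapped = [[0.0, 1.0], [1.0, 0.0]]   # pairs are mismatched
    print(info_nce(points, aligned) < info_nce(points, swapped))  # True
```

Because negatives come from the other samples in the batch, a poor choice of positives and negatives directly shapes this loss, which is the failure mode contribution (3) targets with its regularization.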
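Contribution (3) imposes regularization constraints on the outputs of the three branches to prevent feature collapse, but the abstract does not say which constraints. One well-known constraint of this kind is the variance and covariance regularization from VICReg, which keeps each feature dimension's standard deviation above a threshold and decorrelates dimensions. The sketch below implements those two terms as an assumed stand-in, not as FVTregular-Net's actual loss; the function names and weights are hypothetical.

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]  # rows = samples in a batch, columns = feature dimensions


def variance_term(z: Matrix, gamma: float = 1.0, eps: float = 1e-4) -> float:
    """Mean hinge max(0, gamma - std) over dimensions; needs at least two samples."""
    n, d = len(z), len(z[0])
    loss = 0.0
    for j in range(d):
        col = [z[i][j] for i in range(n)]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / (n - 1)
        loss += max(0.0, gamma - math.sqrt(var + eps))
    return loss / d


def covariance_term(z: Matrix) -> float:
    """Sum of squared off-diagonal covariances, divided by the dimension d."""
    n, d = len(z), len(z[0])
    means = [sum(z[i][j] for i in range(n)) / n for j in range(d)]
    centered = [[z[i][j] - means[j] for j in range(d)] for i in range(n)]
    loss = 0.0
    for a in range(d):
        for b in range(d):
            if a != b:
                cov = sum(centered[i][a] * centered[i][b] for i in range(n)) / (n - 1)
                loss += cov ** 2
    return loss / d


def anti_collapse_penalty(branch_outputs: List[Matrix],
                          var_weight: float = 25.0, cov_weight: float = 1.0) -> float:
    """Apply both terms to each branch (e.g. point, image, cross-modal) and sum them."""
    return sum(var_weight * variance_term(z) + cov_weight * covariance_term(z)
               for z in branch_outputs)


if __name__ == "__main__":
    collapsed = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
    spread = [[-2.0, 0.0], [2.0, 0.0], [0.0, 2.0], [0.0, -2.0]]
    print(anti_collapse_penalty([collapsed]) > anti_collapse_penalty([spread]))  # True
```

A fully collapsed batch (every sample mapped to the same vector) drives each dimension's standard deviation toward zero, so the variance term approaches `gamma` and the penalty grows; a well-spread, decorrelated batch incurs none.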
Keywords/Search Tags: point cloud semantic segmentation, cross-modal learning, Transformer, voxel, self-supervised learning