| 3D reconstruction from images is a classic computer vision problem. Both depth-camera-based and traditional 3D reconstruction have advantages and disadvantages. Depth-camera-based reconstruction performs poorly under occlusion and shadow and is difficult to apply to the 3D reconstruction of real-time scenes; traditional reconstruction often gives unsatisfactory results because of weak texture and specular reflection on non-Lambertian surfaces. In recent years, with the rapid development of deep learning, 3D reconstruction using deep learning has attracted increasing attention. Multi-view stereo (MVS) has also made tremendous progress in recent decades. It is widely used in applications such as autonomous driving, robot navigation, remote sensing and the digitization of movable cultural relics. It also has theoretical research significance and considerable application value in the creation of smart cities, VR tourism, the protection of ancient architectural heritage and machine navigation. To address the unsatisfactory overall quality of multi-view stereo reconstruction, this thesis studies the feature extraction module and the cost volume regularization module of the MVS reconstruction pipeline. First, a feature pyramid network extracts deep features from the input source images and the reference image, and an attention layer is added to each feature extraction module to capture the long-range dependencies needed for depth inference. Then, the feature volumes in the reference camera frustum are built through differentiable homography warping and used to construct the cost volume. Finally, a multi-layer U-Net architecture regularizes the cost volume, and the edge information of the reference image is fused through a regression operation to generate the final refined depth map. Tests are carried out on the DTU (Technical University of Denmark) dataset. Compared with the MVSNet* method, the overall, accuracy and completeness indicators of this thesis improve by 2.9%, 5.4% and 0.4% respectively. The experimental results show that the proposed network architecture outperforms MVSNet on the overall indicator, that the accuracy and completeness indicators also improve, and that a better 3D reconstruction result is obtained, which demonstrates the effectiveness of the method in this thesis. The main work is as follows. (1) The network is based on the classic MVSNet framework, and a feature pyramid network combined with an attention mechanism is proposed for the feature extraction module. The algorithm uses both the rich spatial location information of low-level image features and the strong semantic information of top-level features, joins them through lateral connections, and obtains predictions by fusing these different levels of features. Predictions are made separately on each fused feature layer, and an attention layer is added after each prediction layer for depth inference. (2) In the cost volume regularization module, a multi-layer U-Net (MU-Net) network is proposed that down-samples the cost volume and extracts context information and adjacent-pixel information at different scales to filter it; the edge information of the reference image is combined at the same time through the regression operation to generate the final refined depth map. (3) Visualization results on the BlendedMVS dataset and on self-collected datasets demonstrate the generalization ability of the network. |
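
The cost volume construction and the depth regression summarized above can be illustrated with a small NumPy sketch. It assumes, following the original MVSNet design rather than anything stated in this abstract, that the cost volume is the variance of the warped feature volumes across views and that the regression operation is a soft argmin over the depth hypotheses. The function names and array shapes are hypothetical; homography warping, the attention layers and the MU-Net regularization are omitted.

```python
import numpy as np


def variance_cost_volume(warped_features):
    """Aggregate per-view feature volumes into one cost volume.

    warped_features: array of shape (V, C, D, H, W), where each of the V
    feature volumes is assumed to be already warped into the reference
    camera frustum (C channels, D depth hypotheses, H x W pixels).
    Returns the variance across views, shape (C, D, H, W).
    """
    volumes = np.asarray(warped_features, dtype=np.float64)
    mean = volumes.mean(axis=0)
    return ((volumes - mean) ** 2).mean(axis=0)


def soft_argmin_depth(cost, depth_values):
    """Regress a depth map from a regularized single-channel cost volume.

    cost: array of shape (D, H, W); a lower cost means a better match.
    depth_values: the D depth hypotheses.
    Returns the expected depth per pixel, shape (H, W).
    """
    logits = -np.asarray(cost, dtype=np.float64)
    # Subtract the per-pixel maximum for a numerically stable softmax.
    logits = logits - logits.max(axis=0, keepdims=True)
    prob = np.exp(logits)
    prob = prob / prob.sum(axis=0, keepdims=True)
    depths = np.asarray(depth_values, dtype=np.float64).reshape(-1, 1, 1)
    return (prob * depths).sum(axis=0)
```

Under these assumptions, views that agree at a depth hypothesis produce low variance there, and the soft argmin turns the resulting per-pixel probability distribution into a continuous depth estimate instead of a hard index.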