| According to the "Analysis Report on Production Safety Accidents (Incidents) of State Grid Corporation of China in 2021", interference from external factors such as construction vehicles, trees, bird damage, and foreign objects is one of the important causes of transmission line failures. Inspecting external targets beneath transmission lines is therefore crucial for keeping the power grid safe and stable, and judging the distance between external targets and the lines is an important part of that inspection work. With the rapid development of computer vision technology, unmanned inspection methods such as fixed inspection cameras and drone inspections are gradually replacing manual inspections. However, current fixed-camera schemes mostly rely on depth cameras for ranging and drone schemes mostly rely on LiDAR, which makes unmanned inspection expensive to operate and maintain. To address this problem, we introduce monocular depth estimation into unmanned inspection tasks in the power field. It computes distances that meet the accuracy requirements directly from the collected images, reducing the dependence on ranging hardware such as depth cameras and LiDAR and, in turn, lowering usage and maintenance costs. Currently, mainstream monocular depth estimation algorithms mostly use encoder-decoder architectures based on convolutional neural networks. We analyze the shortcomings of these methods when applied to transmission line scenes and summarize the challenges of monocular depth estimation in this setting, namely: how to avoid the loss of fine-grained information caused by convolutional neural networks expanding the receptive field through downsampling, how to improve the low prediction accuracy for distant small targets, and how to ensure the prediction consistency of the depth map
while preserving the local detail in the depth map. To address these challenges, we make three contributions. First, we use a Transformer architecture to model the depth prediction process in the image information processing module, so that fine-grained information is fully used while a global receptive field is maintained. Second, for distant small targets in practical scenes, we implement adaptive selection of multi-scale features in the adaptive patch embedding module, allowing the model to learn the scale information most relevant to each distant small target and thus predict the depth map more accurately. Third, to preserve local detail while keeping the depth map consistent, we design a multi-scale depth map fusion module that combines the consistency of depth maps predicted at low resolutions with the rich local detail of depth maps predicted at high resolutions, improving the overall prediction accuracy of the model. We conducted detailed comparative experiments and ablation experiments on the transmission line scene dataset. The experimental results show that, compared with the baseline method, the model in this paper improves AbsRel and SiLog by at least 0.11 and 5.476, respectively. The algorithm has already been applied in actual scenes and has achieved good prediction results. |
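For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how AbsRel and SiLog are commonly computed in the monocular depth estimation literature. It is not taken from the paper: the function names, the flat-list inputs, the masking of non-positive depths, and the factor of 100 on SiLog (a convention used by some benchmarks) are all assumptions made here for illustration.

```python
import math


def abs_rel(pred, gt):
    """Mean absolute relative error: mean(|pred - gt| / gt).

    Pixels with non-positive ground truth are skipped (assumed invalid).
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def silog(pred, gt, scale=100.0):
    """Scale-invariant logarithmic error.

    With d_i = log(pred_i / gt_i), returns
    scale * sqrt(mean(d_i^2) - mean(d_i)^2).
    The default scale of 100 is an assumed convention, not the paper's.
    """
    diffs = [math.log(p / g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    mean = sum(diffs) / len(diffs)
    mean_sq = sum(d * d for d in diffs) / len(diffs)
    # Clamp tiny negative values caused by floating-point rounding.
    return scale * math.sqrt(max(mean_sq - mean * mean, 0.0))


if __name__ == "__main__":
    # Hypothetical depths in metres, flattened from a depth map.
    gt = [10.0, 20.0, 40.0]
    pred = [11.0, 18.0, 44.0]
    print("AbsRel:", abs_rel(pred, gt))
    print("SiLog:", silog(pred, gt))
```

Both metrics are lower-is-better. SiLog ignores a global scale error (multiplying every prediction by the same constant leaves it unchanged), whereas AbsRel penalizes it, which is why the two are usually reported together.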