| Ultrasound examination is efficient, convenient, and non-invasive, and unlike other examinations it offers real-time imaging. As a result, it plays an important role in the diagnosis and early screening of many diseases. Real-time segmentation and lesion recognition of ultrasound video data, using AI’s image analysis and computing capabilities, can help physicians identify lesions that are difficult to recognize with the naked eye and ease the workload of experienced expert physicians. This approach is important in the preliminary diagnosis of diseases and the guidance of biopsy, as well as in remote medical applications. Video object segmentation has been extensively researched. Whether unsupervised or semi-supervised, methods must solve two key problems: how to represent the features of video frames and how to propagate segmentation masks using video continuity. Commonly used image feature extraction networks, such as VGG and ResNet, are often employed to address the former. For mask propagation, relevant studies center on optical flow, instance embeddings, temporal networks, and Transformers. In particular, Transformers have recently been applied to video object segmentation tasks and have demonstrated good results. However, existing methods are primarily tailored to natural video captured by cameras. Applying these methods directly to ultrasound videos is ineffective because of fundamental differences in imaging principles between ultrasound imaging and visible-light imaging. Therefore, this paper presents an end-to-end semi-supervised video object segmentation network for ultrasound videos, whose main contributions are as follows: (1) We propose an end-to-end multi-stage network, EMNet, for ultrasound video object segmentation. The network comprises an image enhancement module, a mask inference module, and a refinement module. Learnable enhancement parameters are
introduced in the image enhancement module, and a learnable image feature enhancement layer is designed to adaptively enhance the contrast between the target and background in ultrasound image frames. The mask inference module uses the Long Short-Term Memory Temporal Transformer (LSTT) module to achieve multi-object segmentation and mask propagation. To address the imprecise initial mask results, a refinement module is proposed, and a gating mechanism is designed to integrate the initial and refined segmentation results. Lastly, an end-to-end learning mechanism is employed to integrate the above modules and achieve optimal segmentation results. (2) We optimized the inference deployment of EMNet by statistically analyzing the distribution of inference time across modules. We found that the LSTT module is the most time-consuming module, and that the long-term attention mechanism in the LSTT module has the highest computational complexity. To address this, we improved the classical attention mechanism in the LSTT module by introducing flow-attention, an attention mechanism based on network flow. By decomposing the attention weight calculation using kernel methods, flow-attention achieves linear complexity through the associative law of matrix multiplication and avoids trivial attention by introducing competition via network flow. Based on this improvement, we designed a new model, EMNet(flow), which improved the model’s segmentation speed. (3) Experimental results on real ultrasound video datasets demonstrate the effectiveness of the proposed method in terms of segmentation accuracy and speed. The proposed method was validated on the lymphoma dataset, in which a single frame frequently contains multiple segmentation objects. These objects are visually similar to objects of no interest, such as blood vessels, and the same segmentation object often varies considerably in shape from one frame to the next, making
segmentation quite challenging. In a comparison with seven benchmark algorithms for video object segmentation, the two proposed modules were found to improve the model’s ultrasound video segmentation accuracy without introducing many additional parameters, thereby preserving segmentation efficiency. Finally, the proposed EMNet(flow), based on the network flow attention mechanism, was compared experimentally with the original EMNet, and the improved model was found to significantly increase segmentation speed. |
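
The linear-complexity step mentioned in contribution (2) can be illustrated with a minimal sketch. It is not the EMNet(flow) implementation: it shows only the kernel decomposition and the reordering of matrix products by associativity, and it omits the network-flow competition that flow-attention adds on top. The sigmoid feature map `phi`, the function names, and the toy inputs are assumptions chosen for illustration.

```python
import math

def matmul(a, b):
    # Plain list-of-lists matrix product.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def phi(m):
    # Assumed non-negative feature map (element-wise sigmoid).
    return [[1.0 / (1.0 + math.exp(-x)) for x in row] for row in m]

def quadratic_attention(q, k, v):
    # Forms the full n x m score matrix phi(Q) phi(K)^T first: O(n * m) memory.
    scores = matmul(phi(q), transpose(phi(k)))
    out = matmul(scores, v)
    return [[o / sum(s) for o in row] for row, s in zip(out, scores)]

def linear_attention(q, k, v):
    # Same result, but computes phi(K)^T V first (a d x d_v matrix),
    # so the cost grows linearly with the number of queries and keys.
    fq, fk = phi(q), phi(k)
    kv = matmul(transpose(fk), v)
    k_sum = [sum(col) for col in zip(*fk)]
    out = matmul(fq, kv)
    norms = [sum(a * b for a, b in zip(row, k_sum)) for row in fq]
    return [[o / z for o in row] for row, z in zip(out, norms)]

# Toy inputs: 3 queries, 4 keys/values, feature size 2, value size 3.
q = [[0.1, -0.2], [0.5, 0.3], [-0.4, 0.9]]
k = [[0.7, 0.1], [-0.3, 0.2], [0.0, -0.6], [0.4, 0.4]]
v = [[1.0, 2.0, 3.0], [0.5, -1.0, 0.0], [2.0, 0.0, 1.0], [-1.0, 1.5, 0.5]]

slow = quadratic_attention(q, k, v)
fast = linear_attention(q, k, v)
```

Both functions return the same values up to floating-point error; only the order of the matrix products differs, which is what removes the quadratic dependence on sequence length.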