A visual relation, such as "person holds dog", is an effective semantic unit for image understanding, as well as a bridge connecting computer vision and natural language. Recent work extracts object features from an image with the aid of its textual description. However, very little work combines multi-modal information to model subject-predicate-object relation triplets for deeper scene understanding. In this paper, we propose a novel visual relation extraction model, the Multi-modal Translation Embedding Based Model, which integrates visual information with a corresponding textual knowledge base. To this end, the model places the objects of an image and their semantic relationships in two different low-dimensional spaces, where a relation can be modeled as a simple translation vector connecting the entity descriptions in the knowledge graph. Moreover, we propose a visual phrase learning method that captures the interactions between objects in the image to further improve visual relation extraction. Experiments on two real-world datasets show that the proposed model benefits from incorporating language information into the relation embeddings and provides significant improvement over state-of-the-art methods.

We propose a dynamic computational time model to reduce the average processing time of the recurrent visual attention model (RAM). Rather than attending for a fixed number of steps on each input image, the model learns to decide on the fly when to stop. To achieve this, we add a continue/stop action at each time step of RAM and use reinforcement learning to learn both the attention policy and the stopping policy. The modification is simple but dramatically reduces the average computation time while maintaining the same recognition performance as RAM. Experimental results on the CUB-200-2011 and Stanford Cars datasets demonstrate that the dynamic computational time model works effectively for fine-grained image recognition.
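The translation-embedding idea described in the first abstract above can be made concrete with a TransE-style constraint. The following is a minimal sketch assuming a margin-based formulation; the symbols (the projections, the translation vector, and the margin) and the exact loss are illustrative assumptions, not the paper's definitions.

```latex
% Minimal sketch of a TransE-style translation constraint (an assumed
% form for illustration; the paper's exact objective may differ).
% x_s, x_o : visual features of the subject and object regions
% W_s, W_o : projections into the low-dimensional relation space
% t_p      : learned translation vector for predicate p
% (s',o')  : a corrupted (negative) triplet, gamma a margin
\[
  W_s x_s + t_p \approx W_o x_o
\]
\[
  \mathcal{L} = \sum_{(s,p,o)} \max\Bigl(0,\ \gamma
      + \lVert W_s x_s + t_p - W_o x_o \rVert_2^2
      - \lVert W_s x_{s'} + t_p - W_o x_{o'} \rVert_2^2 \Bigr)
\]
```

Under this reading, language information enters through the entity descriptions in the knowledge graph that the projected visual embeddings are tied to, which is what lets the relation embeddings benefit from the textual side.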
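The continue/stop mechanism described in the second abstract can likewise be sketched as a per-step binary action sampled alongside the location policy of a RAM-style loop. The PyTorch sketch below is an illustrative assumption of how such a policy head might be wired in; the module names, dimensions, and the pre-extracted glimpse features are placeholders rather than the paper's architecture.

```python
# Sketch of a RAM-style recurrent loop with an added continue/stop
# action per time step. All names and sizes here are placeholders,
# not the paper's implementation.
import torch
import torch.nn as nn
from torch.distributions import Bernoulli, Normal

class DynamicRAM(nn.Module):
    def __init__(self, glimpse_dim=128, hidden_dim=256, num_classes=200, max_steps=8):
        super().__init__()
        self.max_steps = max_steps
        self.glimpse_net = nn.Linear(glimpse_dim, hidden_dim)  # stand-in for the glimpse network
        self.core = nn.GRUCell(hidden_dim, hidden_dim)         # recurrent core
        self.loc_head = nn.Linear(hidden_dim, 2)               # mean of the next glimpse location
        self.stop_head = nn.Linear(hidden_dim, 1)              # extra continue/stop action
        self.cls_head = nn.Linear(hidden_dim, num_classes)     # classifier on the final state

    def forward(self, glimpses):
        # glimpses: (batch, max_steps, glimpse_dim) pre-extracted glimpse features (placeholder)
        batch = glimpses.size(0)
        h = glimpses.new_zeros(batch, self.core.hidden_size)
        stopped = glimpses.new_zeros(batch, dtype=torch.bool)
        steps_used = glimpses.new_full((batch,), float(self.max_steps))
        log_probs = []
        for t in range(self.max_steps):
            h_new = self.core(torch.relu(self.glimpse_net(glimpses[:, t])), h)
            # Freeze the state of samples that have already chosen to stop.
            h = torch.where(stopped.unsqueeze(-1), h, h_new)
            # Location policy: where to look next.
            loc_dist = Normal(torch.tanh(self.loc_head(h)), 0.15)
            loc = loc_dist.sample()
            # Stopping policy: the additional continue/stop action per step.
            stop_dist = Bernoulli(logits=self.stop_head(h).squeeze(-1))
            stop = stop_dist.sample()
            log_probs.append(loc_dist.log_prob(loc).sum(-1) + stop_dist.log_prob(stop))
            newly_stopped = (stop > 0.5) & ~stopped
            steps_used = torch.where(newly_stopped,
                                     torch.full_like(steps_used, t + 1.0), steps_used)
            stopped = stopped | (stop > 0.5)
            if bool(stopped.all()):
                break
        logits = self.cls_head(h)
        return logits, torch.stack(log_probs, dim=1), steps_used
```

A REINFORCE-style update on the stacked log-probabilities, with a reward that favors correct recognition and penalizes each additional step, is one plausible way to realize the abstract's claim that the learned stopping policy cuts average computation while preserving accuracy.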