Humans use rich natural language to describe and communicate visual content such as videos and images. In this thesis we employ a two-step approach to automatically generate natural language descriptions for videos. In the first step, a rich semantic representation of the visual content, including e.g. activities and objects, is predicted. In the second step, we approach the generation from the predicted semantic representation as a statistical machine translation problem: the semantic representation is treated as the source language and the natural language description as the target language. We learn the translation model from a parallel corpus, namely TACoS [1], which consists of video snippets, low-level annotations, and corresponding natural language descriptions. We also apply word lattice decoding to deal with the uncertainty in the predicted semantic representations. Both automatic evaluation, i.e. BLEU, and human judgments show that our approach improves significantly over several baseline systems inspired by related work. Our translation approach also shows improvements over related work on an image description task.