Research On Affective Visual Question Answering

Posted on: 2020-01-29 | Degree: Doctor | Type: Dissertation | Institution: University | Candidate: NELSON RUWA | GTID: 1368330596996753 | Subject: Computer Science and Technology

Abstract/Summary:

Visual question answering (VQA) has recently attracted the attention of many researchers in machine learning. Different attention models have been proposed in VQA to address the need to focus on local regions of an image. This research was motivated by the observation that, despite the recent popularity of VQA, existing work leaves out essential affective details of images and videos during feature extraction, and the generated answers carry no affective information. The dissertation therefore aims to fill this gap by preparing the relevant data, analyzing them, and generating answers that show more natural image understanding. Specifically, we focus on affective VQA for images with a single emotion, images with multiple emotions, and videos. The work looks at generic images and videos, but the mechanisms used can be directly applied in fields such as education, guidance for the blind, health, and others. The main contributions are as follows:

(1) First, we propose a mood-aware image question answering (MAIQA) method. Improved image feature extraction and attention techniques are important for improving VQA performance. The MAIQA architecture works as follows: visual features from the convolutional neural network (CNN) image feature extractor, the mood token from the mood detector, and the textual question are fed into a common long short-term memory (LSTM) module. A novel CNN attention algorithm enables the model to focus on relevant parts of the image according to the provided question and mood. The softmax classifier accepts the weighted sum from the attention mechanism as an input
to a multi-layer perceptron that generates a mood-based answer. We investigated how much the injection of a mood into the attention mechanism and the use of a new CNN attention algorithm affect VQA performance. When the number of views and the kernel length are optimized, the CNN attention operation is more effective than the traditional LSTM and gated recurrent unit (GRU) attention operations. The experimental results show that MAIQA outperforms the previous state-of-the-art baselines in several instances. The additional attention on the mood not only improves classification accuracy but also contributes substantially to the analysis and comprehension of image features.

(2) Second, a multi-mood image question answering (MMIQA) method is proposed. Using only a single mood from the image is inadequate, since an image can have multiple moods that may affect VQA performance. Feature embedding is an important aspect of VQA that can also influence performance. The MMIQA architecture operates as follows: the CNN multi-mood detector recognizes all distinct mood attributes from each region of the image and loads the attributes into a textual string. After the CNN question and mood embedding, the features of the image, question, and moods are jointly attended to by an LSTM triple attention mechanism. Our modified Hadamard product fuses the three kinds of features to enable the generation of a fully affective answer. As an improvement on the previous method, the multiple moods apply not only to persons but also to the other objects and the general environment appearing in the image. Of the three feature embedding techniques used, CNN yielded the best classification accuracy, followed by LSTM and lastly GRU. The experimental results show that high accuracy levels can be achieved together with a multi-mood answer. Our model outperforms recent VQA and mood analysis baseline models. Better image understanding can also be
achieved by analyzing multiple mood attributes from the different regions of an image.

(3) Third, we propose a multi-mood video question answering (MMVQA) method. When the concept of affective question answering is applied to videos, an improvement in performance is expected, just as in affective image question answering. MMVQA is a multi-task learning architecture that operates as follows: a pre-trained CNN mood detector recognizes moods on the frames of a video, and a string of the mood labels is relayed to the attention mechanism, together with the visual features from the CNN feature extractor and the text question. Along the video QA route, there are three complementary attention models: token-based attention, frame-based attention, and integrated attention. Along the affective route, the captioning module uses the string of mood labels to generate an emotion caption, which the text QA module uses to prepare an affective answer. During ensembling, a conventional answer is generated from the processes along the video QA route, and an affective answer is generated from the processes in both routes. Training the mood detector on challenging spontaneous videos from the wild improved mood detection during testing and therefore contributed significantly to the overall classification accuracy. Our video mood detector achieved 62% classification accuracy, outperforming other models tested on the same data. Ablation studies showed that integrated attention performed best, followed by frame-based attention and lastly token-based attention. Our method not only makes VQA more analytic by generating a fully affective answer but also registers a quantitative improvement in performance compared with previous baselines. We also found that injecting moods into the attention mechanism boosts the performance of the model, but too many moods in the same video gradually reduce accuracy.
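
The attention and fusion steps described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the scoring function of the attention mechanisms or the details of the modified Hadamard product, so the code below is an assumption throughout: it scores image regions with a plain dot product against a question-mood query, takes a softmax-weighted sum of the regions, and fuses the three feature vectors with the ordinary element-wise (Hadamard) product. The function names `softmax`, `hadamard`, `attend`, and `fuse` are hypothetical and do not come from the dissertation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hadamard(*vectors):
    """Element-wise product of equally sized vectors."""
    length = len(vectors[0])
    if any(len(v) != length for v in vectors):
        raise ValueError("all vectors must have the same length")
    fused = [1.0] * length
    for vec in vectors:
        fused = [f * x for f, x in zip(fused, vec)]
    return fused


def attend(regions, question, mood):
    """Weight image regions by relevance to the question and mood.

    Assumption: the query is the Hadamard product of the question and
    mood vectors, and each region is scored by a dot product with it.
    """
    query = hadamard(question, mood)
    scores = [sum(r * q for r, q in zip(region, query)) for region in regions]
    weights = softmax(scores)
    dim = len(regions[0])
    # Weighted sum of region features (the attended image vector).
    context = [
        sum(w * region[i] for w, region in zip(weights, regions))
        for i in range(dim)
    ]
    return weights, context


def fuse(regions, question, mood):
    """Attend over regions, then fuse image, question and mood features."""
    weights, context = attend(regions, question, mood)
    return weights, hadamard(context, question, mood)


if __name__ == "__main__":
    # Toy two-dimensional features for two image regions.
    regions = [[1.0, 0.0], [0.0, 1.0]]
    question = [1.0, 0.0]
    mood = [2.0, 1.0]
    weights, fused = fuse(regions, question, mood)
    print("attention weights:", weights)
    print("fused features:", fused)
```

In a full model, the fused vector would be passed to a classifier such as the multi-layer perceptron mentioned in contribution (1); the toy vectors here stand in for learned CNN and LSTM embeddings.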

Keywords/Search Tags: visual question answering, mood detection, convolutional neural network, long short-term memory, video question answering, video captioning, multi-task learning