| With the success of artificial intelligence and deep learning in image classification, the research community has turned to harnessing deep learning for tasks that were previously considered challenging or even impossible. One area where deep learning has achieved considerable success is the detection of text in natural images. With more data and better computing resources available, much progress has been made in applying deep learning to scene text detection and recognition, with several state-of-the-art results that sometimes even surpass human ability. Driven by a wide range of commercial applications, progress continues on detecting text in increasingly challenging scenarios. However, despite this success, most methods and datasets address only one aspect of the properties that appear in scene images. Most datasets available for scene text detection contain text in a single language, mainly in Latin script. Limited progress has been made in creating and curating datasets that can be used for multilingual training of deep learning models. Because diverse-language datasets are scarce, research has focused mainly on the challenges that appear in the images of the existing datasets. This dissertation attempts to broaden text detection to images that contain text in different languages and to create a deep learning model that localizes these instances. First, a dataset of 1200 images is proposed that comprises scene text instances mainly in four languages, i.e., English, Arabic, Persian and Urdu. The dataset also contains a small subset of images with text instances in Hebrew and Pashtu. To incorporate diversity, imitate natural conditions and introduce an element of challenge, the images are collected from a wide area with large variations, such as horizontal, vertical, long and focused, short and obscure, curved, and irregular text instances. An end-to-end convolutional neural network is designed that comprises three parts: features are extracted using ResNet-50; these features are then progressively merged to improve the network's ability to recall all text instances; and a prediction head then classifies image regions into text/non-text areas and predicts the head and tail of each text instance along with bounding-box offsets based on the head and tail parts. The experiments on the dataset show the efficiency of a simple and fast model for multilingual text detection, with precision, recall and h-mean of 0.90, 0.65 and 0.76, respectively, on the ICDAR MLT dataset. |
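
The h-mean reported above is the harmonic mean of precision and recall. As a minimal sketch of how the three figures relate, assuming the standard harmonic-mean definition (the function name `h_mean` is illustrative and not taken from the dissertation):

```python
def h_mean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed standard definition)."""
    if precision + recall == 0:
        # Avoid division by zero when both scores are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded figures quoted in the abstract.
score = h_mean(0.90, 0.65)
print(f"h-mean from rounded precision and recall: {score:.3f}")
```

Computed from the rounded precision and recall, the value comes out close to the reported 0.76. A small difference is expected because the inputs here are already rounded to two decimals.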