
Research On Form Recognition In Printed Document Recognition System

Posted on: 2014-09-09
Degree: Master
Type: Thesis
Country: China
Candidate: Y Liu
Full Text: PDF
GTID: 2268330425966726
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
A great many documents, or forms, are used in business every day. In many fields, for instance, staff have to handle the forms that clients use to pay taxes, and librarians need to collect data recorded on paper forms. Attempts have been made to ease this workload by imaging standard forms from which the data can be acquired. A traditional approach, used by many businesses, is to apply optical character recognition (OCR) to form handling. OCR has improved the quality of work in these businesses and greatly reduced the time that staff spend handling forms. In most cases where OCR is used, form templates must be acquired first so that the system knows where the target strings are printed. These strings may include text, mathematical formulas, and so on. However, the form lines and tables can be obstacles to data classification, so form detection and removal is an essential task for digital archiving. For these reasons, a practical table-form recognition system is needed.

In this thesis, printed form-document recognition is studied in depth, building on research in printed document recognition, and a form recognition system is partly implemented. A traditional form recognition system is composed of two parts: table-form frame extraction, and data extraction with form redrawing.

In the frame extraction part, the proposed form removal scheme and image classification work on bi-level images. A good binarization step is therefore needed first to convert an image of up to 256 gray levels into a black-and-white image, after which a modified Hough transform is used to correct the skew. Second, layout analysis is used to detect the form area so that data can be extracted from the table more easily. Finally, the lines in the table-form image are located. After frame extraction, the lines that belong to the frame tend to be broken or misaligned. The broken form lines are therefore connected and aligned according to their positions, and the extracted horizontal and vertical lines are then combined into the form frame.

In the data extraction part, because forms are increasingly varied and complex, cell extraction has become a key factor in automatic form recognition. The proposed method produces cell candidates by using intersection features and tracing intersection points to form closed regions. Mathematical morphology is then used to remove the lines, so that all the data can be obtained without the form frame. OCR is then applied to the extracted data, and the recognized data is written into the extracted form frame to complete the form redraw.

In summary, an experimental system for printed form-document recognition is established to validate the above algorithms, such as binarization, tilt detection, layout segmentation, and table recognition. The final results show that these algorithms are effective and generally applicable in analyzing form images.
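
The abstract does not give implementation details, so the following is only a minimal sketch of the line extraction and removal idea on a bi-level image, under stated assumptions. It assumes the image is a list of rows of 0/1 pixels and that form lines are exactly horizontal or vertical after skew correction. Keeping only runs of at least `min_len` pixels is a morphological opening with a straight-line structuring element. All function names and the `min_len` parameter are hypothetical and do not come from the thesis.

```python
def keep_long_runs(row, min_len):
    """Keep only runs of 1s that are at least min_len long (1-D opening)."""
    out = [0] * len(row)
    start = None
    for i, v in enumerate(list(row) + [0]):  # trailing 0 closes a final run
        if v and start is None:
            start = i
        elif not v and start is not None:
            if i - start >= min_len:
                out[start:i] = [1] * (i - start)
            start = None
    return out


def horizontal_lines(img, min_len):
    """Pixels that belong to long horizontal runs."""
    return [keep_long_runs(r, min_len) for r in img]


def vertical_lines(img, min_len):
    """Pixels that belong to long vertical runs (transpose, open, transpose)."""
    opened = [keep_long_runs(c, min_len) for c in zip(*img)]
    return [list(r) for r in zip(*opened)]


def split_frame_and_data(img, min_len):
    """Return (frame, data, crossings) for a bi-level form image.

    frame     : union of the long horizontal and vertical runs
    data      : original pixels with the frame pixels removed
    crossings : (row, col) pixels where a horizontal and a vertical line meet
    """
    h = horizontal_lines(img, min_len)
    v = vertical_lines(img, min_len)
    frame = [[a | b for a, b in zip(hr, vr)] for hr, vr in zip(h, v)]
    data = [[p & (1 - f) for p, f in zip(pr, fr)] for pr, fr in zip(img, frame)]
    crossings = [
        (y, x)
        for y, (hr, vr) in enumerate(zip(h, v))
        for x, (a, b) in enumerate(zip(hr, vr))
        if a and b
    ]
    return frame, data, crossings
```

On a single boxed cell that contains one ink pixel, `split_frame_and_data` returns the box outline as the frame, the ink pixel as the data, and the four corners as crossings. In a fuller system, those crossings would be the starting points for tracing closed cell regions. Strokes that touch a form line would need extra handling that this sketch omits.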
Keywords/Search Tags: Binarization, Form lines, OCR, Form recognition