Study On Document Image Layout Analysis Technology

Posted on: 2012-03-12
Degree: Master
Type: Thesis
Country: China
Candidate: S Shi
Full Text: PDF
GTID: 2178330335490671
Subject: Computer Science and Technology
Abstract/Summary:
Document image layout analysis is an important component of document information processing systems and an essential part of OCR for complex documents. It is a key step in digitizing paper documents, and it is widely used in automatic document retrieval, office automation, and other fields. However, because document layouts are diverse in type and complex in structure, current layout analysis technology still has limitations, so research on layout analysis is of great significance and practical value. Layout analysis consists of layout segmentation and region recognition.

To address the poor adaptability of traditional top-down methods to complex layouts, this paper proposes a method based on segmentation line extraction. First, the algorithm extracts initial segmentation lines whose length and width exceed given thresholds; an adaptive threshold method is proposed to overcome the inflexibility of fixed thresholds. Next, the initial segmentation lines are grouped into line clusters, and a hierarchical clustering algorithm is used to capture each cluster's complex shape and the direction of its main axis. The main axes are then extracted from the simplified line clusters as the final segmentation lines, using a specific strategy. Based on a relational model of the crossing points formed by the segmentation lines, a closed polygon search algorithm segments the document layout into regions. Finally, the regions are filtered and merged to improve the segmentation results.

To address the inefficiency of existing methods, which examine all of an object's attributes at the same level, this paper proposes a hierarchical attribute-based recognition algorithm. First, the probability distributions of the attributes of each object type are obtained from sample statistics, and the concept of attribute distinction ability, which expresses how well an attribute distinguishes between objects, is introduced. Then an object attribute table is built; at each step of the recognition process, the attribute with the greatest distinction ability is extracted and the probabilities of the object types are calculated. The algorithm thus performs hierarchical recognition, extracting attributes and estimating the object type step by step until the type is confirmed. In this paper, a 3×5 object attribute table is constructed for the layout regions obtained from layout segmentation, and region recognition is achieved using the algorithm.

Comprehensive experimental analysis shows that the proposed layout analysis method adapts well to different layout types and shooting conditions and achieves high segmentation and recognition rates.
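
The hierarchical recognition loop described in the abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the thesis: the region types (`text`, `image`, `table`), the two attributes and their probabilities, the confirmation threshold `confirm_at`, and in particular the definition of `distinction_ability`, which is taken here to be a probability-weighted total variation distance between object types. The thesis's own 3×5 table and its definition of distinction ability are not given in the abstract.

```python
from itertools import combinations

# Hypothetical table of P(attribute value | region type).
# In the thesis these distributions come from sample statistics;
# the numbers below are invented for illustration only.
ATTRIBUTE_TABLE = {
    "black_pixel_density": {
        "text":  {"low": 0.7, "high": 0.3},
        "image": {"low": 0.1, "high": 0.9},
        "table": {"low": 0.8, "high": 0.2},
    },
    "has_ruling_lines": {
        "text":  {"yes": 0.05, "no": 0.95},
        "image": {"yes": 0.10, "no": 0.90},
        "table": {"yes": 0.90, "no": 0.10},
    },
}


def distinction_ability(attr, probs):
    """Assumed measure: how strongly `attr` separates the still-plausible types.

    Sums the total variation distance between every pair of types,
    weighted by the current probability of both types.
    """
    dists = ATTRIBUTE_TABLE[attr]
    score = 0.0
    for a, b in combinations(probs, 2):
        values = set(dists[a]) | set(dists[b])
        tv = 0.5 * sum(abs(dists[a].get(v, 0.0) - dists[b].get(v, 0.0))
                       for v in values)
        score += probs[a] * probs[b] * tv
    return score


def recognize(observe, confirm_at=0.9):
    """Extract one attribute at a time until a region type is confirmed.

    `observe(attr)` returns the measured value of `attr` for the region.
    Returns (best type, type probabilities, attributes actually extracted).
    """
    types = list(next(iter(ATTRIBUTE_TABLE.values())))
    probs = {t: 1.0 / len(types) for t in types}  # uniform starting estimate
    remaining = set(ATTRIBUTE_TABLE)
    used = []
    while remaining and max(probs.values()) < confirm_at:
        # Pick the attribute with the greatest distinction ability right now.
        attr = max(sorted(remaining),
                   key=lambda a: distinction_ability(a, probs))
        remaining.remove(attr)
        used.append(attr)
        value = observe(attr)
        # Bayes update of the type probabilities with the observed value.
        unnorm = {t: probs[t] * ATTRIBUTE_TABLE[attr][t].get(value, 0.0)
                  for t in types}
        total = sum(unnorm.values())
        if total == 0.0:
            continue  # value never seen in the samples; keep the old estimate
        probs = {t: p / total for t, p in unnorm.items()}
    best = max(probs, key=probs.get)
    return best, probs, used


if __name__ == "__main__":
    region = {"black_pixel_density": "low", "has_ruling_lines": "yes"}
    print(recognize(region.__getitem__))
```

The point of the sketch is the control flow: attributes are extracted lazily, most discriminative first, so a region whose type is confirmed early never pays for the remaining attribute measurements. With a lower `confirm_at`, the loop stops after fewer attributes.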
Keywords/Search Tags:document image, layout analysis, layout segmentation, region recognition