Reconnaissance de la structure d'un document extrait de sa représentation graphique [Recognition of the structure of a document extracted from its graphical representation] (French text)

Posted on: 2003-03-18
Degree: Ph.D.
Type: Thesis
University: Ecole Polytechnique, Montreal (Canada)
Candidate: Poirier, Benoit
GTID: 2468390011981803
Subject: Computer Science
Abstract/Summary:
The proliferation of electronic document formats impedes the dissemination and proper management of documents. Even if an organization is willing to move all its document-producing activities to a structure-preserving format such as the HyperText Markup Language (HTML), its existing documents still need to be converted, because proper document indexing requires a common format that carries structural information. While some formats are easy to decode and preserve the document structure, often the only available representation is PostScript, where only the geometrical information remains. This thesis addresses the difficult problem of extracting the structure of a document from a geometrical representation. An interactive tool that extracts document content and structure from a geometric representation (PostScript) has been developed. It uses an incremental approach in which the user can interactively modify the information at every phase of the heuristic conversion. It successfully analyzes several documents produced with different tools and outputs the structural information in HTML. The tool is easily extended to recognize new constructs, and is aimed at organizations that need to convert numerous documents for searching and browsing on intranets or on the Internet.
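
To give a concrete sense of what recovering structure from geometry can involve, the following is a minimal sketch and not the thesis's own method, which the abstract does not detail. It assumes the PostScript has already been interpreted into positioned text fragments; the `Fragment` type, the function names, and the thresholds (`tolerance`, `para_gap`) are all hypothetical. Fragments on the same baseline are joined into lines, lines set in a font larger than the most common size are treated as headings, and closely spaced body lines are merged into paragraphs before being written out as HTML.

```python
from collections import Counter
from dataclasses import dataclass
from html import escape


@dataclass
class Fragment:
    """Hypothetical positioned text run, as an interpreter might report it."""
    x: float      # left edge, in points
    y: float      # baseline, in points (PostScript origin is bottom-left)
    size: float   # font size, in points
    text: str


def group_lines(fragments, tolerance=1.0):
    """Group fragments whose baselines differ by at most `tolerance`."""
    lines = []
    # Larger y is higher on the page, so sort descending to read top-down.
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0].y - frag.y) <= tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    for line in lines:
        line.sort(key=lambda f: f.x)  # left-to-right within a line
    return lines


def to_html(fragments, para_gap=1.5):
    """Guess headings and paragraphs from geometry alone and emit HTML."""
    if not fragments:
        return ""
    # Assumption: the most frequent font size is the body text size.
    body_size = Counter(f.size for f in fragments).most_common(1)[0][0]
    blocks = []
    for line in group_lines(fragments):
        text = " ".join(f.text for f in line)
        tag = "h1" if max(f.size for f in line) > body_size else "p"
        y = line[0].y
        prev = blocks[-1] if blocks else None
        # Merge body lines separated by no more than para_gap line heights.
        if (tag == "p" and prev is not None and prev["tag"] == "p"
                and prev["y"] - y <= para_gap * body_size):
            prev["text"] += " " + text
            prev["y"] = y
        else:
            blocks.append({"tag": tag, "text": text, "y": y})
    return "\n".join(
        f"<{b['tag']}>{escape(b['text'])}</{b['tag']}>" for b in blocks
    )


# Invented sample data, for illustration only.
SAMPLE = [
    Fragment(72, 700, 18, "Introduction"),
    Fragment(72, 670, 10, "Documents arrive"),
    Fragment(160, 670, 10, "in many formats."),
    Fragment(72, 658, 10, "Some keep structure."),
    Fragment(72, 630, 10, "Others keep only geometry."),
]

if __name__ == "__main__":
    print(to_html(SAMPLE))
```

Fixed thresholds like these are fragile across documents produced by different tools, which is one plausible reason for an interactive, incremental design in which a person can correct each phase's guesses before the next phase runs.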
Keywords/Search Tags:Document, Structure, Representation