| With the rapid development of the Internet, people all over the world can now communicate with one another through the Web. As a means of spreading information, the network has been enormously successful: over roughly a decade of development, the Web has become a primary information source and a huge global information warehouse. At present, most Web content is written in HTML, which is typically non-standardized knowledge; it cannot meet the varied requirements of the Internet, and it cannot describe the data itself. To meet these requirements, information from web sources needs to be accessible in a structured way. XML and its various extensions are a step in this direction. Unfortunately, the web is not yet a well-organized repository of nicely structured XML documents but rather a conglomerate of volatile HTML pages, from which structure has to be extracted. The markup language XML is a subset of SGML and a kind of meta-language, and it can make up for many of HTML's shortcomings. As the Semantic Web develops, future web pages will use XML, which is well structured, but the present stage is a transition period, so we need a method for converting HTML to XML in order to make better use of network resources. Much of the data in the Web, this ocean of information, is unstructured, which makes searching hard and traditional database querying impossible. Faced with the large volume of results that a search engine typically returns, a user who wants accurate information is often at a loss. Users therefore hope for a specialized information extraction facility that can provide accurate information sources simply and directly. Information Extraction (IE) is a practical text-processing technique aimed at a concrete task. 
Unlike complex natural language understanding techniques, IE usually relies on simple document analysis to extract information on a particular topic that the designer cares about, for example advertisement messages, news, natural-language database search, and domain-specific advertisements. To address this situation, we propose an ontology-based domain resource management platform whose key component is the resource collection module. Applying ontology and DOM template techniques to IE, we propose an ontology-based IE system for non-standardized Web knowledge (HTML). To extract information from web pages, the system combines XML storage, DOM templates, HTML-to-XML conversion, a bot that crawls web pages, Lucene indexing, and ontology techniques, and it converts non-standardized HTML information into standardized XML information according to the requirements of the domain ontology. To reduce the workload, this work adopts existing, mature techniques and tools wherever possible; the main effort lies in using Java, ontology, and DOM template techniques, and in using XSLT to convert HTML documents into XML documents. To extract information from an HTML page, we design an HTML-XML wrapper and apply a recognizer that organizes the extracted constants as attribute values of tuples in a generated database schema. Finally, we save the XML document to an Oracle database, completing the conversion from HTML to XML. |
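
The XSLT-based HTML-to-XML conversion step mentioned in the abstract can be illustrated with a minimal Java sketch. This is not the system described here, only an illustration under stated assumptions: the input page is assumed to have already been cleaned into well-formed XHTML without a namespace, and the class name `HtmlToXmlSketch`, the stylesheet, and the `books`/`book`/`title`/`price` element names are hypothetical stand-ins for whatever a domain ontology would actually prescribe. Only the standard JAXP `javax.xml.transform` API is used.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class HtmlToXmlSketch {

    // Hypothetical stylesheet: turns every table row into a <book> record.
    // The target element names are assumptions made for illustration only.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' "
      + "xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<books><xsl:for-each select='//tr'>"
      + "<book><title><xsl:value-of select='td[1]'/></title>"
      + "<price><xsl:value-of select='td[2]'/></price></book>"
      + "</xsl:for-each></books>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to a well-formed XHTML string and returns the XML.
    public static String convert(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xhtml)),
                new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Input that is not well-formed ends up here.
            throw new IllegalStateException("XSLT conversion failed", e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample page, already well-formed.
        String page = "<html><body><table>"
            + "<tr><td>XML Basics</td><td>20</td></tr>"
            + "<tr><td>Ontology Primer</td><td>35</td></tr>"
            + "</table></body></html>";
        System.out.println(convert(page));
    }
}
```

Real-world HTML is rarely well-formed, so a practical pipeline would need a tidying step before this transformation; that step is deliberately left out of the sketch.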