| With the rapid development of the Internet, the World Wide Web has become a huge distributed information space that offers users a massive and valuable information resource. However, when search engines are used for information retrieval on the Internet, they return so many results that users often struggle to find consistent, useful information among them. Web information extraction and data fusion are important techniques for addressing this problem. Web information extraction locates and identifies the required information points in heterogeneous texts and converts them into a unified, structured form. Data fusion automatically detects, associates, and combines data from multiple information sources in order to extend the temporal and spatial scope of observation and to increase confidence in the data. This thesis studies commodity information extraction and fusion technology and proposes a commodity information extraction method together with a corresponding data fusion method. The approach combines online extraction of commodity information with a weight-correlation method of data fusion, taking into account the characteristics of web commodity information. It is implemented using the Google Web API, HtmlParser, regular expressions, and weight coefficients. The main content is as follows: 1) The thesis presents a web acquisition technique that integrates the Google Web API into a Java application to search for and collect web pages, and uses regular expressions to find related links within those pages. The collected pages are then stored on the local disk for analysis in the next step. 2) A commodity parameter database is built, as completely as possible, from knowledge of the commodity parameters. Commodity information is then extracted quickly and accurately using the open-source HtmlParser and regular expressions matched against the parameter database. During extraction, only the table blocks and div blocks are parsed, based on the characteristics of the commodity pages' source code, which speeds up identification and analysis. 3) A weight coefficient table is obtained by analyzing the extracted data set, and data fusion is then carried out with this table using the weight coefficient method. Finally, the fused data are saved to the history database, and the system presents users with a fairly complete view of the information. 4) The thesis designs and implements a web-based system for commodity information extraction and data fusion. Tests and analysis on information for several kinds of mobile phones show that the system can extract hundreds of related commodity information items online and then fuse the data, which lays the foundation for developing more specialized and wide-ranging systems in the future. |
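
The extraction and fusion steps described in items 2) and 3) can be illustrated with a minimal sketch. It is written in Python for brevity, whereas the thesis describes a Java implementation built on HtmlParser, and it works on page text that is assumed to be already stripped of HTML. The parameter patterns, source names, and weight values are hypothetical. The abstract does not state how the weight coefficients are applied, so the sketch assumes one plausible reading: each source votes for the value it reports, weighted by its coefficient, and the value with the highest total wins.

```python
import re
from collections import defaultdict

# Hypothetical parameter database: parameter name -> regex with one capture group.
PARAMETER_PATTERNS = {
    "screen_size": re.compile(r"Screen\s*size\s*:\s*([\d.]+)\s*inch", re.IGNORECASE),
    "weight": re.compile(r"Weight\s*:\s*(\d+)\s*g\b", re.IGNORECASE),
}

# Hypothetical weight coefficient table: source name -> trust weight.
SOURCE_WEIGHTS = {
    "shop-a.example": 0.40,
    "shop-b.example": 0.35,
    "shop-c.example": 0.25,
}


def extract_parameters(text):
    """Return {parameter: value} for every pattern that matches the text."""
    found = {}
    for name, pattern in PARAMETER_PATTERNS.items():
        match = pattern.search(text)
        if match:
            found[name] = match.group(1)
    return found


def fuse(records):
    """Fuse (source, parameters) records by weighted voting (an assumption).

    For each parameter, the weights of all sources reporting the same value
    are summed, and the value with the largest total is kept.
    """
    scores = defaultdict(lambda: defaultdict(float))
    for source, params in records:
        weight = SOURCE_WEIGHTS.get(source, 0.0)  # unknown sources get no vote
        for name, value in params.items():
            scores[name][value] += weight
    return {
        name: max(candidates, key=candidates.get)
        for name, candidates in scores.items()
    }


# Made-up page text from three sources describing the same phone.
pages = [
    ("shop-a.example", "Screen size: 2.4 inch Weight: 95 g"),
    ("shop-b.example", "Screen size: 2.6 inch, Weight: 95g"),
    ("shop-c.example", "screen size: 2.6 inch"),
]
records = [(source, extract_parameters(text)) for source, text in pages]
fused = fuse(records)
print(fused)
```

In this example two lower-weighted sources agree on a screen size of 2.6 and together outweigh the single higher-weighted source reporting 2.4, so the fused record keeps 2.6. A real weight coefficient table would be derived from analysis of the extracted data set, as item 3) describes.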