Research On Web Text Mining System Based On XML And SVM

Posted on: 2015-07-20
Degree: Master
Type: Thesis
Country: China
Candidate: D Tang
Full Text: PDF
GTID: 2308330473451650
Subject: Software engineering
Abstract/Summary:
The size of data sets being collected and analyzed in industry for business intelligence is growing rapidly, making traditional warehousing solutions prohibitively expensive. Hadoop is a popular open-source MapReduce implementation that is used at companies such as RackSpace, Yahoo, and Facebook to store and process extremely large data sets on commodity hardware. However, the MapReduce programming model is very low level and requires developers to write custom programs that are hard to maintain and reuse.

The dissertation studies Web text mining in detail, following the stages of the Web text mining process, and constructs a Web text mining model based on Extensible Markup Language (XML) and the support vector machine (SVM). It proposes structuring the information in Web pages with XML, expressing it in a form that can be processed, extracting the information useful for text mining, reducing the amount of data, and forming a text feature database. Text preprocessing influences the quality and efficiency of Web text mining; Web text preprocessing is therefore very important for Web text mining and needs detailed and integrated research. The dissertation also constructs a Web text mining model. The model, based on XML and SVM, provides both Web text preprocessing and Web text mining. Its advantage is that it reduces the amount of data step by step: by focusing on authority pages, by applying XML techniques, by performing feature selection to obtain a term set, and by reducing the dimensionality of high-dimensional data with the support vector machine, which refines the data that text mining needs to process.

Zoot supports queries expressed in a SQL-like declarative language, HiveQL, which are compiled into MapReduce jobs that are executed using Hadoop. In addition, HiveQL enables users to plug custom MapReduce scripts into queries. The language includes a type system with support for tables containing primitive types, collections such as arrays and maps, and nested compositions of the same.
The underlying IO libraries can be extended to query data in custom formats. Zoot also includes a system catalog, the Metastore, that contains schemas and statistics, which are useful in data exploration, query optimization, and query compilation.
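
The preprocessing stage described in the abstract (pages structured as XML, useful text extracted, data reduced through feature selection, a text feature database formed) might look roughly like the following Python sketch. It is only an illustration under stated assumptions: the `<page>`/`<title>`/`<body>` schema, the sample pages, the stop-word list, and the function names are all hypothetical, and document-frequency thresholding stands in for the feature selection method, which the abstract does not name. The resulting term-count vectors are the kind of input an SVM classifier would then be trained on; the SVM step itself is not shown.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical Web pages, already converted to a simple XML form.
# The <page>/<title>/<body> schema is an assumption, not the thesis's format.
PAGES = [
    "<page><title>Stock markets rally</title>"
    "<body>Markets rise as stock prices climb</body></page>",
    "<page><title>Football final</title>"
    "<body>The team wins the football final</body></page>",
    "<page><title>Stock prices fall</title>"
    "<body>Markets drop as stock prices fall</body></page>",
]

# Hypothetical stop-word list, kept tiny for the example.
STOP_WORDS = {"as", "the"}


def extract_tokens(xml_string):
    """Collect all text in an XML page and split it into lowercase word tokens."""
    root = ET.fromstring(xml_string)
    text = " ".join(root.itertext())
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def select_terms(token_lists, min_df=2):
    """Keep terms that occur in at least min_df documents (document-frequency
    thresholding, used here as a stand-in feature selection method)."""
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each term once per document
    return sorted(term for term, df in doc_freq.items() if df >= min_df)


def vectorize(tokens, terms):
    """Turn one document into a term-count vector over the selected terms."""
    counts = Counter(tokens)
    return [counts[term] for term in terms]


token_lists = [extract_tokens(page) for page in PAGES]
terms = select_terms(token_lists)
vectors = [vectorize(tokens, terms) for tokens in token_lists]

print(terms)
print(vectors)
```

Each step shrinks the data: the XML parse discards markup, the stop-word filter and document-frequency threshold discard uninformative or rare terms, and the final vectors have one dimension per retained term rather than one per distinct word.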
Keywords/Search Tags: Web text mining, Web text preprocessing, XML, Zoot