Design And Implementation Of Content Identify Module In Data-Management Platform

Posted on: 2016-10-08
Degree: Master
Type: Thesis
Country: China
Candidate: C L Hu
Full Text: PDF
GTID: 2308330470955568
Subject: Software engineering
Abstract/Summary:
With the rapid development of smart devices, more and more people access the Internet through a variety of devices. The massive volume of user data produced in the process can be of tremendous value if it is used effectively. At the same time, as the cost of distributed clusters falls and distributed algorithms mature, analysing large quantities of data is becoming increasingly convenient and efficient. The project described in this thesis is an application of distributed clusters and distributed algorithms in the advertising industry.
By analysing the huge amounts of data generated when people browse the Internet, the project aims to find the most valuable audience, that is, the users who are likely to buy a product after seeing its advertisement, and thereby to turn untargeted advertising into precisely targeted advertising. The author's work covers the development of the crawler, the design of the architecture and rules of the content identification system, the development and testing of that system, and the analysis of system logs for advertising.
The project is written in Java and Python and runs on Hadoop clusters. Identification rules (domain name, product, application, search keyword, cookie ID, terminal type, User-Agent, token, etc.) are collected by a crawler (Scrapy) and stored in a relational database (MySQL). Using these rules, the project analyses and summarizes the massive user online data stored in a NoSQL database (Hive) and builds models from it, then sends the results to Redis, a high-performance key-value database, so that the data of the relevant audience can be queried at advertising time.
The project aims to provide a decision-making basis for the ad bidding process, so that ads are shown to the audience most likely to make a purchase; in other words, to improve return on investment (ROI).
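The rule-matching step at the heart of content identification can be illustrated with a minimal sketch. It is not the thesis's implementation: the `Rule` schema, the `extract_fields` and `identify` helpers, the `q` query parameter, and all URLs and cookie IDs are assumptions made for illustration. In-memory lists stand in for the MySQL rule table and the Hive log table, and the returned dictionary stands in for the per-user results that would be written to Redis.

```python
from collections import defaultdict
from dataclasses import dataclass
from urllib.parse import urlparse, parse_qs


@dataclass(frozen=True)
class Rule:
    field: str    # log field the rule inspects: "domain" or "keyword"
    pattern: str  # lowercase substring to look for in that field
    label: str    # content category assigned when the rule matches


# Hypothetical rules; the abstract says real rules are crawled and kept in MySQL.
RULES = [
    Rule("domain", "autos.example.com", "cars"),
    Rule("keyword", "suv", "cars"),
    Rule("keyword", "running shoes", "sportswear"),
]


def extract_fields(url: str) -> dict:
    """Extract the domain and the search keyword (assumed parameter 'q') from a URL."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    return {
        "domain": (parsed.hostname or "").lower(),
        "keyword": " ".join(query.get("q", [])).lower(),
    }


def identify(log_records, rules=RULES):
    """Map each cookie ID to the set of labels whose rules match its requests."""
    labels_by_user = defaultdict(set)
    for cookie_id, url in log_records:
        fields = extract_fields(url)
        for rule in rules:
            if rule.pattern in fields[rule.field]:
                labels_by_user[cookie_id].add(rule.label)
    return dict(labels_by_user)


# Hypothetical log records: (cookie ID, requested URL).
LOGS = [
    ("c1", "http://autos.example.com/review/123"),
    ("c2", "http://search.example.com/s?q=running+shoes"),
    ("c2", "http://search.example.com/s?q=best+SUV+2016"),
    ("c3", "http://news.example.com/"),
]

audiences = identify(LOGS)
```

Here `audiences` groups labels by cookie ID, so a bidding component could look up a user's categories with a single key-value read. Users whose requests match no rule, such as `c3`, are simply absent from the result.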
Keywords/Search Tags: Distributed clusters, Distributed algorithm, NoSQL Database, Crawler, Content Identify, Advertising