
User Behavior Analysis Methods Based on Network Logs

Posted on: 2017-03-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z Q Li
Full Text: PDF
GTID: 1318330566956053
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, network information and applications have been greatly enriched. Everyday activities such as work, study, entertainment, and shopping are now closely tied to the Internet. The network has become an integral part of people's lives and forms a "digital space" alongside the real world; the two intersect in the individual and interact with each other. User behavior analysis over the Internet can therefore guide network space planning and web content management, and it can also lay a solid foundation for discovering the emotions, preferences, and Internet dependence of diverse groups. This dissertation conducts its research in the setting of a campus network and focuses on large-scale network logs. A variety of data mining techniques, such as topic crawling, topic modeling, text classification, clustering, and distributed computing, are used to analyze web access logs in order to discover the online preferences, hot topics, and online duration of students in different grades. The contributions of this dissertation can be summarized as follows:

1. The completeness of the training data has a great impact on a text classifier: the larger the dataset, the better the classifier. However, because a campus network serves students of many different majors, the data contains general domains (e.g., sports, entertainment) as well as content specific to each field of expertise. To achieve better classification performance, it is therefore necessary to enrich the original dataset, which consists of the crawled pages that students accessed, with the professional content of each major. This dissertation proposes an improved Shark Search algorithm based on ontology, called Ontology-VSM. The approach introduces a semantic domain ontology model to overcome a weakness of the vector space model, which neglects the correlation between features, and thereby enhances the similarity engine. Furthermore, the DOM tree structure is used to cluster web links when relevance is assigned. Experimental results show that the model has a clear advantage over the original one.

2. Location within the campus network is strongly related to a student's grade, major, and gender. In addition, location information and IP addresses have obvious regional characteristics and a mapping relationship. Introducing the IP address into a topic model can therefore reflect the user behavior of different locations. This dissertation extends the classic LDA topic model by introducing a "topic-location distribution" and proposes a 4-layer topic model called Area-LDA. Area-LDA mines latent topics from time-sliced logs and intuitively reflects how grade, major, and gender are associated with preferences and hot topics. Experiments on a real-world dataset show differences between genders and between grades.

3. Feature selection is a key step in automatic classification and has a great influence on final precision. This dissertation improves the Naïve Bayes and AdaBoost models respectively: (1) it proposes TF-D(t)-CHI, which uses the chi-square statistic (χ²) instead of IDF to reflect the relation between a feature and a label, and introduces variance to reflect the distribution of features within categories, in order to enhance the performance of traditional Naïve Bayes; (2) it proposes LDA-AdaBoost, which uses an LDA topic model instead of the bag-of-words (BOW) model to reduce the feature dimension. Both models show clear superiority on various evaluation metrics.

4. Discovering students with Internet addiction is an important part of college management. However, no previous work offers an effective method for estimating a user's online duration. This dissertation proposes a novel algorithm that estimates online duration by density clustering over web log data in the temporal dimension. Moreover, the Spark distributed computing framework is adopted to speed up the computation so that the algorithm can be applied to large-scale data. The experimental results show that the algorithm can discover the diversity and distribution of online duration effectively and efficiently.
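To make the idea in contribution 4 concrete, the following is a minimal sketch of estimating online duration by grouping one user's log timestamps along the time axis. It is not the dissertation's algorithm: the function name `estimate_online_duration`, the parameters `eps` and `min_points`, and their default values are assumptions chosen for illustration. The grouping is a simplified one-dimensional density clustering, in which consecutive requests no more than `eps` seconds apart are chained into one session and sparse groups are treated as noise.

```python
from typing import List, Sequence, Tuple


def estimate_online_duration(
    timestamps: Sequence[float],
    eps: float = 300.0,      # assumed neighbourhood radius in seconds
    min_points: int = 3,     # assumed minimum requests per dense session
) -> Tuple[float, List[Tuple[float, float]]]:
    """Return (total online seconds, list of (start, end) sessions).

    Hypothetical helper: chains sorted timestamps whose consecutive gap
    is at most `eps`, and keeps only chains with >= `min_points` requests.
    """
    if not timestamps:
        return 0.0, []

    ordered = sorted(timestamps)
    sessions: List[Tuple[float, float]] = []
    start = prev = ordered[0]
    count = 1

    for t in ordered[1:]:
        if t - prev <= eps:
            # Still inside the same dense region on the time axis.
            count += 1
        else:
            # Gap too large: close the current group, keep it if dense enough.
            if count >= min_points:
                sessions.append((start, prev))
            start = t
            count = 1
        prev = t

    # Close the final group.
    if count >= min_points:
        sessions.append((start, prev))

    total = sum(end - begin for begin, end in sessions)
    return total, sessions


# Example: two dense bursts of requests and one isolated request (noise).
log_times = [0, 60, 120, 5000, 10000, 10100, 10200, 10250]
total_seconds, found_sessions = estimate_online_duration(log_times)
```

Because each user's timestamps are processed independently, a per-user function of this kind could in principle be applied in parallel after grouping the log by user, which is the sort of workload a framework such as Spark is designed for; how the dissertation actually distributes the computation is not stated in the abstract.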
Keywords/Search Tags:network-log analysis, topic crawler, topic modeling, text classification, cluster analysis, distributed computing