| The intranet trust mechanism assumes by default that the personnel within an institution who access the network are safe and trustworthy. However, it is common for external personnel to visit an institution and carry out work on its computers, which is one source of insecurity for the network. Intranet users are the main group on the network; their activities are flexible and difficult to predict, and many security incidents are caused by their illegal operations. So far, few constraints limit the behavior of internal users. To identify threats effectively in a large volume of user operation logs, we need the power of big data computation to analyze network behavior, rather than relying on intranet trust alone. At present, the decision-tree-based algorithms available on platforms such as Spark are limited to Decision Tree, Random Forest (RF), and Gradient Boosting Decision Tree (GBDT). The Decision Tree itself is prone to overfitting, so it is not suitable for intranet defense. Although Random Forest can take full advantage of Spark's parallel computing capacity in practice, its algorithmic complexity remains high when rapid model convergence is required. The Gradient Boosting Decision Tree has complete mathematical theory behind it, but the dependency among its training data sets prevents it from fully exploiting the parallelism of distributed computing. Drawing on the ensemble methods related to decision trees and on the idea of the TF-IDF algorithm, this paper proposes the Frequency of Eigen (Eigen Frequency), the Frequency of Forest (Forest Frequency), and the Pseudo Boosting Decision Tree (PBDT) algorithm. In addition, the paper addresses the problem in GBDT that erroneous data can become marginalized as the number of iterations increases. In PBDT, every decision tree is built from the original data set, so it is unnecessary to sample the data set within each iteration, which allows full use of the parallelism of distributed computing. This paper also reports intranet defense experiments with the proposed method on a distributed cluster. By varying the number of iterations and the size of the training data set, a series of experimental results is obtained for the RF and PBDT algorithms. The results indicate that the PBDT algorithm achieves better prediction accuracy on training sets of a certain scale. |