With the rapid development of Internet technology and big data, recommender systems, which pick out the information a user is interested in from large and disordered data, have become a key tool for addressing information overload. Such a system typically works in three steps: features are constructed from the historical record of user behavior and used to build a model; the model predicts the user's future interests; and the corresponding items are recommended to the user. Recommender systems still face many difficulties and challenges, for example how to choose a recommendation algorithm that improves prediction accuracy on the positive (minority) samples, and how to predict more accurately which items users are interested in. To address these problems, researchers continue to seek more efficient recommendation algorithms.

The research in this paper addresses the problem that public transport data contain little explicit information about the relationship between passengers and lines but a large amount of implicit relationship information. Whether a passenger will travel can be treated as a binary classification problem, in which the passengers who will travel in the future are far fewer than those who will not. The paper therefore focuses on how to build effective feature engineering and how to handle the class imbalance so as to improve classification accuracy on the minority class. To this end, the paper proposes an algorithm that fuses cost-sensitive learning with stochastic gradient boosting, so as to predict more accurately whether passengers will travel.

First, the author builds features describing passengers' riding habits from a public transport data set from Guangzhou. Feature engineering proceeds along three directions.
These are the passenger, the line, and the passenger-line interaction, each described from several aspects such as time, weather, and frequency.

Second, the author optimizes the new features using the feature-importance evaluation of random forests: the feature variables are sorted in descending order of importance, the unimportant features are deleted, and a new feature set is obtained. This process is repeated to find the feature set with which the model achieves its highest accuracy.

Finally, a cost-sensitive learning algorithm based on stochastic gradient boosting is proposed, which adapts better to imbalanced classification data sets and is used to predict whether passengers will travel in the future. Because the underlying stochastic boosting algorithm resists overfitting, generalizes well, and can model nonlinear relationships, the improved algorithm is well suited to the features constructed above. The improved algorithm is first trained on public imbalanced data sets and compared with other classification algorithms, using AUC as the evaluation metric. It is then applied to a data set recorded from Guangzhou passenger cards: suitable parameters are selected through several experiments, the data set is used to train both stochastic gradient boosting and its cost-sensitive variant, and the two algorithms are compared using AUC and F1.
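
The comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes scikit-learn's `GradientBoostingClassifier` with `subsample < 1` as the stochastic gradient boosting baseline, expresses the misclassification cost as a per-sample weight on the positive class, and uses synthetic data in place of the Guangzhou card records. The helper names `fit_sgb` and `evaluate` and the choice of cost ratio are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the imbalanced travel data: roughly 10% positives.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=6,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0,
)


def fit_sgb(X, y, positive_cost=1.0):
    """Fit stochastic gradient boosting (hypothetical helper).

    positive_cost > 1 weights errors on positive samples more heavily,
    which is one simple way to make the learner cost sensitive.
    """
    model = GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, max_depth=3,
        subsample=0.8,  # subsample < 1 gives the stochastic variant
        random_state=0,
    )
    sample_weight = np.where(y == 1, positive_cost, 1.0)
    model.fit(X, y, sample_weight=sample_weight)
    return model


def evaluate(model, X, y):
    """Return AUC (from predicted probabilities) and F1 (from predicted labels)."""
    scores = model.predict_proba(X)[:, 1]
    return {
        "auc": roc_auc_score(y, scores),
        "f1": f1_score(y, model.predict(X)),
    }


# Assumed cost: the negative-to-positive ratio in the training set.
cost = (y_train == 0).sum() / (y_train == 1).sum()

baseline = fit_sgb(X_train, y_train)
cost_sensitive = fit_sgb(X_train, y_train, positive_cost=cost)

results = {
    "baseline": evaluate(baseline, X_test, y_test),
    "cost_sensitive": evaluate(cost_sensitive, X_test, y_test),
}
print(results)
```

Weighting samples is only one way to encode misclassification costs; the cost could instead enter the loss function directly, and the sketch makes no claim about which form the paper uses.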