| With the rapid development of web technology, the volume of information on the internet is growing explosively. Finding useful patterns and interesting information efficiently in such huge data sets has therefore become an urgent problem, and web data mining technology emerged against this background. This paper studies text classification within web data mining, with an emphasis on Chinese text classification, covering feature weighting, feature selection based on community identification, and similarity measurement based on the graph space model. The main contributions are as follows. First, by analyzing the basic theory of the Gini index and of feature weighting, we present a novel weighting formula based on the Gini index. The results show that the new formula improves the performance of text classification. Second, by studying the concept of community structure in complex networks, we propose a new feature selection algorithm based on community identification, which overcomes a weakness of traditional feature selection methods, namely that they ignore semantic context. The experimental results show that the proposed algorithm is efficient. Third, by studying the graph space model, we propose an improved text similarity measure based on the analysis of structural equivalence, which overcomes the inability of the vector space model to express the structural information of text efficiently. The results show that the new measure is efficient and feasible for text classification. |
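
To make the idea of Gini-index-based feature weighting concrete, the following is a minimal sketch only. The abstract does not state the paper's novel formula, so this example assumes a common textbook purity form, Gini(t) = Σ_c P(c | t)², estimated from document frequencies, and multiplies it by raw term frequency. The function names `gini_purity` and `weight_document` are hypothetical and chosen for illustration.

```python
from collections import Counter, defaultdict


def gini_purity(docs, labels):
    """Return {term: sum_c P(c | term)^2}, estimated from document frequencies.

    Assumption: this is a generic Gini-style purity score, not the
    weighting formula proposed in the paper.
    """
    term_class_df = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):  # count each term once per document
            term_class_df[term][label] += 1

    scores = {}
    for term, per_class in term_class_df.items():
        total = sum(per_class.values())
        scores[term] = sum((n / total) ** 2 for n in per_class.values())
    return scores


def weight_document(tokens, scores):
    """Weight each term as (raw term frequency) * (Gini purity score)."""
    tf = Counter(tokens)
    return {term: count * scores.get(term, 0.0) for term, count in tf.items()}
```

Under this assumed form, a term that occurs in documents of only one class scores 1.0, while a term spread evenly over k classes scores 1/k, so class-specific terms receive larger weights.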