| With the spread of personal computers and the rapid development of the Internet, the number of Internet users and websites is growing quickly, and so is the amount of information on the Internet. Handling this volume of information is a challenge. Traditional information retrieval methods based on string matching can no longer meet current needs, and semantic-based information processing has emerged in response. In natural language processing, intelligent retrieval, text clustering, and related fields, semantic similarity calculation is a fundamental problem. There are two main ways to calculate word similarity. The first relies on knowledge structures built by linguists, such as a semantic dictionary or semantic network, and is called the subjective method. The second relies on a large-scale corpus and is called the objective method. The knowledge-based method requires linguists to define the information associated with each word and then calculates similarity from the characteristics of that information. The corpus-based method uses statistical methods to calculate similarity. This thesis studies algorithms based on "HowNet" and on a large-scale corpus for calculating word similarity, and proposes an improved word semantic similarity algorithm that combines the objective and subjective methods. During calculation, the algorithm eliminates interference factors so that the result conforms to both the subjective concept and the objective semantic environment. Text is one of the most important information carriers on the Internet, and text similarity calculation is the basis of text classification and text clustering. This thesis proposes a dual-level text similarity algorithm. A text is divided into two levels, title information and content information, and the text similarity accordingly consists of two parts.
In the calculation process, this thesis uses the improved word semantic similarity algorithm that combines the objective and subjective methods, so that the result conforms to both the subjective concept and the objective semantic environment. This thesis has built an experimental platform. Comparison and analysis of the experimental results show that the proposed algorithm improves the semantic similarity results for both words and texts. |
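
The two combinations described in the abstract can be illustrated with a minimal sketch. The abstract does not state how the scores are combined, so the linear weighting below is an assumption, and the function names and the weights `beta` and `alpha` are hypothetical. The sketch only shows the general shape of a combined word similarity and a dual-level text similarity; it is not the thesis's actual algorithm.

```python
def combined_word_similarity(subjective: float, objective: float,
                             beta: float = 0.5) -> float:
    """Blend a knowledge-based (subjective) score with a corpus-based
    (objective) score. Both inputs are assumed to lie in [0, 1].
    The linear blend and the weight beta are illustrative assumptions."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must be in [0, 1]")
    return beta * subjective + (1.0 - beta) * objective


def dual_level_text_similarity(title_sim: float, content_sim: float,
                               alpha: float = 0.3) -> float:
    """Combine title-level and content-level similarity into one score.
    The weight alpha given to the title level is an illustrative assumption."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * title_sim + (1.0 - alpha) * content_sim


# Hypothetical scores for one word pair and one document pair.
word_score = combined_word_similarity(subjective=0.8, objective=0.4)
text_score = dual_level_text_similarity(title_sim=1.0, content_sim=0.0)
```

In this sketch the title-level and content-level scores would themselves be built from word-level scores, which is where the combined word similarity would be used.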