Research And Implementation About Chinese Site Searching On PHP

Posted on:2010-05-05

Degree:Master

Type:Thesis

Country:China

Candidate:Y Zhang

Full Text:PDF

GTID:2178360278462574

Subject:IT project management

Abstract/Summary:

"Site searching" is required by most websites that want to let users mine site data easily and efficiently. Chinese searching differs from English searching because a Chinese sentence is a sequence of words with no spaces between them. Two factors are key to the performance of a PHP web application: CPU utilization and memory utilization. However, there is not yet a pure PHP solution with good performance in this field. Solutions built as Apache extensions or PHP extensions perform better, but they suit only a small group of PHP environments.

Overall, the Chinese site searching integration on PHP is built on a lightweight site-search engine and an efficient Chinese tokenization engine based on a pre-indexed dictionary. An indexer generates term frequency and inverse document frequency from the contents of the site. The indexed words and their rated weights are stored in a database. A searcher calculates the relativity ratio (relevance) of the contents based on several factors, including the rated term weight. A renderer returns ordered results to the user with keywords highlighted. A Chinese word splitter functions as the core of the site search engine.

This thesis focuses on three important areas of Chinese site searching technology on PHP:

(1) The thesis builds a lightweight and efficient Chinese searching framework on PHP by using the same word splitter in both the indexer and the searcher. It also develops a fast word tokenization algorithm with accuracy greater than 90% that consumes far fewer server resources. The framework tolerates errors in Chinese tokenization well, which keeps the PHP application lightweight and usable. It offers guidance for the design and development of performance-oriented PHP web applications.

(2) The thesis provides a model for calculating the relativity ratio of the contents based on a multi-factor rated weight. The traditional "relativity ratio" is derived from a term's frequency both in a document and across the indexed database. For website contents, a word's weight can be raised by the HTML tag around it. Attributes of the content, such as page view count and comment count, can also indicate the importance of the document. It is therefore possible to tune the traditional model with HTML tag analysis and document status factors. This notably improves the ordering of the search results.

(3) In implementing Chinese site searching technology, the quality of the word splitter matters most. To avoid the performance loss caused by a massive dictionary, the approach searches a large-capacity dictionary on disk. The dictionary is pre-indexed and organized as a B-tree, and it contains more than 530,000 words to support both Simplified and Traditional Chinese. Disk searching ensures accuracy while keeping the application lightweight, so the approach has practical value for the PHP domain.

This thesis provides an effective method for analyzing and solving the key problems of Chinese searching, site indexing and Chinese tokenization on PHP. With the fast development of PHP websites and the growth of Chinese information, the method offers guidance for searching technology on PHP. As an advanced model with broad adaptability, the analysis and research in this thesis also provide a meaningful exploration of Chinese site searching.
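The abstract does not name the tokenization algorithm it uses. As an illustration of dictionary-based word splitting in general, the following is a minimal sketch of forward maximum matching, a common technique that is assumed here and not confirmed by the thesis. The function name `tokenize`, the `max_len` parameter and the tiny in-memory `DICTIONARY` are all hypothetical; the thesis describes a disk-based B-tree dictionary of more than 530,000 words, which a Python set only stands in for. Python is used for brevity, although the thesis targets PHP.

```python
def tokenize(text, dictionary, max_len=4):
    """Split text by forward maximum matching against a word dictionary.

    At each position the longest dictionary word is taken. A character
    that starts no known word is emitted on its own, so unknown input
    never stops the splitter (a simple form of error tolerance).
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary standing in for the pre-indexed one.
DICTIONARY = {"中文", "站内", "搜索", "站内搜索", "技术"}

# The same function would be called by both the indexer and the searcher,
# so that query terms and indexed terms are split identically.
print(tokenize("中文站内搜索技术", DICTIONARY))
```

Using one splitter on both sides matters because a query term can only match an indexed term if both were segmented the same way.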
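The multi-factor idea in point (2) can be sketched as a TF-IDF score whose term frequency is weighted by the enclosing HTML tag and whose total is scaled by document status. The abstract gives no formula, so everything below is an assumption made for illustration: the tag weights in `TAG_WEIGHTS`, the logarithmic `status_boost` and its coefficients, the document layout, and all function names are hypothetical and are not the thesis's model.

```python
import math

# Hypothetical tag weights: a term in a title counts more than one in body text.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "b": 1.5, "p": 1.0}


def term_weight(occurrence_tags, doc_count, docs_with_term):
    """Tag-weighted term frequency multiplied by inverse document frequency."""
    tf = sum(TAG_WEIGHTS.get(tag, 1.0) for tag in occurrence_tags)
    idf = math.log(doc_count / docs_with_term)
    return tf * idf


def status_boost(page_views, comments):
    """Assumed damped boost from document status factors (always >= 1)."""
    return 1.0 + 0.05 * math.log1p(page_views) + 0.1 * math.log1p(comments)


def relevance(query_terms, doc, doc_count, doc_freq):
    """Sum the weights of the query terms in doc, then apply the status boost."""
    base = sum(
        term_weight(doc["terms"].get(term, []), doc_count, doc_freq[term])
        for term in query_terms
        if term in doc_freq
    )
    return base * status_boost(doc["page_views"], doc["comments"])


# Two hypothetical documents that each contain the query term once.
doc_a = {"terms": {"搜索": ["title"]}, "page_views": 0, "comments": 0}
doc_b = {"terms": {"搜索": ["p"]}, "page_views": 0, "comments": 0}
doc_freq = {"搜索": 2}  # number of indexed documents containing each term

ranked = sorted(
    [("a", doc_a), ("b", doc_b)],
    key=lambda item: relevance(["搜索"], item[1], 10, doc_freq),
    reverse=True,
)
print([name for name, _ in ranked])
```

In this sketch the tag weight changes the per-term score, while page views and comments scale the whole document, so a document that does not contain the query terms still scores zero however popular it is.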

Keywords/Search Tags:

PHP, Site Searching, Chinese Tokenization, Dictionary, Relativity Ratio, B-tree

Related items

1	Establishing English-Mongolian-Chinese Electronic Dictionary Based On Tree Structure And Researching The Encryption Algorithm
2	Breadth-first Minimum Spanning Tree And Hownet Lexical Semantic Similarity-based Heuristic P2p Search Technology And Realization
3	The Research And Implementation On The Medical Site In-site Search Engine Basing On Double Tokenizers
4	A Design And Realization Of A Chinese Dictionary Based On B-Tree And Berkeley DB
5	A Design And Realization Of A Chinese Dictionary Based On B-tree And Berkeley Db
6	Network Resources Searching Study Based On Content's Directory Tree
7	The Design And Implementation Of Multilingual Mongolian-Chinese-English Dictionary Resource Management Platform
8	Data Structure Design And Game-Tree Search Algorithm Research For Chinese Chess Computer Game
9	Design And Implementation Of Tokenization System For Payment
10	Research On Some Key Technologies In Web Site Summarization