
Dynamic Circulation Corpus (DCC) Based Automatic Unlisted Term Extraction In The Field Of Information Technology

Posted on: 2004-02-04
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Q J Wang
GTID: 1115360092490033
Subject: Linguistics and Applied Linguistics
Abstract/Summary:
This research investigates the automatic extraction of unlisted terms in the field of Information Technology, based on the large-scale DCC (Dynamic Circulation Corpus) and on the theory of Dynamic Updating of Language and Knowledge. It proposes the concept of the Concatenation Index to decide whether a character string is a word or phrase. It presents a method named "Concatenation Index + TFIDF + Domains Subtracting" for extracting unlisted terms. The IT domain was chosen as the experimental object in order to outline a preliminary research workflow based on the theory of Dynamic Updating of Language and Knowledge.

This research introduces the framework of Dynamic Updating of Language and Knowledge and suggests a schema for improving the Dynamic Circulation Corpus (DCC). The schema makes it possible to enlarge the DCC in both content and structure while remaining compatible with the existing system.

Terms have three basic characteristics: they usually appear in only one or a few specialized domains; they are phrases with a high degree of circulation in their own domain; and their circulation in other domains is near 0. Unlisted terms are terms and therefore share these three characteristics. On this basis, the approach of this research is to determine the likely distribution of unlisted terms in the corpus by examining the listed terms in the corpus, and to set the best threshold for extracting unlisted terms by analyzing the extraction results under different thresholds.

Unlisted terms are usually unlisted words, so extracting unlisted terms faces the same difficulty as identifying unlisted words. Furthermore, in a corpus processed with traditional word segmentation, extracting unlisted terms is as difficult as identifying unlisted words.
Therefore, this research adopts the traversing word segmentation method to preprocess the corpus.

Two indices indicate whether a character string can be a term in a given context: unithood and termhood. This research suggests using the Concatenation Index to measure the unithood of a character string. The experiments show that the Concatenation Index gives better results in determining whether a character string is a complete, integrated word or phrase.

This research also presents a method named "Concatenation Index + TFIDF + Domains Subtracting" for extracting unlisted terms. The Concatenation Index decides the unithood of a character string, and "TFIDF + Domains Subtracting" decides its termhood. The method was tested on the DCC. The results show that the methods and techniques adopted in this research are highly effective in processing the corpus and in extracting unlisted terms: with less human intervention, more unlisted terms are extracted. As a result, the method partly achieves the intended objective of word segmentation.

The dissertation also discusses two processing modes for extracting unlisted terms, the "text-index-statistics mode" and the "text-database mode", together with their strengths and weaknesses.
It further argues that, for language monitoring within the Dynamic Updating of Language and Knowledge, the "text-database mode" is the better method.

In other words, the main innovations of this research can be summed up as follows:
(1) It proposes the concept of the Concatenation Index;
(2) It applies the Concatenation Index to measuring the unithood of a character string;
(3) It presents a method named "Concatenation Index + TFIDF + Domains Subtracting" for extracting unlisted terms.

This research outlines a preliminary research workflow based on the theory of Dynamic Updating of Language and Knowledge. It can serve as a prototype and as a valuable reference for extracting unlisted terms in other domains, and for building and updating the DCC.
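
As a rough illustration only, the termhood side of the pipeline ("TFIDF + Domains Subtracting" applied to exhaustively enumerated character strings) might be sketched as below. The abstract gives no formulas, so everything here is an assumption rather than the author's implementation: the function names `candidates` and `score_terms`, the `max_len` and `other_max` parameters, treating each domain as one "document" for the IDF part, and reading "Domains Subtracting" as discarding strings that also occur outside the target domain. The Concatenation Index is not defined in the abstract, so the unithood filter is left out entirely.

```python
import math
from collections import Counter


def candidates(text, max_len=4):
    """Enumerate every substring of 2..max_len characters (exhaustive traversal)."""
    return [
        text[i:i + n]
        for n in range(2, max_len + 1)
        for i in range(len(text) - n + 1)
    ]


def score_terms(domains, target, max_len=4, other_max=0):
    """Rank candidate strings of the `target` domain.

    domains: dict mapping a domain name to a list of raw text strings.
    other_max: highest total count tolerated outside the target domain
               (hypothetical threshold; 0 means "never seen elsewhere").
    """
    counts = {
        name: Counter(c for doc in docs for c in candidates(doc, max_len))
        for name, docs in domains.items()
    }
    n_domains = len(domains)
    scores = {}
    for cand, tf in counts[target].items():
        # Domain subtraction (assumed reading): drop strings that also circulate elsewhere.
        elsewhere = sum(counts[d][cand] for d in domains if d != target)
        if elsewhere > other_max:
            continue
        # TF-IDF with each domain treated as one "document" (an assumption).
        df = sum(1 for d in domains if counts[d][cand] > 0)
        scores[cand] = tf * math.log(n_domains / df)
    return scores


# Toy data: two tiny "domains" made of arbitrary character strings.
toy = {"it": ["abcabc", "abcx"], "news": ["bcd"]}
ranked = sorted(score_terms(toy, "it", max_len=3).items(), key=lambda kv: -kv[1])
```

In this sketch, raising `other_max` loosens the "circulation near 0 in other domains" condition, which is the kind of threshold the abstract says was tuned by comparing extraction results.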
Keywords/Search Tags:Dynamic Updating of Language and Knowledge, DCC (Dynamic Circulation Corpus), Automatic Term Extraction, Unlisted Term, Concatenation Index, TFIDF, Domains Subtracting