
Study On Unsupervised Keyword Extraction From Scholarly Text: Integrating Structural And Semantic Information

Posted on: 2024-05-26
Degree: Master
Type: Thesis
Country: China
Candidate: Y G Tu
GTID: 2555306920456304
Subject: English Language and Literature
Abstract/Summary:
With the advent of the information age, a powerful "data deluge" is sweeping across the globe, including the academic field. The rapid proliferation of online scientific documents across a wide range of research fields has made the management and utilization of academic documents increasingly complex. This makes keyword extraction technology extremely valuable and in high demand. Automatic Keyword Extraction (AKE) is an important task in information retrieval and natural language processing that extracts the most important and representative keywords from large amounts of text data. It can help researchers quickly understand the topic and content of documents, support the classification, search, and recommendation of the literature, and provide a reference for research directions.

Existing keyword extraction methods can be divided into supervised and unsupervised branches. Unsupervised methods do not require large manually annotated corpora as training data, and thus have the advantages of wide applicability, insensitivity to sample bias, and strong scalability. However, their performance is often inferior to that of supervised methods, and they suffer from problems such as grammatically broken phrases, reliance on frequency, overly general content, and semantic redundancy. To address these issues, this thesis proposes a keyword extraction algorithm, Structural and Semantic Rank (SSRank), that combines structural and semantic information in scholarly text based on four essential characteristics of keywords: termhood, distribution, informativeness, and diversity.

Structurally, the algorithm first constructs a sophisticated noun phrase detection framework based on POS tagging and dependency parsing to preserve the grammatical integrity of candidate phrases. It then models the distribution of words in an undirected, edge-weighted textual graph and iteratively computes their importance scores using word co-occurrence relationships and positional information. Finally, a nonlinear length formula inspired by Zipf's law is applied to pool the component word scores of each phrase into a phrase importance score.

Semantically, SSRank uses semantic distance to quantify the diversity of candidate phrases, grouping them into different clusters to cover more topics. Specifically, four different distance measures are constructed: overlap distance, edit distance, pseudo one-hot encoding (POE) cosine distance, and average word embedding distance. Clustering is performed using HAC or K-Means. A method called keyword chaining is then used for candidate selection: it sorts the semantic groups according to within-group average distance and outer topical centrality, and then chains the top-ranking phrases from the leading groups to form the final set of keywords.

To demonstrate the effectiveness of SSRank in keyword extraction, the thesis conducted an extensive set of comparison studies on five datasets of different scales and disciplines, and found that SSRank outperformed all the benchmarks. The results of ablation studies demonstrate the effectiveness of its critical components: (1) the noun phrase detector is superior to existing frameworks, achieving an accuracy close to or above 90% on multiple datasets; (2) semantic clustering improves the diversity of extracted keywords and increases the F-score of the model by 12.62%; (3) overall, compared to the original TextRank, the modified AKE algorithm SSRank significantly improves performance. To further verify the time efficiency and effectiveness of SSRank on large-scale documents, a total of 25,173 keywords were extracted from the ACL corpus containing 4,814 abstracts, and the term co-occurrence network and annual keyword trends were displayed in charts, reflecting the practical value of SSRank.
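
The graph-based word-scoring step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the exact formulas, so the window size, the damping factor, the reciprocal-position prior, and all function names (`cooccurrence_graph`, `position_bias`, `rank_words`) are assumptions chosen for illustration. The sketch runs a TextRank-style iteration on an undirected, edge-weighted co-occurrence graph and biases it toward words that first appear early in the text. Noun phrase detection, the Zipf-inspired length pooling, and semantic clustering are deliberately left out.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=3):
    """Undirected graph whose edge weight counts co-occurrences of two
    distinct words inside a sliding window (window size is an assumption)."""
    graph = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            v = tokens[j]
            if v != w:
                graph[w][v] += 1.0
                graph[v][w] += 1.0
    return graph


def position_bias(tokens):
    """Assumed positional prior: reciprocal of the first-occurrence index,
    normalised so that the priors sum to 1."""
    raw = {}
    for i, w in enumerate(tokens):
        raw.setdefault(w, 1.0 / (i + 1))
    total = sum(raw.values())
    return {w: s / total for w, s in raw.items()}


def rank_words(tokens, window=3, damping=0.85, iterations=50):
    """TextRank-style iteration with a position-biased restart vector."""
    graph = cooccurrence_graph(tokens, window)
    bias = position_bias(tokens)
    # Total edge weight attached to each word, used to normalise votes.
    strength = {w: sum(nbrs.values()) for w, nbrs in graph.items()}
    scores = dict(bias)
    for _ in range(iterations):
        new_scores = {}
        for w in bias:
            incoming = sum(
                scores[v] * weight / strength[v]
                for v, weight in graph[w].items()
            )
            new_scores[w] = (1 - damping) * bias[w] + damping * incoming
        scores = new_scores
    return scores


if __name__ == "__main__":
    tokens = "keyword extraction graph keyword ranking graph keyword".split()
    ranked = sorted(rank_words(tokens).items(), key=lambda kv: -kv[1])
    for word, score in ranked:
        print(f"{word}\t{score:.4f}")
```

In a fuller pipeline, the word scores returned by `rank_words` would be pooled over each candidate noun phrase before clustering and selection; the pooling formula is not reproduced here because the abstract does not state it.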
Keywords/Search Tags: extraction, PageRank, syntactic parsing, semantic clustering