| With the rapid development of the Internet, search engines have become an important tool for obtaining all kinds of information. In recent years, however, general-purpose search engines such as Baidu and Google have struggled to return accurate results because of their wide search coverage. Topic search engines for specific domains can help users filter out large amounts of irrelevant information so that they can quickly and accurately obtain the information they need. To help financial practitioners obtain financial text data accurately and efficiently from a large number of web pages, this article focuses on the financial field and studies fast and effective web crawler technology. It proposes a method for extracting keywords from web pages assisted by a knowledge graph. To achieve an efficient topic crawler, relevant pages are selected by combining the link structure under certain rules with a semantic similarity calculation between key phrases and topics. The main contents and methods of the study are as follows: (1) To address the problem of topic description in topic crawler technology, this article proposes constructing a financial knowledge graph to describe topics. The BERT-BiLSTM-CRF model is used to extract named entities and relationships from finance-related texts, and knowledge fusion is performed on heterogeneous data to resolve inconsistent and missing entity attribute values. Finally, Neo4j is used for persistent storage of the triple data, completing the construction of a financial knowledge graph named FinGraph. (2) To address the problem of crawling strategy in topic web crawler technology, a key phrase extraction algorithm based on the knowledge graph is proposed. A semantics-based affinity propagation (AP) clustering algorithm is applied to the text. The words in each cluster are linked to entities in the financial knowledge graph, the potential relationships between words are mined through the semantic network structure, edge weights are assigned to quantify these relationships, and a word relationship graph is constructed. A key phrase extraction framework is then built by integrating the AP clustering algorithm with a graph centrality algorithm, with the aim of identifying pages related to financial topics and reducing interference from irrelevant information, so that the results returned by the topic crawler are highly accurate. (3) Combining the two contributions above, this paper designs a hybrid topic web crawler that determines topic relevance from both the web page text and the link structure. FinGraph is used to extract key phrases from the web page text, the semantic similarity between the extracted key phrases and the topics is calculated, and the link structure is considered at the same time to filter for the more relevant pages. Finally, FinGraph is further supplemented with the crawled web page text. |
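
The hybrid strategy in (3) can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes a bag-of-words cosine as a stand-in for the semantic similarity measure, a hypothetical weight `alpha` that blends the content score with a link-structure score inherited from the parent page, and a hypothetical `Frontier` class that visits the highest-scoring URLs first.

```python
import heapq
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors (stand-in measure)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def relevance(key_phrases, topic_terms, parent_score, alpha=0.7):
    """Blend content similarity with a link score inherited from the parent page.

    alpha is a hypothetical mixing weight; the abstract does not specify one.
    """
    content = cosine(Counter(key_phrases), Counter(topic_terms))
    return alpha * content + (1 - alpha) * parent_score


class Frontier:
    """Hypothetical crawl frontier: pops the most relevant unseen URL first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, score):
        # Skip URLs that were already queued once.
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so store the negated score.
            heapq.heappush(self._heap, (-score, url))

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score
```

In such a design, each fetched page would be scored with `relevance`, and its outgoing links pushed onto the frontier with that score as their `parent_score`, so that links found on on-topic pages are crawled before links found on off-topic ones.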