| “Internet + medical health” is developing rapidly, and online consultation is becoming increasingly common. Because the text generated by online consultation is large in volume, unlabeled, and wide-ranging in topic, many online medical forums fail to organize their content into modules by category. Topics placed indiscriminately and at random make it harder for Internet users to find relevant content in the field they want to understand quickly and accurately. In addition, many Internet users turn to search engines for self-diagnosis when they experience uncomfortable symptoms, but missing or imprecise keywords in the search can easily lead to a misdiagnosed or overlooked disease. Lung cancer, the No. 1 cancer in China, has an early cure rate of 100%. If lung cancer patients with early symptoms who search for those symptoms are offered search recommendations drawn from a dedicated database of lung cancer symptom terms, omissions in the search content can be prevented; combined with learning about and self-checking for common lung cancer symptoms on online medical communication websites, this can greatly improve patients’ awareness of lung cancer screening and thus raise the detection rate of lung cancer. At the same time, because online medical texts are unlabeled, both the modular placement of lung cancer Q&A content and the construction of a keyword database of lung cancer symptoms require text mining techniques based on clustering algorithms. This paper takes the tumor text of the Chinese medical dialogue dataset as its research object and clusters lung cancer short texts and keywords. Clustering lung cancer-related texts more accurately requires a more accurate measure of text distance, that is, one that reduces the information loss 
of the text in the structuring process. For short-text clustering of lung cancer texts, this paper proposes an auxiliary algorithm to the traditional text representation model; building on the traditional model, it models the word distribution with a Poisson distribution and retains more of the probabilistic information of the text, so that text distance can be measured more accurately. Extensive comparison experiments and word cloud visualizations show that, on the ARI, AMI, and FMI indexes, combining a traditional algorithm with the auxiliary algorithm extracts lung cancer short texts by clustering more effectively than the three traditional algorithms alone: TF-IDF, Word2Vec, and TF-IDF-weighted Word2Vec. For lung cancer word vector clustering, needed to establish a dedicated thesaurus of lung cancer symptoms, this paper combines knowledge graphs with a clustering method based on a co-occurrence distance metric to extract and expand keywords in the lung cancer symptom category from three perspectives: scholars, Internet users, and medical practitioners. A dedicated thesaurus of lung cancer symptom search recommendations was established. A final fit test against the Baidu search index shows that the keywords in the "early symptoms of lung cancer" group can complement one another in the search engine, and that the more words from this group co-occur, the higher the likelihood of obtaining lung cancer screening recommendations. |
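
The abstract compares clusterings on ARI, AMI, and FMI. As a rough illustration of what two of these indexes measure, the following minimal sketch computes the adjusted Rand index and the Fowlkes-Mallows index from pair counts, using only the Python standard library. It is not the authors' code: the function names `pair_counts`, `adjusted_rand_index`, and `fowlkes_mallows_index` and the toy labels are assumptions made for this example, and AMI is left out because it needs an additional expected-mutual-information term.

```python
from collections import Counter
from math import comb, sqrt


def pair_counts(labels_true, labels_pred):
    """Count pairs of texts grouped together under each labeling.

    Returns (pairs together in both, pairs together in the reference classes,
    pairs together in the predicted clusters, total number of pairs).
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    cells = Counter(zip(labels_true, labels_pred))  # contingency table cells
    same_both = sum(comb(c, 2) for c in cells.values())
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    return same_both, same_true, same_pred, comb(len(labels_true), 2)


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index: pair agreement corrected for chance."""
    same_both, same_true, same_pred, total = pair_counts(labels_true, labels_pred)
    if total == 0:
        return 1.0  # fewer than two items: nothing to disagree on
    expected = same_true * same_pred / total
    max_index = (same_true + same_pred) / 2
    if max_index == expected:
        return 1.0  # both partitions are all-singletons or both are one cluster
    return (same_both - expected) / (max_index - expected)


def fowlkes_mallows_index(labels_true, labels_pred):
    """Fowlkes-Mallows index: geometric mean of pairwise precision and recall."""
    same_both, same_true, same_pred, _ = pair_counts(labels_true, labels_pred)
    if same_true == 0 or same_pred == 0:
        return 0.0
    return same_both / sqrt(same_true * same_pred)


if __name__ == "__main__":
    # Hypothetical toy labels: reference classes vs. clusters found for six short texts.
    reference = ["lung", "lung", "lung", "other", "other", "other"]
    clusters = [0, 0, 1, 1, 2, 2]
    print("ARI:", adjusted_rand_index(reference, clusters))
    print("FMI:", fowlkes_mallows_index(reference, clusters))
```

Both indexes depend only on which texts are grouped together, so renaming the cluster labels does not change the score; that is why they can compare an unsupervised clustering against reference categories.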