Analysis On The Influencing Factors Of Citation Frequency And Downloads Of Scientific Papers

Posted on: 2020-10-29
Degree: Master
Type: Thesis
Country: China
Candidate: H Y Shi
Full Text: PDF
GTID: 2370330572980319
Subject: Applied statistics
Abstract/Summary:
Scientific papers are one of the important tangible forms of scientific research achievement, as well as an important reference for evaluating the scientific and technological level, academic level and research ability of individuals and countries. The citation rate of a country's papers indicates how well its research is accepted by other countries or institutions, and the number of downloads reflects how attractive the papers are. Foreign scholars have studied English-language papers in biology, mathematics and other natural sciences. Many factors affect citation frequency, but few scholars have studied Chinese papers. Therefore, data on Chinese papers published in twenty journals from 2007 to 2016 in biology, mathematics, physics and resource science are obtained from CNKI. This thesis studies the factors influencing citation frequency and downloads, establishes the best-fitting models for citation frequency and downloads, and identifies highly cited (high-frequency cited) papers. The work has value for evaluating the quality of papers and helps in assessing the importance of research.

First, the thesis explores how citation frequency and downloads are distributed across different attributes of Chinese papers, and uses the Pearson correlation coefficient test, the Kruskal-Wallis test, the Nemenyi test and the Wilcoxon test to test the correlation between each attribute and citation frequency. Second, it fits models of citation frequency and downloads. Because citation frequency is zero-inflated, the Poisson regression model, the negative binomial regression model, the zero-inflated Poisson regression model and the zero-inflated negative binomial regression model are fitted for citation frequency. Only traditional statistical models are fitted for downloads. The likelihood ratio test and the AIC and BIC criteria are used to select the most suitable models for citation frequency and downloads, and the effects of paper attributes on citation frequency and downloads are discussed based on the optimal models. Finally, the logistic regression model, classification tree, support vector machine and k-nearest neighbor model are used to identify highly cited papers. Because highly cited papers make up a very small proportion of the sample, the data are unbalanced. The thesis therefore uses the SMOTE algorithm to balance the data and classifies both the unprocessed data and the balanced data. It compares the classification results before and after data balancing, identifies highly cited papers in biology and physics, and evaluates the four classifiers by accuracy, recall and AUC.

The results show that the optimal model for citation frequency is the zero-inflated negative binomial regression model, and the optimal model for downloads is the negative binomial regression model. According to the models, the factors that have a significant influence on the citation of papers are: downloads, paper length, title length, publication year, journal grade, subject category, number of abstract words, number of keywords, and whether the paper was completed in cooperation. The factors influencing citation frequency are: downloads, title length, publication year, journal grade, subject category, number of abstract words and number of keywords. The factors that have a significant influence on downloads are: citation frequency, paper length, title length, publication year, journal grade, subject category, number of abstract words and cooperation. Classification after balancing the data with the SMOTE algorithm is better than classification without balancing: although accuracy decreases, it remains high, and recall and AUC improve significantly. Highly cited papers differ between biology and physics. Comparing the logistic regression model, classification tree, support vector machine and k-nearest neighbor model, the classification tree identifies highly cited biology papers best, and the support vector machine identifies highly cited physics papers best.
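
The model-comparison step described in the abstract (fitting ordinary and zero-inflated count models, then ranking them by AIC and BIC) can be illustrated with a small sketch. This is not the thesis's code or data. It assumes an intercept-only Poisson and zero-inflated Poisson model rather than the negative binomial regressions with covariates that the thesis reports. The toy counts and the helper names `poisson_pmf`, `zip_pmf`, `fit_zip`, `aic` and `bic` are invented for illustration.

```python
import math

def poisson_pmf(k, lam):
    """P(Y = k) for a Poisson count with mean lam (computed in log space)."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def zip_pmf(k, pi, lam):
    """Zero-inflated Poisson: with probability pi the count is a structural
    zero, otherwise it is drawn from Poisson(lam)."""
    base = (1.0 - pi) * poisson_pmf(k, lam)
    return pi + base if k == 0 else base

def fit_zip(counts, iterations=200):
    """Estimate (pi, lam) by a simple EM iteration (intercept-only model)."""
    n = len(counts)
    n_zero = sum(1 for k in counts if k == 0)
    total = sum(counts)
    pi, lam = 0.5, max(total / n, 1e-8)
    for _ in range(iterations):
        # E-step: probability that an observed zero is a structural zero
        w = pi / (pi + (1.0 - pi) * math.exp(-lam))
        # M-step: update the mixing weight and the Poisson mean
        pi = n_zero * w / n
        lam = total / (n - n_zero * w)
    return pi, lam

def log_likelihood(counts, pmf, *params):
    return sum(math.log(pmf(k, *params)) for k in counts)

def aic(loglik, n_params):
    return 2 * n_params - 2 * loglik

def bic(loglik, n_params, n_obs):
    return n_params * math.log(n_obs) - 2 * loglik

# Toy citation counts with many zeros (hypothetical, not from the thesis).
counts = [0] * 60 + [1] * 10 + [2] * 12 + [3] * 9 + [4] * 5 + [5] * 3 + [6]
n_obs = len(counts)

# Poisson: the maximum-likelihood estimate of the mean is the sample mean.
lam_pois = sum(counts) / n_obs
ll_pois = log_likelihood(counts, poisson_pmf, lam_pois)

# Zero-inflated Poisson: two parameters (pi, lam).
pi_zip, lam_zip = fit_zip(counts)
ll_zip = log_likelihood(counts, zip_pmf, pi_zip, lam_zip)

aic_pois, bic_pois = aic(ll_pois, 1), bic(ll_pois, 1, n_obs)
aic_zip, bic_zip = aic(ll_zip, 2), bic(ll_zip, 2, n_obs)

print(f"Poisson  AIC={aic_pois:.1f}  BIC={bic_pois:.1f}")
print(f"ZIP      AIC={aic_zip:.1f}  BIC={bic_zip:.1f}")
```

Lower AIC or BIC indicates the preferred model. On zero-heavy counts like these, the zero-inflated model is expected to win despite its extra parameter, which is the same logic the abstract applies when choosing among its four candidate models.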
Keywords/Search Tags: Citation, Zero-inflated, Unbalanced data, Identify high-frequency cited papers