With the rapid development of Web applications, the security of Web sites has become a matter of common concern. The primary task in improving a website's security is to find the vulnerabilities that exist on it, so as to prevent criminals from exploiting them to attack the site, causing information leakage and property loss. Full-site scanning is the front-line task of Web vulnerability scanning: it reveals the website's directory structure, interfaces, and related information, from which the versions and types of the site's operating system, middleware, and database can be determined, and it helps uncover security issues such as vulnerabilities and architectural defects. A web crawler is a program that automatically crawls web page data according to specified needs; crawler technology can be used to collect all links on a website and thereby scan the whole site.

This article designs a web crawler based on Chrome Headless for Web vulnerability scanning. It uses Chrome Headless and Puppeteer, which provide an interface for controlling the crawler, to crawl the entire target site and obtain as many subsite URLs (uniform resource locators) as possible. Accurate deduplication and search algorithms allow the crawling task to be completed efficiently while limiting resource consumption. External links in the target site are detected and blocked, preventing the crawler from wandering into the public network or external domains and avoiding unauthorized-access disputes and bandwidth waste. For JavaScript sites, the crawler is event-driven: it uses Chrome Headless to execute all of a page's JavaScript, and during event simulation it analyzes and crawls pop-up links, redirect links, and fully loaded pages, avoiding many missed-crawl cases.

The main work of this article is as follows. 1. Study the Chrome Headless and Puppeteer interfaces for controlling the crawler to crawl URLs, simulate JavaScript execution, and simulate normal user login actions. 2. Study page parsing methods for extracting links, and apply same-origin processing and reconstruction to the collected links to facilitate analysis of the website's structure. 3. Study search traversal algorithms that avoid missed crawls, external links, and repeated crawls. 4. Perform vulnerability detection on the collected URLs and generate detection results. Experimental tests show that the crawler designed in this paper can efficiently crawl all links of a specified website, supports the scanning of complex sites, can analyze a website's directory structure, and can run vulnerability detection on the crawled URLs. It has high practical value for Web security maintenance and Web vulnerability scanning.