Research On Critical Data Mining Methods And Applications Of Internet-based Textual Big Data For Product-related Child Injury

Posted on: 2023-02-14
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W X Xiao
GTID: 1524307070990479
Subject: Epidemiology and Health Statistics

Abstract/Summary:
Objective: Product-related child injury is a significant public health and safety issue. Internet-based textual big data can provide valuable information that supplements traditional product-related injury surveillance systems, which is of great significance for the prevention and control of product-related child injury. However, the methodological challenges of accurately defining, filtering, extracting and exploring information on product-related child injury prevention and control from chaotic internet-based textual big data remain unresolved. Based on the theories and methods of injury epidemiology and big data, this study aimed to establish the theoretical methods and technical models for filtering, automatic classification and information extraction, and to explore the main application value of internet-based textual big data for product-related child injury.

Methods:

(1) Textual retrieval, filtering and information extraction standards formulation. Based on the definition and basic elements of product-related child injury, the Haddon Injury Matrix, the International Classification of Diseases, 10th Revision (ICD-10), product-related technical standards, and relevant product quality and safety laws and regulations, we established data retrieval strategies, inclusion and exclusion criteria for data filtering, textual features with their epidemiological classification criteria, and a database of keywords related to product-related child injury through literature review, focus group discussions and expert consultations.

(2) Text classifier construction. We randomly selected 10,000 news items for manual classification annotation and data pre-processing, and allocated them to training, validation and test sets at a ratio of 8:1:1. After comparing the characteristics of different text classification algorithms, we chose the BERT pre-trained model to construct a text classifier suited to this study. Finally, the precision of the text classifier was calculated to evaluate the effectiveness of the
model.

(3) Textual information extraction model construction. We randomly selected 1,000 news items on product-related child injury and conducted manual information extraction annotation and data pre-processing. Based on the textual features and their epidemiological classification criteria, we constructed the information extraction model using regular expressions, named entity recognition, keyword matching and dependency syntactic parsing. Finally, the quality of the information extraction model was evaluated by calculating the precision of information extraction.

(4) Automatic capture and crucial application values of big data. We developed an automatic capture program that uses a web crawler to capture news on product-related child injury from internet-based textual big data in a timely, dynamic, automatic and systematic manner. We then evaluated and explored the crucial application value of internet-based textual big data in terms of data coverage, timeliness and richness.

Results:

(1) Textual retrieval, filtering and information extraction standards formulation. Five news retrieval strategies were developed, covering limits on text language, retrieval time, retrieval object, retrieval range of fields and keywords. Seven inclusion criteria and six exclusion criteria were formulated to filter eligible news, including whether the news was relevant to product-related child injury, was written in Chinese, and reported an injury that happened in mainland China. A total of 29 textual features were formulated in the textual feature extraction and epidemiological classification criteria, including the time, region and product type of injury events. After expansion with synonyms and internet slang terms, a database of keywords relevant to product-related child injury was developed, covering children, products, injury events, environments, behaviors and others.

(2) Text classifier construction. We constructed a text classifier suited to product-related child injury using the BERT pre-trained model after manual classification annotation
and data pre-processing of the 10,000 randomly selected news items. The precision of the text classifier reached 93.55%. Its accuracy on the test and validation sets reached 97.03% and 96.80%, respectively, and its F1 value on the test and validation sets reached 97.00% and 96.79%, respectively.

(3) Textual information extraction model construction. We developed an information extraction model suited to product-related child injury based on 925 news items retained after checking and manually annotating 1,000 randomly selected news items on product-related child injury. Evaluation showed that the precision of information extraction exceeded 70% for 25 of the 29 variables; the exceptions were product characteristics (62.77%), injury clinical diagnosis (44.94%), product-related preventive measures or suggestions (36.73%) and injury event cause descriptions (4.81%). Among the 25 variables, precision was 70%-80% for 9 variables and 80%-90% for 14 variables, and it exceeded 90% for the number of injured children and the time of the injury event.

(4) Automatic capture and coverage of internet-based big data. A total of 23,643 news items on product-related child injury published from January 1, 2010 to December 31, 2021 were included in this study. We included 51 news media websites, two main social media platforms and other relevant platforms, involving 9,935 accounts of news media websites and social media platforms. The geographical coverage rate of the included news reached 95.22% (293/297) among prefecture-level cities.

(5) Timeliness evaluation of internet-based big data. We evaluated timeliness using two examples of socially prevalent product-related child injury events (magnet ingestion and electric self-balancing scooter crashes) that are not currently covered by official statistics. The websites and platforms included in this study reported the first
product-related child injury event related to magnet ingestion in China on September 14, 2015, which was 5 years earlier than the first product recall notice issued by the Defective Product Administrative Center of the State Administration for Market Regulation on December 18, 2020, and more than one year earlier than the relevant literature published on November 25, 2016. This study captured the first product-related child injury event related to domestic electric self-balancing scooter crashes on July 3, 2016, which was 3 years earlier than the first product recall notice issued by the Defective Product Administrative Center of the State Administration for Market Regulation on August 2, 2019, and more than half a year earlier than the relevant literature published on February 25, 2017.

(6) Richness evaluation of internet-based big data. Compared with the product-related child injury variables in the National Product-related Injury Surveillance Table released by the Chinese CDC, this study supplemented 15 additional variables, including human-related causes of injury events. For the two socially prevalent product-related child injury events (magnet ingestion and electric self-balancing scooter crashes), the data added information on the causes of injury events, including unsuitability of the product for specific children, improper supervision by child caregivers, and improper use and installation of products. In addition, the titles and main bodies of the included news involved 21,793 and 208,454 keywords, with cumulative word frequencies of 228,191 and 9,848,482, respectively.

Conclusion:

(1) This study developed the retrieval strategies, inclusion and exclusion criteria, textual feature extraction and epidemiological classification criteria, and the database of keywords for product-related child injury news, filling the gap in standards for retrieval, filtering and information extraction of internet-based textual big data for product-related child injury in China.

(2) This study established
a text classifier suited to product-related child injury news with high classification precision. In the future, classification precision can be further improved by expanding the annotated news set and formulating more accurate text classification rules.

(3) This study constructed a textual information extraction model suited to product-related child injury news with high precision of information extraction on the crucial variables. In the future, the precision of information extraction can be further improved by expanding the manually annotated news set and formulating more accurate rules for textual information extraction.

(4) The internet-based textual big data platform was tested and found to be stable, and could capture news on product-related child injury in a timely, dynamic, automatic and systematic manner. The captured news has good regional representativeness. This study complemented the variables that are not yet monitored by the relevant government agencies and enriched the data, including product-related injury event types and risk factors not covered by those agencies. Such information can be discovered in a timely manner so that prevention and control measures can be taken before further events occur, providing clues and references for the government to develop interventions on product-related child injury.
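
As an illustration of the evaluation described under Methods (2) and Results (2), the following minimal Python sketch shows how an annotated set could be split 8:1:1 and how precision, recall, accuracy and F1 could be computed. It is not the study's code: the function names `split_8_1_1` and `binary_metrics`, the binary label setting (1 = relevant news, 0 = not relevant) and the fixed random seed are all assumptions made for illustration.

```python
import random


def split_8_1_1(items, seed=0):
    """Shuffle items and split them into train/validation/test sets at 8:1:1.

    The seed is an assumption; the abstract only states that the split was random.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10  # integer arithmetic avoids float rounding
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


def binary_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, accuracy and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Usage with placeholder data (not the study's data).
train, val, test = split_8_1_1(range(10000))
metrics = binary_metrics([1, 1, 1, 0, 0, 1], [1, 1, 0, 0, 1, 1])
```

With 10,000 items this split yields 8,000 training, 1,000 validation and 1,000 test items, matching the ratio stated in the abstract. Reporting precision alongside accuracy and F1 is useful when relevant and irrelevant news are imbalanced, since accuracy alone can look high even if many retrieved items are false positives.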
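
The rule-based part of the information extraction model in Methods (3) (regular expressions plus keyword matching) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the study's model: the keyword lists, the English-language patterns, the field names and the example sentence are all hypothetical (the study's corpus is Chinese news and its keyword database is not reproduced here), and named entity recognition and dependency parsing are omitted. The abstract also does not define how extraction precision was computed; `field_precision` assumes it is the share of non-empty extracted values that match the manual annotation.

```python
import re

# Hypothetical keyword lists standing in for the study's keyword database.
PRODUCT_KEYWORDS = {
    "toy": ["magnetic beads", "magnet"],
    "vehicle": ["self-balancing scooter", "scooter"],
}

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
DATE_PATTERN = re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b")
AGE_PATTERN = re.compile(r"\b(\d{1,2})-year-old\b")


def extract_fields(text):
    """Extract event date, child age and product type from one news text."""
    lowered = text.lower()
    date_match = DATE_PATTERN.search(text)
    age_match = AGE_PATTERN.search(lowered)
    product_type = None
    product_keyword = None
    # Keyword matching: the first keyword found decides the product type.
    for category, keywords in PRODUCT_KEYWORDS.items():
        hit = next((kw for kw in keywords if kw in lowered), None)
        if hit is not None:
            product_type, product_keyword = category, hit
            break
    return {
        "event_date": date_match.group(0) if date_match else None,
        "child_age": int(age_match.group(1)) if age_match else None,
        "product_type": product_type,
        "product_keyword": product_keyword,
    }


def field_precision(predictions, gold, field):
    """Share of non-empty extracted values for `field` that match the annotation."""
    extracted = [(p[field], g[field]) for p, g in zip(predictions, gold)
                 if p[field] is not None]
    if not extracted:
        return 0.0
    return sum(1 for p, g in extracted if p == g) / len(extracted)


# Usage with an invented sentence (not taken from the study's data).
example = ("On May 1, 2020, a 6-year-old boy swallowed several "
           "magnetic beads at home.")
fields = extract_fields(example)
```

Structured fields such as dates and counts lend themselves to patterns like these, which is consistent with the high precision reported for the time of injury event and the number of injured children, whereas free-text fields such as cause descriptions are much harder to capture with rules.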
Keywords/Search Tags: Internet-based big data, Child, Product-related injury, News media, Social media, Textual analysis, BERT