| The 2022 Data Breach Cost Report, conducted by the Ponemon Institute and sponsored by IBM Security, provides an in-depth analysis of real data breaches experienced by 550 organizations worldwide between March 2021 and March 2022. The economic losses that network data breaches inflict on enterprises, government agencies, and healthcare systems have reached an unprecedented level, and data breaches have become a lingering nightmare for economic entities. Healthcare is the industry with the highest average cost per data breach; its average cost now exceeds ten million dollars. Data breaches in healthcare have therefore become a substantial concern in recent years. It is fundamental for government regulators, insurance companies, and stakeholders to understand the breach frequency and the number of affected individuals in each state, as these are directly related to the federal Health Insurance Portability and Accountability Act (HIPAA) and state data breach laws. However, an obstacle to studying data breaches in healthcare is the lack of suitable statistical approaches. We develop a novel multivariate frequency-severity framework to analyze breach frequency and the number of affected individuals at the state level. A mixed effects model is developed for the square-root-transformed frequency, and the log-gamma distribution is proposed to capture the skewness and heavy tail exhibited by the distribution of the numbers of affected individuals. We further discover a positive nonlinear dependence between the transformed frequency and the log-transformed numbers of affected individuals (i.e., severity). In particular, we propose a D-vine copula to capture the multivariate dependence among conditional severities given frequencies, owing to its inherent temporal structure and rich bivariate copula families. A rejection sampling technique is developed to simulate the predictive distributions. Both the in-sample and out-of-sample
studies show that the proposed multivariate frequency-severity model, which accommodates nonlinear dependence, has satisfactory fitting and prediction performance. In addition to understanding the frequency and severity of data breaches at the state level, we are also concerned with the dynamic process by which data breaches occur and with their lifecycle. To mitigate the damage caused by a data breach, the data breach lifecycle must be well understood; it comprises three components: the occurrence of a breach, the time to detect the breach, and the time to report the breach. This work initiates the statistical modeling of data breach occurrence and lifecycle via a self-exciting marked point process. The proposed model accommodates the heterogeneity between hacking and non-hacking events, and the dependence between the two marks (the time to detect the breach and the time to report the breach) is modeled via a copula approach. The missing and censoring mechanisms are taken into account in the modeling process. An empirical study demonstrates the proposed approach’s satisfactory fitting and predictive performance, offering valuable insights into the data breach lifecycle and a useful statistical modeling tool. |