Statistical Analysis on COVID-19

Background: Since Jinyintan Hospital in Wuhan, China received patients with unexplained pneumonia in December 2019, the new coronavirus (COVID-19) has spread rapidly in Wuhan and then to the whole of China and some neighboring countries. We establish a dynamic model of infectious disease and time series models to predict the trend of COVID-19 transmission and to make short-term forecasts, which will support intervention and prevention by departments at all levels in mainland China and buy more time for clinical trials. Methods: Based on the transmission mechanism of COVID-19 in the population and the prevention and control measures implemented, we establish a six-compartment dynamic model, and we establish time series models based on different mathematical formulas according to the variation pattern of the original data. Findings: The results based on time series analysis and dynamic model analysis show that the cumulative number of confirmed COVID-19 pneumonia cases in mainland China can reach 36,343 after one week (February 8, 2020), and the basic reproduction number can reach 4.01. The cumulative number of confirmed diagnoses will reach a peak of 87,701


Introduction
Since December 2019, many unexplained cases of pneumonia, with cough, dyspnea, fatigue, and fever as the main symptoms, have occurred in Wuhan, China within a short period of time [1,2]. China's health authorities and CDC quickly identified the pathogen as a new type of coronavirus, which the World Health Organization (WHO) named COVID-19 on January 10, 2020 [3]. On January [4]. Wuhan, China is the origin of COVID-19 and one of the cities most affected by it. The Mayor of Wuhan stated at a press conference on January 31, 2020 that Wuhan was urgently building Huoshenshan (Vulcan Mountain) Hospital and Leishenshan (Thunder Mountain) Hospital, which would officially admit patients on February 3 and February 6, respectively [5]. By 24:00 on have issued varying degrees of closures and traffic restrictions [6].
In fact, there are many pressing questions about the spread of COVID-19. How many people will be infected tomorrow? When will the inflection point of the infection rate appear? How many people will be infected during the peak period? Can existing interventions effectively control COVID-19? What mathematical models are available to help us answer these questions? COVID-19 is caused by a novel coronavirus that was only discovered in December 2019, so data on the outbreak are still insufficient, and medical means such as clinical trials are still at a difficult exploratory stage [7]. So far, epidemic data have been difficult to apply directly to existing mathematical models, and it remains unclear how effective the existing emergency response has been and how medical resources should be invested more scientifically in the future. This article aims to address these gaps.

Data
Recently, COVID-19 suddenly struck Wuhan, the seventh largest city of the People's Republic of China. The daily epidemic announcements provide the basic data for epidemiological research. We obtained epidemic data for the Chinese mainland from the National Health Commission of the People's Republic of China for January 10, 2020 to February 9, 2020, including the cumulative number of cases, the cumulative number of suspected cases, the cumulative number of recovered people, the cumulative number of deaths and the cumulative number of people in quarantine [8]. At the same time, we collected epidemic data for Hubei Province and its capital city Wuhan from the Health Commission of Hubei Province for January 20, 2020 to February 2, 2020, including the cumulative number of cases, the cumulative number of recovered people and the cumulative number of quarantined people in Hubei Province and Wuhan [9].

The Model
Based on the collected epidemic data, we try to find the transmission pattern of COVID-19, predict the epidemic situation, and then propose effective control and prevention methods. There are generally three kinds of methods for studying how infectious diseases are transmitted. The first is to establish a dynamic model of the infectious disease; the second is statistical modeling based on stochastic processes, time series analysis and other statistical methods; the third is to use data mining techniques to extract the information in the data and find the epidemic pattern of the infectious disease [10].
Considering the short time span of the collected public data, the research in this paper is mainly based on the first two kinds of methods. COVID-19 has spread explosively in Wuhan, China, and effective government intervention and prevention and control measures in all sectors depend on the best possible outbreak prediction [11]. This paper builds a dynamic model of COVID-19 transmission and a statistical model based on time series analysis, and compares how well these mathematical models predict the spread of the COVID-19 epidemic. Because the existing outbreak data do not form a large sample, at this stage of the spread of COVID-19 the dynamic model we build, which contains parameters to be estimated, is better suited to predicting the development trend and peak size of the epidemic, whereas statistical modeling based on time series analysis predicts data values more accurately in the short term.

SEIQDR-Based Method for Estimation
After the outbreak of the COVID-19 epidemic, the Chinese government took many effective measures to combat the epidemic, such as inspection and detention, isolation and treatment, city lockdowns, and stopping traffic on main roads [12][13][14]. However, the traditional SEIR model cannot fully describe the impact of these measures on different populations. Based on an analysis of the actual situation and the existing data, we divided the population into 6 compartments that reflect the current spread of COVID-19 in China and established a more effective model for the dynamic spread of the infectious disease. See Table 1 for the specific classification [15][16][17].
Since the incubation period of COVID-19 is as long as 2 to 14 days, infected but undetected people (E) are already present among the susceptible population (S) in the free environment when the first case is identified. Some infected people go through an incubation period before suspected symptoms are detected (Q). Chest CT imaging is used to check for ground-glass opacities in the lungs and thereby determine whether the diagnosis is confirmed (D). Another part of the population is infected and already sick but not isolated, and is therefore highly infectious in the population (I). After a period of quarantine and treatment, these two groups of people are either discharged from hospital (R) or die owing to underlying diseases. On this basis, we classify the population as shown in Table 1. As shown in Figure 1, setting up the compartments in this way helps us build a clear and accurate COVID-19 transmission dynamics model. Diagnosed patients recover after a certain period of isolation and treatment. We call the ratio of people cured per day to those diagnosed the cure rate γ, which reflects the local level of care and, to some extent, the severity of the condition. δ is the fatality rate of the new pneumonia, reflecting the lethality of COVID-19. The rate $d_{qd}$ at which suspected patients are converted into confirmed cases measures the intensity of quarantine, which varies with the constant changes in medical procedures. At the same time, some highly infectious people in the free environment are transferred to confirmed cases at the rate $d_{id}$, while others are removed at the rate $\delta_1$ owing to a lack of timely treatment.
The incidence rate of the susceptible population S(t) is set as f(t), which to some extent reflects the degree of COVID-19 infection in the susceptible population. Susceptible people in the free environment become latent after being infected with COVID-19 and gradually develop the disease after the incubation period.
The proportion of latent persons who become freely infectious is ε, and the proportion identified as suspected cases is $d_{eq}$. After medical diagnosis, some of the suspected cases are confirmed, while the others are ruled out and return to the susceptible population at a rate of $d_{qs}$. Susceptible people are also converted to suspected cases at a rate of $d_{qs}$. According to the above population classification and parameter definitions, we established the SEIQDR model based on SEIR, which better reflects the spread of COVID-19 in the population [18].
The SEIQDR model is established based on the dynamic transmission mechanism of infectious diseases. In order to study the transmission pattern of COVID-19 in more depth, we analyze some parameters in detail and transform the degree of infection into a form more conducive to data expression [19]. In the mathematical expression for the degree of infection f(t) of COVID-19 in the susceptible population, we refer to the infection rate coefficients of latent and freely infected people as $\beta_E$ and $\beta_I$, respectively.
At this stage, the epidemic caused by COVID-19 may still be in the early stages of spreading among the population. We need to fit and estimate the above parameters using the original data published by the National Health Commission of China. Therefore, we simplify the formula to a certain extent: the infection rate β(t) can be estimated and fitted based on the existing data, and the value of k reflects the infectivity of a latent person relative to an infected person. Furthermore, according to the definition of incidence, the infection rate can be expressed by the number of people diagnosed over a period of time [10]. If the number of people diagnosed on day t is F, the infection rate can be expressed in terms of F, where $d_2$ is the time during which a latent person is isolated after the incubation period. Based on the available data, the infection rate can be calculated numerically.
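The model equations themselves are not reproduced in this section. As a sketch only, the following system is one possible form that is consistent with the flows described above; it is an assumption, not necessarily the exact formulation used here. The symbol $d_{sq}$ (the rate at which susceptible people become suspected, which the text also writes as $d_{qs}$), the total population $N$, and the normalization of $f(t)$ by $N$ are notational assumptions introduced for illustration.

```latex
\[
\begin{aligned}
\frac{dS}{dt} &= -f(t)\,S - d_{sq}\,S + d_{qs}\,Q,\\
\frac{dE}{dt} &= f(t)\,S - (\varepsilon + d_{eq})\,E,\\
\frac{dI}{dt} &= \varepsilon\,E - (d_{id} + \delta_1)\,I,\\
\frac{dQ}{dt} &= d_{eq}\,E + d_{sq}\,S - (d_{qd} + d_{qs})\,Q,\\
\frac{dD}{dt} &= d_{qd}\,Q + d_{id}\,I - (\gamma + \delta)\,D,\\
\frac{dR}{dt} &= \gamma\,D,
\end{aligned}
\qquad
f(t) = \frac{\beta_I\,I + \beta_E\,E}{N}
     = \frac{\beta(t)\,\bigl(I + kE\bigr)}{N},
\quad k = \frac{\beta_E}{\beta_I}.
\]
```

In this sketch the only flows that leave the six compartments are $\delta_1 I$ and $\delta D$, so the sum of the six equations equals $-(\delta_1 I + \delta D)$, in line with the assumption that births and natural deaths are ignored.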

TS Model-Based Method for Estimation
Both the exponential smoothing method and the ARIMAX model are time series analysis methods, which are often used in statistical modeling to analyze changes that occur over time [20,21]. In order to accurately predict the number of con- two Term exponential smoothing value [22]. Brown's linear exponential smoothing model first computes two smoothed series:

$$f_t^{(1)} = \alpha x_t + (1-\alpha)\, f_{t-1}^{(1)}, \qquad f_t^{(2)} = \alpha f_t^{(1)} + (1-\alpha)\, f_{t-1}^{(2)},$$

where $x_t$ is the observed value at time $t$ and $\alpha$ is the smoothing constant. Among them, $f_t^{(1)}$ is the single exponential smoothing value of the model and $f_t^{(2)}$ is the quadratic (double) exponential smoothing value of the model [23]. The Brown linear exponential smoothing model using these two smoothed values is as follows:

$$\hat{x}_{t+m} = a_t + b_t\, m, \qquad a_t = 2 f_t^{(1)} - f_t^{(2)}, \qquad b_t = \frac{\alpha}{1-\alpha}\left(f_t^{(1)} - f_t^{(2)}\right).$$

In the above formula, m is the number of lead periods.
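As an illustration, Brown's linear exponential smoothing can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used in this study: the function name `brown_forecast`, the smoothing constant `alpha`, and the choice to initialize both smoothed values with the first observation are all assumptions made for the example.

```python
def brown_forecast(series, alpha, m=1):
    """Forecast m periods ahead with Brown's linear exponential smoothing.

    series: observed values x_1..x_t (at least one value)
    alpha:  smoothing constant, strictly between 0 and 1
    m:      number of lead periods
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if len(series) == 0:
        raise ValueError("series must not be empty")

    # Assumption: both smoothed values start at the first observation.
    s1 = s2 = float(series[0])
    for x in series[1:]:
        s1 = alpha * x + (1.0 - alpha) * s1   # single smoothing, f_t^(1)
        s2 = alpha * s1 + (1.0 - alpha) * s2  # double smoothing, f_t^(2)

    level = 2.0 * s1 - s2                      # a_t
    trend = alpha / (1.0 - alpha) * (s1 - s2)  # b_t
    return level + trend * m


if __name__ == "__main__":
    # Hypothetical cumulative counts, for illustration only.
    counts = [10, 14, 19, 27, 36, 48]
    print(brown_forecast(counts, alpha=0.6, m=1))
```

For cumulative case counts that grow roughly exponentially, the same function can be applied to the natural logarithm of the series and the forecast transformed back with the exponential function.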
Time series analysis based on exponential smoothing is a statistical modeling method that cannot take input variables into account. Because the cumulative numbers of COVID-19 confirmed cases, deaths, and recoveries may be related in value, time series analysis based on ARIMAX models is also performed on these variables for mainland China, Hubei Province, Wuhan City and some surrounding cities, and the time series models are compared and selected according to their test statistics and prediction performance [24]. In order to predict the spread of 2019-nCoV in the population accurately, we use the ARIMAX model. The ARIMAX model can be seen as an ARIMA model with an

Simulation
Because the COVID-19 epidemic lasts a long time, it is still in an ascending period as of February 9. In this paper, one day is taken as the minimum time unit, a discrete model is obtained according to the practical meaning of the continuous model, the epidemic data of Hubei Province and the whole country are used to track the changes in its parameters, and a numerical simulation is carried out. With one day as the time step, the continuous model can be discretized. Once the parameters are determined, the discrete model can be evaluated numerically from the initial value of each variable. We set the initial values of some parameters according to the references and the latest outbreak information, and we use the least squares method to obtain the variables and parameters that cannot be determined directly. According to the literature, the average onset occurs 1 day after the incubation period, the average disease duration is 21 days, and the mortality rate based on historical data is about 2%. About 0.8 of the daily confirmed cases are converted from suspected cases, and newly admitted patients account for about 0.2 of the confirmed cases per day [25]. Therefore, the following parameters can be preliminarily estimated; these parameters reflect the basic situation of the epidemic at this stage.
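The daily discretization described above can be illustrated with a short simulation. The sketch below is not the discrete model of this paper, whose equations are not reproduced here; it assumes one plausible set of flows between the six compartments, an incidence of the form beta * (I + k * E) / N, and purely illustrative rate values. The names `Rates`, `step`, `simulate` and `d_sq` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    """Daily transition rates. All values used below are illustrative."""
    beta: float    # infection rate coefficient
    k: float       # infectivity of latent (E) relative to infectious (I)
    eps: float     # E -> I
    d_eq: float    # E -> Q
    d_sq: float    # S -> Q (hypothetical symbol)
    d_qs: float    # Q -> S
    d_qd: float    # Q -> D
    d_id: float    # I -> D
    delta1: float  # I -> removed without timely treatment
    gamma: float   # D -> R (cure rate)
    delta: float   # D -> removed (fatality rate)


def step(state, r):
    """Advance (S, E, I, Q, D, R, removed) by one day."""
    s, e, i, q, d, rec, removed = state
    n = sum(state)  # constant, because removed people are still counted
    f = r.beta * (i + r.k * e) / n  # assumed form of the incidence f(t)

    s_to_e, s_to_q = f * s, r.d_sq * s
    e_to_i, e_to_q = r.eps * e, r.d_eq * e
    q_to_d, q_to_s = r.d_qd * q, r.d_qs * q
    i_to_d, i_out = r.d_id * i, r.delta1 * i
    d_to_r, d_out = r.gamma * d, r.delta * d

    return (
        s + q_to_s - s_to_e - s_to_q,
        e + s_to_e - e_to_i - e_to_q,
        i + e_to_i - i_to_d - i_out,
        q + e_to_q + s_to_q - q_to_d - q_to_s,
        d + q_to_d + i_to_d - d_to_r - d_out,
        rec + d_to_r,
        removed + i_out + d_out,
    )


def simulate(initial, r, days):
    """Return the list of daily states, starting with the initial state."""
    states = [tuple(float(x) for x in initial)]
    for _ in range(days):
        states.append(step(states[-1], r))
    return states


if __name__ == "__main__":
    # Hypothetical population size; other initial values are illustrative.
    initial = (1_000_000, 2, 0, 0, 41, 2, 0)
    rates = Rates(beta=0.6, k=0.5, eps=0.2, d_eq=0.1, d_sq=0.0001,
                  d_qs=0.2, d_qd=0.2, d_id=0.3, delta1=0.01,
                  gamma=1 / 21, delta=0.002)
    final = simulate(initial, rates, 30)[-1]
    print("confirmed and in treatment after 30 days:", round(final[4]))
```

Tracking the removed people as a seventh quantity keeps the total constant, which gives a simple consistency check on the update rule. Unknown rates could then be fitted by minimizing the squared difference between the simulated and reported series, in the spirit of the least squares step mentioned above.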

TS Model-Based Estimates
We use sequence diagrams and autocorrelation functions of the original data to assess the stationarity of these time series, and we transform the series whose mean and variance are not constant to make them stationary. In the exponential smoothing method, we apply a natural logarithmic transformation to the series. In the ARIMA and ARIMAX models, we use first-order or second-order differencing of the original sequence. With this processing, we obtain the summary information of the time series models for the number of confirmed cases in mainland China, as shown in Table 2. We also built six time series models of cumulative confirmed cases in Hubei Province, as shown in Table 3, and the autocorrelation function plots of the residuals of these six models are shown in Figure 5. Based on the information in Table 3 & Figure 5,

SEIQDR-Based Estimates
According to the data released by the National Health Commission of China, we set the data on January 10 as the initial value. On January 10, the transmission of COVID-19 had only occurred in Hubei Province, where 41 people were confirmed, 0 were suspected, 2 were cured, 2 were infected with COVID-19 but not yet sick, and 0 were ill but not isolated, namely: $D(0) = 41$, $Q(0) = 0$, $R(0) = 2$, $E(0) = 2$, $I(0) = 0$. In Figure 10, the reciprocal of σ is the average time taken for a suspected case to be diagnosed as a confirmed case, that is, σ = $d_{qd}$. As shown in Figure 10, when σ = 1/5, it takes an average of 5 days for suspected cases to be diagnosed as confirmed cases and to receive the necessary isolation and treatment. If the various preventive measures remain unchanged, the cumulative number of confirmed patients in mainland China will reach a peak 73 days after January 1, 2020, with a peak value of 94,731 people.
When σ = 1/3, patients receive relatively timely isolation and treatment after the onset of the disease, in which case the peak number of cumulative diagnoses is reduced to 87,701, a decrease of 7,030 (about 7.4%). According to Figure 10, the peak number of cumulative diagnoses increases as σ gets smaller, which means that the number of confirmed patients will increase rapidly if suspected patients are not diagnosed in time. We believe that this trend may not be obvious within 30 days after January 10. However, once the epidemic becomes serious, the rapid increase in the number of confirmed cases and the difficulty of timely diagnosis and treatment may pose great challenges to the prevention and control of the epidemic in mainland China. Finally, we analyzed the sensitivity of the SEIQDR model to the mortality δ and the cure rate γ. Figure 11 shows how the cumulative number of confirmed cases in mainland China changes with the mortality and the cure rate. In Figure 11, on the left is the cumulative num-

Discussion
There is no doubt that the spread of COVID-19 in the population is affected by a complex interplay of many factors. In the early stage of the spread of COVID-19, it is difficult to establish a dynamic transmission model with parameters to be estimated and obtain fairly accurate simulation results. However, a preliminary estimate of parameters such as the average latency and the mortality from existing data may help in solving for important parameters such as the infection rate and the recovery rate, which gives us a more accurate grasp of the transmission trend of COVID-19.
On the other hand, statistical modeling of the spread of the new coronavirus pneumonia based on time series analysis can be carried out immediately after the latest data become available each day, because the time series model is based on the pattern of the data itself. Although this method usually requires sufficient data, in the early stages of epidemic transmission it can still be used to predict the indicators of epidemic transmission more accurately in the short term, so as to provide departments at all levels with short-term emergency prevention plans for intervention, control, and policy implementation.

Limitations
This article inevitably makes some assumptions when building the model. When we build a discrete dynamic model of COVID-19 over a certain period of time, we ignore factors such as the birth rate and natural mortality of the population. To simplify calculation, we also assume that the latent population and the infected but not yet isolated population have the same range of activities and capabilities; that is, we assume that the populations E(t) and I(t) have the same contact rate. On the other hand, this article fits and estimates the basic reproduction number, infection rate, and recovery rate of COVID-19 from data collected over a specific period of time. As epidemic data continue to be released, these important indicators may change significantly as COVID-19 spreads through the population.

Conflict of interest
We have no conflicts of interest to disclose, and the manuscript has been read and approved by all named authors.