
Research Article | Open Access

A Study on the Development of a Machine Learning Prediction Model on the Spread of COVID-19

Volume 45, Issue 2

Juhwan Moon1, Hongsik Yoon2, Hongsul Lee1 and Jaejoon Lee1*

  • 1Interdisciplinary Program in Crisis, Disaster and Risk Management, Sungkyunkwan University, South Korea
  • 2School of Civil, Architectural Engineering & Landscape Architecture, Sungkyunkwan University, South Korea

Received: June 22, 2022;   Published: July 26, 2022

*Corresponding author: Jaejoon Lee, Interdisciplinary Program in Crisis, Disaster and Risk Management, Sungkyunkwan University, Seoul, South Korea

DOI: 10.26717/BJSTR.2022.45.007184


ABSTRACT

The COVID-19 pandemic has caused great loss of human life and severe economic damage. The number of confirmed cases, which had slowed since February of this year as vaccines became available, increased rapidly again because of mutated viruses and lax enforcement of the quarantine system. This study differs from previous work in that the most recent data were used and sufficient training data were available. In addition, the number of confirmed cases was predicted from the latest information, including the numbers of people who had received a first vaccine dose and who were fully vaccinated. Furthermore, we built a predictive model using only information on confirmed COVID-19 cases, subdivided it by parameter, and sought to propose an accurate and effective model for predicting the number of confirmed cases. The machine learning models used in this study were neural networks, ensembles, distance-based models, and linear regression, all as supervised learning models.

Among the models with strong predictive power, Gradient Boosting and AdaBoost had high training scores, and CatBoost showed the best predictive power among the gradient boosting models in per-model cross-validation, with about 94.8% of its predictions being accurate. CatBoost's predictive power was poor, however, in the region where the number of confirmed cases rose rapidly because of the mutated virus. In particular, the CatBoost model was effective in predicting the small and irregular infection counts of the early stage but was somewhat ineffective for the period of rapid increase caused by the Delta variant. As future work, prediction algorithms based on unsupervised machine learning techniques should be implemented and compared. Predictions should also incorporate relevant policy variables, such as the level and enforcement of social distancing.

Keywords: Corona Pandemic; Machine Learning; Prediction Model

Introduction

The coronavirus, first identified in China in December 2019, caused a global pandemic through person-to-person transmission. In response, each country has established various health care policies. The need for national risk management is growing because the outbreak affects not only the capacity of the health care system, strained by an unprecedented increase in patients, but also the survival and continuity of businesses. As of August 4, 2021, there had been more than 200 million confirmed cases and more than 4 million deaths worldwide. Moreover, even though primary vaccination coverage had exceeded 40%, the rate of spread continued to increase because of viral variants. Vaccines and therapeutics have therefore become more important, yet the main means of preventing the spread of COVID-19 remain strict personal hygiene and intensive social distancing, and new confirmed cases continue to occur even where these measures are practiced consistently.

Health authorities must therefore decide on workforce planning and policy responses within a short period of time. Accurately predicting the near-term spread of the coronavirus at a sufficiently granular level gives the authorities better information and more time to respond. Effective policies based on that information help not only to prevent the spread of infectious disease but also to secure business continuity. The aim of this study is to develop a model that can accurately predict the spread of the disease using machine learning techniques. As in previous studies, external factors (seasonal, environmental, and geographical) were not used. Previous studies, however, could not construct realistic data because the daily numbers of first-dose and fully vaccinated people were not utilized. The structure of this paper is as follows: Chapter 2 reviews related prior research and machine learning based predictive models. Chapter 3 compares training scores for each model, performs model validation, and selects the optimal predictive model by tuning detailed parameters. The final chapter presents conclusions and future plans.

Related Research

Prior Research

With the coronavirus spreading rapidly around the world, numerous studies have been conducted to develop models that predict its spread. Among prior studies based on COVID-19 data, Marvel (2020) and Metha (2020) evaluated the risk of infection in the United States through an ensemble of existing epidemiological models and machine learning techniques, respectively. Ceylan [1] proposed ARIMA models to predict COVID-19 cases in Italy, Spain, and France and found that the MAPE ranged from 4 to 6%. Chimmula and Zhang [2] used a deep learning (LSTM) model, a recurrent neural network, to predict the trend and end time of COVID-19 in Canada. Yang, et al. [3] combined population movement data and epidemiological data for COVID-19 prediction in China; they reported that combining susceptible-exposed-infected-removed (SEIR) and LSTM models was effective in predicting the peak and magnitude of the epidemic. He, et al. [4] simulated the spread of COVID-19 in Hubei, China, through an SEIR disease spread model and a particle swarm optimization algorithm. Alazab et al. showed that the Prophet model is effective in predicting the numbers of confirmed cases, recoveries, and deaths from coronavirus in Australia and Jordan.

Arora, et al. [5] used recurrent neural network models to predict the trend of confirmed COVID-19 cases in India. Pandey, et al. [6] predicted the number of COVID-19 cases in India using an SEIR model and a regression model, drawing on data from Johns Hopkins University [7] to forecast the number of confirmed cases over a two-week period; evaluated with RMSLE, the SEIR model achieved 1.52 and the regression model 1.75, indicating satisfactory performance. Alzahrani, et al. [8] used the ARIMA model to predict the number of COVID-19 cases in Saudi Arabia; after comparing ARIMA configurations to determine the best fit, they proposed ARIMA(2,1,1) as the most suitable model for the daily number of confirmed cases. Pinter, et al. [9] proposed a hybrid machine learning approach to predict COVID-19 in Hungary, combining an adaptive network-based fuzzy inference system (ANFIS) with a multi-layer perceptron optimized by an imperialist competitive algorithm (MLP-ICA). Among previous studies in Korea, Jeong (2020) used a mathematical epidemiological model to estimate domestic infection weights and evaluated the effectiveness of government policies, and Kim Jin-oh, et al. [10] compared and analyzed predicted and actual COVID-19 confirmed cases and deaths by country. They used the SIR (Susceptible-Infected-Recovered) epidemic model for prediction, fitting the SIR curve with L-BFGS-B, one of the machine learning optimization algorithms.

In addition, Jin-soo Bae and Seong-beom Kim [11] proposed a methodology for predicting new confirmed cases four days ahead from the information on confirmed cases available to date, and they considered legal holidays when predicting new cases with a machine learning model. Myung-hui Kim (2021) proposed a deep learning based model combining CNN, Bi-LSTM, and attention mechanisms to predict the number of confirmed COVID-19 cases. Hyeongju Seon (2021) predicted the daily number of confirmed cases by synthesizing external variables such as epidemiological data, demographic data, and search trends; experiments with various models showed that tree- and regression-based machine learning models can meaningfully predict the number of confirmed patients. Seung-Yeol Lee and Myung-Ki Shin (2020) studied how to predict and control the number of confirmed cases arriving from abroad using mathematical modeling; their model predicts the number of overseas inflows using roaming service data and the LSTM algorithm. There are also previous studies that investigated the relationships between climate, aviation, and web data and COVID-19.

First, among studies using climate and temperature, Mohammad, et al. treated COVID-19 as a seasonal respiratory virus and investigated how factors such as altitude, humidity, and temperature might apply. Peng Shi, et al. [12] investigated the relationship between COVID-19 and temperature in China based on weather and epidemiological data, and a study in Brazil [19] examined the relationship between coronavirus transmission and ecological factors of the tropical climate, finding that temperature had a negative linear relationship with the number of confirmed cases. Among studies using the impact of aviation, To [13] analyzed the relationship between the number of passengers at Hong Kong International Airport and COVID-19, and Coelho et al. tested the effects of climate and aviation on COVID-19 prediction. Kumar, et al. [14] noted that mobility between regions of India is very diverse and that infection cases vary dynamically from region to region; for accurate prediction, they studied population migration data and monthly airline passenger data. Researchers at KAIST in Korea [15] designed Hi-COVIDNet for monitoring COVID-19; the model addresses the problem of monitoring inbound travelers in each country and predicting imported COVID-19 cases, demonstrated its practicality and effectiveness in real-world experiments, and predicted the number of future imported cases far more accurately than the baselines. Finally, among studies using web data, Qin, et al. [16] proposed a model for predicting case numbers through a Social Media Search Index (SMSI) for COVID-19. Jahanbin, et al. [17] proposed the FAMEC system to collect unstructured COVID-19 data from Twitter to monitor the spread of the epidemic. Li, et al. [18] predicted epidemic outbreaks in China using Internet searches and social media data; based on the observation that web data appear earlier than the spread of COVID-19 itself, they developed a model for monitoring a new epidemic using Google Trends and the Baidu Index and found that these data correlated highly with real COVID-19 data.

Predictive Model

Looking at the methodologies for predicting confirmed COVID-19 cases, many prior studies first made predictions using mathematical-dynamical (compartmental) methods. Recently, both mathematical modeling and machine learning have been applied to epidemiological research. Mathematical models have long been effective tools for monitoring and predicting the prevalence of infectious diseases, but machine learning is increasingly used for problems that such models cannot handle because of the number of variables and the computational burden. Epidemiological models use population information for three groups, the S (Susceptible) group, the I (Infected) group, and the R (Recovered) group, known collectively by the acronym SIR (a minimal sketch of these dynamics is given below). Group S is the population at risk of infection because they have no immunity to the disease, together with infected people who do not yet know they are infected. Group I is the population whose infection has been recognized and confirmed. Group R is the population whose disease has been cured, or who have died, and who will neither become ill again nor spread the disease. The second family is machine learning based prediction models, including neural network models and in particular the Long Short-Term Memory (LSTM) model. LSTM is a structure in which the hidden-layer units of a conventional RNN are replaced by LSTM cells; an LSTM cell controls, through learned weights, how much information from each cell is retained as the distance between the input data and the output of a previous step grows. A decision tree is a model that makes predictions using several hierarchical decision rules. Random forest, one of the ensemble models using bagging, calculates a final prediction by training multiple decision trees and then synthesizing (voting on, averaging, or otherwise combining) their outputs. A model combining several decision trees has better prediction performance than a single decision tree.
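As a minimal illustration of the SIR dynamics referred to above, the following Python sketch integrates the three compartments with a simple Euler scheme; the transmission rate, recovery rate, population size, and initial state are illustrative values chosen for this sketch, not figures estimated in this study.

# Minimal SIR sketch: Euler integration of
#   dS/dt = -beta*S*I/N,  dI/dt = beta*S*I/N - gamma*I,  dR/dt = gamma*I.
# beta, gamma, N and the initial state below are illustrative only.
def simulate_sir(beta=0.3, gamma=0.1, n=51_000_000, i0=100, days=180, dt=1.0):
    s, i, r = n - i0, float(i0), 0.0
    history = []
    for _ in range(int(days / dt)):
        new_infections = beta * s * i / n * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history

trajectory = simulate_sir()
peak_day, peak = max(enumerate(t[1] for t in trajectory), key=lambda x: x[1])
print(f"Peak active infections ~{peak:,.0f} around day {peak_day}")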

The advantage of combining multiple decision trees in a random forest is that even if some individual trees make incorrect predictions, an accurate prediction is still possible by synthesizing the results of the many tree models. Gradient boosting, like random forest, predicts by synthesizing the results of several decision trees, but its most prominent feature is that it trains the trees sequentially: each new tree learns the errors that the previous trees predicted incorrectly, and the subsequent trees gradually reduce the error (a toy sketch of this idea is given below); the model structure and the hyperparameters to be searched must be tuned accordingly. XGBoost, an improved version of gradient boosting, is one of the ensemble models that use boosting; because it is based on CART (Classification and Regression Trees), it can be used for both classification and regression problems, like random forest, and because it follows the same methodology as gradient boosting, the weights among its internal models are determined by gradient descent. The K-nearest neighbor method is generally used for classification, but it can solve regression problems on the same principle: as in KNN classification, the distance between samples in the N-dimensional feature space is calculated (where N is the number of independent variables); in classification the category is decided by a vote among the K nearest neighbors, whereas in regression the prediction is the average of the values of the K nearest neighbors. Fine adjustment is possible through the formula used to calculate the distance and through a weighted average.
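The sequential error-correction idea behind gradient boosting mentioned above can be illustrated with a short Python sketch in which each new shallow tree is fit to the residuals left by the ensemble so far; the data are synthetic and the hyperparameters arbitrary, so this is a conceptual sketch rather than the configuration used in this study.

# Toy gradient boosting for regression with squared loss: each shallow tree
# is trained on the residuals of the current ensemble prediction.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) * 50 + rng.normal(0, 5, size=300)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())      # start from the mean prediction
trees = []
for _ in range(100):
    residuals = y - prediction              # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("Training RMSE after boosting:", np.sqrt(np.mean((y - prediction) ** 2)))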

The tree-based models used in this study were the single Tree model, Random Forest, and the gradient boosting variants Extreme Gradient Boosting (XGBoost), Gradient Boosting (scikit-learn), Extreme Gradient Boosting Random Forest (XGBoost), and Gradient Boosting (CatBoost); AdaBoost was also used. As regression-based models, Linear Regression, Ridge Regression (L2), Lasso Regression (L1), and Elastic Net Regression were used. The distance-based model was a KNN regressor, evaluated with each metric option. Prediction after learning was also attempted with SGD, SVM, and a Neural Network, but their predictions were either too unstable or under-fitted, so these models failed to learn. Therefore, KNN, Tree, Neural Network, Random Forest, Gradient Boosting, Linear Regression, and AdaBoost were used in this study. The software environment consisted of Orange 3 3.26.0 (run from Anaconda Navigator), RStudio, and Excel. A limitation of previous studies is that their prediction models were built only from the simple daily counts of confirmed patients and were divided into sections to improve predictive power; the overall predictive power can be high, but the realism of the forecast is insufficient. In addition, because infectious diseases such as COVID-19 behave nonlinearly, there are limits to accurate prediction from such a simple data set. Previous studies used univariate regression models based only on the confirmed-case variable.
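Orange 3 wraps scikit-learn internally, so the model families listed above can be sketched as follows in Python; the hyperparameters shown are library defaults rather than the exact Orange settings, and XGBoost and CatBoost require their own packages, so the sketch is indicative only.

# Candidate regressors corresponding to the model families named above.
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (RandomForestRegressor, GradientBoostingRegressor,
                              AdaBoostRegressor)
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor

candidates = {
    "Tree": DecisionTreeRegressor(),
    "Random Forest": RandomForestRegressor(),
    "Gradient Boosting (scikit-learn)": GradientBoostingRegressor(),
    "AdaBoost": AdaBoostRegressor(),
    "Linear Regression": LinearRegression(),
    "Ridge Regression (L2)": Ridge(),
    "Lasso Regression (L1)": Lasso(),
    "Elastic Net Regression": ElasticNet(),
    "kNN (Euclidean metric)": KNeighborsRegressor(metric="euclidean"),
    "kNN (Manhattan metric)": KNeighborsRegressor(metric="manhattan"),
    "Neural Network": MLPRegressor(max_iter=200),
}

# The XGBoost and CatBoost variants live in separate packages:
# from xgboost import XGBRegressor, XGBRFRegressor
# from catboost import CatBoostRegressor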

They did not reflect quantities that change over time, such as vaccination counts (the number of first-dose recipients, including those released from isolation, and the number of fully vaccinated people). In addition, adequate data could not be secured for the sections where the number of confirmed cases changed rapidly or irregularly, so the reliability of the evaluated prediction performance was insufficient, and because those studies were conducted earlier, recent trends could not be used. In this study, realistic data were constructed using the daily numbers of first vaccinations and of completed vaccinations. The external factors (seasonal, environmental, and geographical) used in previous studies, which mainly analyzed COVID-19 in 2020, are gradually falling out of use; transmission and spread driven by external factors should ideally be identified and used as parameters, but this study does not consider them. Many previous studies used temperature and regional characteristics on the grounds that seasonal, climatic, environmental, and geographical factors are related to the spread of COVID-19, but these were excluded here because, over the two years of pandemic data used for training, they were not strongly correlated with the outcome. Instead, the number of daily deaths, the number of daily tests, and the infection rate relative to the number tested were added to increase predictive power. In addition, the data set was constructed as a balanced data set, because uncertainty in the data could cause the machine learning models to fail to learn. Unlike previous studies, various models (Gradient Boosting, XGBoost, CatBoost, etc.) were used to increase prediction accuracy.

Data Collection and Preprocessing

The original data used in this paper were obtained from the Johns Hopkins University public data set (https://github.com/CSSEGISandData/COVID-19) and from coronaboard.kr. The data cover the period from March 04, 2020 to August 04, 2021. (Table 1) shows the target variable and feature variables constructed from the collected data.

Table 1: Variables and Configurations.

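As an illustration of how the daily Korean series can be assembled from the repository cited above, the following Python sketch downloads the global confirmed-case time series and differences the cumulative counts into daily new cases; the file path inside the repository and the "Korea, South" country label reflect the repository layout at the time of writing and should be verified, and the coronaboard.kr variables are not covered here.

# Sketch: derive daily new confirmed cases in Korea from the JHU CSSE repository.
import pandas as pd

URL = ("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
       "csse_covid_19_data/csse_covid_19_time_series/"
       "time_series_covid19_confirmed_global.csv")

confirmed = pd.read_csv(URL)
korea_cum = (confirmed[confirmed["Country/Region"] == "Korea, South"]
             .iloc[:, 4:]      # drop Province/State, Country/Region, Lat, Long
             .sum()            # collapse to one cumulative series
             .astype(float))
korea_cum.index = pd.to_datetime(korea_cum.index)

# Daily new confirmed cases restricted to the study window.
k_d_cnfrm = korea_cum.diff().loc["2020-03-04":"2021-08-04"]
print(k_d_cnfrm.tail())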

In this study, the total input data comprise 519 instances, split 80:20 into 416 training instances and 103 test instances. For per-model cross-validation, 10-fold or 20-fold cross-validation of the training data was used. KNN, Tree, Random Forest, Gradient Boosting, Linear Regression, AdaBoost, SVM, and Neural Network models were used. (Table 2) shows the descriptive statistics of the variables used in this study. The x-axis is the date (start = 2020-03-04), and the columns correspond to the distributions of the K_D_Cnfrm, K_D_Dth, K_D_Rcvr, K_F_Vac, K_D_Exm, C_Cnfrm, Under_Care, G_D_Cnfrm, and G_D_Rcvr data. (Figures 1 & 2) show the correlations between variables visualized as a heatmap. The correlation between the number of daily tests (K_D_Exm) and the number of daily confirmed cases (K_D_Cnfrm) was high at 0.87. The correlation between the number of patients under treatment (Under_Care) and the number of daily tests (K_D_Exm) was also high at 0.92, and the correlation between the number of daily confirmed cases (K_D_Cnfrm) and the number of patients under treatment (Under_Care) was high at 0.87.

Table 2: Descriptive Statistics of the Variables.


Note: The data used in this study are visualized by date.

Figure 1: Scatterplot by Date.


Figure 2: Correlation.

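A minimal sketch of the 80:20 split and the correlation heatmap described above is given below; it assumes the assembled variables have been saved to a table, where the file name covid_kr_features.csv is a hypothetical placeholder and the column names follow the variable labels used in this study.

# Sketch of the 80:20 train/test split and the correlation heatmap (cf. Figure 2).
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Hypothetical file containing the assembled Table 1 variables.
df = pd.read_csv("covid_kr_features.csv", parse_dates=["Date"])

feature_cols = ["K_D_Dth", "K_D_Rcvr", "K_F_Vac", "K_D_Exm",
                "C_Cnfrm", "Under_Care", "G_D_Cnfrm", "G_D_Rcvr"]
target_col = "K_D_Cnfrm"

# 519 instances split into 416 training and 103 test instances (80:20).
X_train, X_test, y_train, y_test = train_test_split(
    df[feature_cols], df[target_col], test_size=0.2, random_state=0)

# Pairwise Pearson correlations between the target and the features.
sns.heatmap(df[[target_col] + feature_cols].corr(), annot=True, cmap="coolwarm")
plt.tight_layout()
plt.show()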

Prediction Results

Model Selection

In this study, the machine learning models were trained using the 80% training split described above.

Test on Training Data: The machine learning models used for training were KNN, Tree, SVM, SGD, Random Forest, Neural Network, Linear Regression, Gradient Boosting, and AdaBoost. Their results on the training data are shown in (Table 3) below. As the table shows, SVM (Support Vector Machine), SGD (Stochastic Gradient Descent), and the Neural Network could not be trained and therefore did not fit the modeling of this study. For SGD, the MSE, RMSE, MAE, and R_Squared values, which indicate the explanatory power of the model, did not converge, reaching 2.290E+62, 1.513E+31, 1.513E+31, and -1.598E+57; the model therefore cannot be used in this study. Likewise, the Neural Network's MSE, RMSE, MAE, and R_Squared did not converge, at 150853.729, 388.399, 291.663, and -0.053, so this model cannot be used either. For SGD, Hinge was selected as the classification loss function with ε set to 1.10, the regression loss was Squared Loss with ε set to 0.10, and for regularization the strength (α) was 0.00001; Ridge (L2) was relatively suitable, but learning still did not converge. For the SVM model, both ε-SVM and v-SVM were used: for the former, cost (C) was set to 1.00 and the regression loss epsilon (ε) to 0.10, while for v-SVM the regression cost (C) was 1.00 and the complexity bound (v) was 1.00. The RBF kernel performed better than the Linear, Polynomial, and Sigmoid kernels for ε-SVM, whereas v-SVM performed best with the Linear kernel, which explained the data better than RBF, Sigmoid, and Polynomial. The numeric tolerance of the optimization parameters was set to 0.0010. However, this model also showed no tendency to converge. For the Neural Network (NN), the number of neurons in the hidden layer was 1000; among the activation functions, ReLU was relatively superior to tanh, Identity, and Logistic; among the solvers, Adam and SGD did not converge with the ReLU function, and L-BFGS-B gave the most appropriate values; regularization was α = 0.002, the maximal number of iterations was 200, and replicable training was enabled. The overall model fit was nevertheless very poor. These three models were therefore not used because they did not fit the modeling of this study.

Table 3: Performance of Test on the training Data.

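Because Orange 3 builds on scikit-learn, the SGD, SVM, and neural-network settings reported above can be approximated in Python as follows; the mapping between Orange widget options and scikit-learn parameters is not one-to-one, so this is an indicative sketch rather than an exact reproduction of the configurations that failed to converge.

# Approximate scikit-learn counterparts of the SGD, SVM and neural-network settings.
from sklearn.linear_model import SGDRegressor
from sklearn.svm import SVR, NuSVR
from sklearn.neural_network import MLPRegressor

sgd = SGDRegressor(loss="squared_error", penalty="l2",       # squared loss, Ridge (L2)
                   alpha=1e-5, epsilon=0.10)
svm_eps = SVR(kernel="rbf", C=1.0, epsilon=0.10, tol=1e-3)   # ε-SVM regression
svm_nu = NuSVR(kernel="linear", C=1.0, nu=1.0, tol=1e-3)     # v-SVM regression
mlp = MLPRegressor(hidden_layer_sizes=(1000,), activation="relu",
                   solver="lbfgs",                # L-BFGS solver, as reported above
                   alpha=0.002, max_iter=200, random_state=0)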

Results by Model

For the tree model, results were first obtained separately for each parameter setting. Induce Binary Tree was enabled, the minimum number of instances in leaves was set to 2, and Do Not Split Subsets Smaller Than was set to 5. Limit the Maximal Tree Depth was set to 1000, and the remaining options were varied. For classification, Stop When Majority Reaches [%] was set to 90. (Table 4) shows that the training score was 0.995 to 0.996 and that the prediction score dropped to 0.929-0.930. In the KNN model, the number of neighbors was set to 3; (Table 5) shows that the training score was 0.962 to 0.981, indicating good learning, and the predictions showed reasonable predictive power of 0.875 to 0.947. For Linear Regression, Fit Intercept was set as the parameter and regularization was applied as shown in (Table 6). (Table 6) shows the training score and predictions of Linear Regression: the training score was 0.853, indicating under-fitting, and the predictive power was 0.851, also under-fitted and relatively low.

Table 4: Tree’s Training Score and Predictions.


Table 5: Training Score and Predictions of KNN.


Table 6: Training Score and Predictions of Linear Regression.

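For reference, the Tree, KNN, and linear-regression settings described above translate roughly into the following scikit-learn calls; Orange option names do not all map one-to-one, so this is an indicative sketch only.

# Rough scikit-learn counterparts of the Tree, kNN and regression settings.
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

tree = DecisionTreeRegressor(min_samples_leaf=2,    # min. instances in leaves = 2
                             min_samples_split=5,   # do not split subsets smaller than 5
                             max_depth=1000)        # maximal tree depth limit = 1000
knn = KNeighborsRegressor(n_neighbors=3)            # number of neighbors = 3
ols = LinearRegression(fit_intercept=True)          # plain least squares
ridge = Ridge(alpha=1.0)                            # L2 regularization
lasso = Lasso(alpha=1.0)                            # L1 regularization
enet = ElasticNet(alpha=1.0)                        # combined L1 + L2 regularization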

As a basic property of Random Forest, the number of trees was set to 10, as shown in the table below, and training was performed separately for each parameter setting. By default, the Number of Attributes Considered at Each Split was set to 5; the Balance Class Distribution option was tried but not used because it lowered the scores. When the Replicable Training function was not used, the results differed each run. For growth control, the Limit Depth of Individual Trees was set to 3 and Do Not Split Subsets Smaller Than was set to 5. The resulting values are shown in (Table 7): the training scores ranged from 0.921 to 0.989, indicating overfitting, while the predictive power ranged from 0.871 to 0.943, the best fit among these settings, and was high.

Table 7: Training Score and Predictions of Random Forest.

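The random-forest configuration described above corresponds roughly to the following scikit-learn call; the random seed is arbitrary and stands in for the replicable-training option.

# Indicative scikit-learn version of the random-forest settings above.
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=10,        # number of trees = 10
                           max_features=5,         # attributes considered at each split = 5
                           max_depth=3,            # limit depth of individual trees = 3
                           min_samples_split=5,    # do not split subsets smaller than 5
                           random_state=0)         # fixed seed for replicable training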

Comparison of Model Performance

The training scores on the training data show over-fitting for both models: AdaBoost reached MSE 62.952, RMSE 7.934, MAE 2.514, and R2 1.000, and Gradient Boosting reached MSE 2.815, RMSE 1.678, MAE 1.175, and R2 1.000. In terms of prediction score, AdaBoost was less accurate, with MSE 14414.078 and an explanatory power (R2) that fell to 0.916, while Gradient Boosting, although its MSE of 9441.362 was also high, kept a relatively high R2 of 0.945 (Table 8). In the stratified shuffle split with 20 random samples at 80% training data, AdaBoost's MSE was 7253.901 with R2 0.943, while Gradient Boosting had the smallest learning error, MSE 6930.701, and the best model explanatory power, R2 0.946. Gradient Boosting was also the best in terms of predictive power, with an MSE of 9441.362 and an R2 of 0.945 (Table 9). Comparing the training and prediction scores under the cross-validation conditions, the 10-fold cross-validation training score for AdaBoost was MSE 6530.401, RMSE 80.811, MAE 51.575, and R2 0.954, while Gradient Boosting reached MSE 5819.615 and R2 0.959. Under 20-fold cross-validation, AdaBoost's R2 was 0.952 and Gradient Boosting's MSE was 6222.771 with R2 0.957.

Table 8: Training score for the training data.


Table 9: Stratified Shuffle Split.

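The evaluation protocol described above, a shuffle split with 20 random samples at 80% training size plus 10-fold and 20-fold cross-validation scored with MSE, RMSE, MAE, and R2, can be sketched as follows; the data here are synthetic placeholders, and a plain ShuffleSplit stands in for Orange's stratified option, since stratification is not defined for a continuous target.

# Sketch of the cross-validation protocol with MSE, RMSE, MAE and R2 scoring.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import ShuffleSplit, KFold, cross_validate

def summarize(model, X, y, splitter):
    scores = cross_validate(model, X, y, cv=splitter,
                            scoring=("neg_mean_squared_error",
                                     "neg_mean_absolute_error", "r2"))
    mse = -scores["test_neg_mean_squared_error"].mean()
    mae = -scores["test_neg_mean_absolute_error"].mean()
    return {"MSE": mse, "RMSE": float(np.sqrt(mse)), "MAE": mae,
            "R2": scores["test_r2"].mean()}

# Placeholder data; in the study these are the 416 training instances.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(416, 8))
y_demo = X_demo @ rng.normal(size=8) + rng.normal(scale=0.1, size=416)

gb = GradientBoostingRegressor()
splitters = [("Shuffle split (20 samples, 80% train)",
              ShuffleSplit(n_splits=20, train_size=0.8, random_state=0)),
             ("10-fold CV", KFold(n_splits=10)),
             ("20-fold CV", KFold(n_splits=20))]
for name, cv in splitters:
    print(name, summarize(gb, X_demo, y_demo, cv))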

The training score under 10-fold cross-validation was slightly higher, although the 10-fold and 20-fold training scores were essentially the same. In the prediction score, Gradient Boosting was the best, with MSE 8977.716, RMSE 94.751, MAE 60.367, and explanatory power R2 0.948 (Table 10). (Figure 3) visualizes the actual K_D_Cnfrm data held out at random against the predictions of each model: the upper-left graph shows K_D_Cnfrm vs. kNN, the lower-left K_D_Cnfrm vs. Tree, the upper-right K_D_Cnfrm vs. Random Forest, and the lower-right K_D_Cnfrm vs. Linear Regression.

Table 10: Comparing the Training Score and Prediction Score.


Figure 3: Actual Values and Predicted Values by Model.


Optimal Model Selection

Even in the random sampling results, Gradient Boosting had an excellent model fit: its R_Squared was 0.946, superior to AdaBoost. (Table 11) gives the results of random sampling, and in predictions Gradient Boosting was the best at 0.945. In cross-validation, both 10-fold and 20-fold cross-validation showed higher model explanatory power (Table 12). When the models were compared by CVRMSE and MSE, AdaBoost and Gradient Boosting were the best; (Table 13) gives this model comparison. Comparing the models by MSE under cross-validation gives a ratio of 0.685:0.315, showing that Gradient Boosting is superior, and the CVRMSE ratio of 0.749:0.251 likewise favors Gradient Boosting (Table 13). In the AdaBoost model, the base estimator was set to Tree and the number of estimators to 50.

Because good results can be obtained with a small number of estimators, the model responds sensitively to this setting. The learning rate was 1.00000 and no fixed seed was set for the random generator. SAMME.R and SAMME were used as the classification boosting algorithms, and the regression loss function was computed with Linear, Square, and Exponential losses; the resulting values are shown in (Table 14), which compares the performance of the SAMME and SAMME.R algorithms. According to Zhu et al. (2009), SAMME.R uses probability estimates to update the additive model, whereas SAMME uses only the predicted class labels. Because the number of iterations is the same, the values here are identical; in general, however, the SAMME.R algorithm converges faster than SAMME, achieving lower test errors with fewer boosting iterations. (Table 15) shows AdaBoost's training score and predictions: the training score, 0.997 to 1.000, indicates overfitting, and the predictive power decreased slightly, to 0.916-0.917.

Table 11: Random Sampling Result.


Table 12: 10-fold Cross-Validation.


Table 13: Model comparison by CVRMSE and MSE.


Table 14: AdaBoost Results by Boosting Method and Regression Loss Function.


Table 15: AdaBoost's Training Score and Predictions.

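The AdaBoost settings described above can be expressed approximately with scikit-learn's AdaBoostRegressor; the default base estimator is already a shallow decision tree, and the three regression losses map to the loss argument. SAMME and SAMME.R apply only to the classification variant (AdaBoostClassifier), so they do not appear here.

# Indicative AdaBoost regression configurations for the three loss functions.
from sklearn.ensemble import AdaBoostRegressor

ada_models = {
    loss: AdaBoostRegressor(n_estimators=50,     # number of estimators = 50
                            learning_rate=1.0,   # learning rate = 1.00000
                            loss=loss)
    for loss in ("linear", "square", "exponential")
}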

The gradient boosting methods compared were Extreme Gradient Boosting (XGBoost), Gradient Boosting (scikit-learn), Extreme Gradient Boosting Random Forest (XGBoost), and Gradient Boosting (CatBoost); the estimated values for each method are shown in the table below. For the basic properties, the number of trees was 100 and the learning rate 0.300, with replicable training allowed. For regularization, lambda was set to 1 or 3. The limit on the depth of individual trees for growth control was 6, and for subsampling the fraction of training instances, features for each tree, features for each level, and features for each split were all set to 1.00. (Table 16) shows the training scores and predictions: the training score of Extreme Gradient Boosting (XGBoost) was 1.000, Gradient Boosting (CatBoost) 0.995, and Gradient Boosting (scikit-learn) 0.990. AdaBoost, above, also reached 1.000; performance was excellent but carries a risk of overfitting. The appropriate model is therefore selected by examining the prediction values (Figure 4).

Figure 4: Predictions of Gradient Boosting (CatBoost) and AdaBoost.


Table 16: Training Score and Predictions of Gradient Boosting.

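The gradient-boosting family compared above can be sketched as follows with the reported settings (100 trees, learning rate 0.3, lambda 1 or 3, tree depth 6, subsampling fractions 1.00); xgboost and catboost are separate packages that must be installed, and parameter names differ slightly between libraries, so this is an indicative mapping rather than the exact study configuration.

# Indicative configurations of the four gradient-boosting variants compared above.
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor, XGBRFRegressor
from catboost import CatBoostRegressor

models = {
    "Gradient Boosting (scikit-learn)": GradientBoostingRegressor(
        n_estimators=100, learning_rate=0.3, max_depth=6, subsample=1.0),
    "Extreme Gradient Boosting (XGBoost)": XGBRegressor(
        n_estimators=100, learning_rate=0.3, max_depth=6,
        reg_lambda=1.0, subsample=1.0, colsample_bytree=1.0),
    "Extreme Gradient Boosting Random Forest (XGBoost)": XGBRFRegressor(
        n_estimators=100, max_depth=6, reg_lambda=1.0),
    "Gradient Boosting (CatBoost)": CatBoostRegressor(
        iterations=100, learning_rate=0.3, depth=6,
        l2_leaf_reg=3.0, verbose=False),
}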

In predictions, Extreme Gradient Boosting (XGBoost) scored 0.945, Gradient Boosting (CatBoost) 0.948, and Gradient Boosting (scikit-learn) 0.933; AdaBoost, above, scored 0.916-0.917. Gradient Boosting (CatBoost), which at 0.948 had the best predictive power, was therefore used for estimation in this study. The graph on the left of (Figure 4) compares the actual test data with the predictions of Gradient Boosting (CatBoost), and the graph on the right compares the AdaBoost predictions with the actual data; the CatBoost predictions are visibly more accurate. According to CatBoost's prediction results, predictive power was poor in the region where the number of confirmed cases caused by the mutated virus rose rapidly. In particular, the CatBoost model was effective in predicting the small and irregular infection counts of the early stage, but its predictions for the period of rapid increase caused by the Delta variant were somewhat ineffective.

Conclusion

Sufficient training data covering the most recent period were used. The number of confirmed cases was predicted from the latest information, including the numbers of people who had received a first vaccine dose and who had completed vaccination. We used a predictive model trained only on confirmed-case information, subdivided it by parameters, and proposed an accurate and effective model for predicting the number of confirmed COVID-19 cases. Neural networks, ensembles, distance-based models, and linear regression were used as the supervised learning models. Among the models with strong predictive power, Gradient Boosting and AdaBoost had high training scores, and CatBoost showed the best predictive power among the gradient boosting models in per-model cross-validation, with about 94.8% of its predictions being accurate.

According to CatBoost's prediction results, predictive power was poor in the region where the number of confirmed cases caused by the mutated virus rose rapidly. In particular, the CatBoost model was effective in predicting the small and irregular infection counts of the early stage, but its predictions for the period of rapid increase caused by the Delta variant were somewhat ineffective. As future work, prediction algorithms based on unsupervised machine learning techniques should be implemented and compared, and predictions should incorporate relevant policy variables, such as the level and enforcement of social distancing.

Author Contributions

Conceptualization, J. Moon and J. Lee; methodology, J. Moon; project administration, H. Lee and H. Yoon; data curation, J. Moon and H. Lee; result data acquisition, J. Moon and J. Lee; writing, original draft preparation, J. Moon; visualization, H. Lee; funding acquisition, J. Moon and H. Yoon; supervision, J. Moon and H. Yoon. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by a grant (2021-MOIS61-02-0000-2021) for the Development of a Location-Oriented Virus Safety Map, funded by the Ministry of the Interior and Safety (MOIS, Korea).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ceylan Z (2020) Estimation of COVID-19 prevalence in Italy, Spain, and France. Sci Total Environ 729: 138817.
  2. Chimmula VKR, Zhang L (2020) Time series forecasting of COVID-19 transmission in Canada using LSTM networks. Chaos Solitons & Fractals 135: 109864.
  3. Yang Z, Zeng Z, Wang K, Wong SS, Liang W, et al. (2020) Modified SEIR and AI prediction of the epidemics trend of COVID-19 in China under public health interventions. J Thorac Dis 12: 165-174.
  4. He S, Peng Y, Sun K (2020) SEIR modeling of the COVID-19 and its dynamics. Nonlinear Dyn 101: 1667-1680.
  5. Arora P, Kumar H, Panigrahi BK (2020) Prediction and analysis of COVID-19 positive cases using deep learning models: A descriptive case study of India. Chaos, Solitons & Fractals 139: 110017.
  6. Pandey G, Chaudhary P, Gupta R, Pal S (2020) SEIR and Regression Model based COVID-19 outbreak predictions in India. arXiv:2004.00958.
  7. (2021) CSSEGISandData: COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University. https://github.com/CSSEGISandData/COVID-19
  8. Alzahrani S, Aljamaan I, Al-Fakih E (2020) Forecasting the Spread of the COVID-19 Pandemic in Saudi Arabia Using ARIMA Prediction Model Under Current Public Health Interventions. Journal of Infection and Public Health 13: 914-919.
  9. Pinter G, Felde I, Mosavi A, Ghamisi P, Gloaguen R (2020) COVID-19 Pandemic Prediction for Hungary, A Hybrid Machine Learning Approach. Mathematics 8: 890.
  10. Jino K, Jisu K, Heewon K, Hyosang P, et al. (2020) Comparison and Analysis of COVID-19 Confirmed Cases Based on the SIR Model and LSTM. Korea Intelligent Information Systems Society, pp. 59-64.
  11. Bae JS, Kim SB (2021) Predictions of COVID-19 in Korea Using Machine Learning Models. Journal of the Korean Institute of Industrial Engineers 47: 272-279.
  12. Shi P, Dong Y, Yan H, Zhao C, Li X, et al. (2020) Impact of temperature on the dynamics of the COVID-19 outbreak in China. Science of The Total Environment 728: 138890.
  13. To WM (2020) How Big is the Impact of COVID-19 (and Social Unrest) on the Number of Passengers of the Hong Kong International Airport?
  14. Kumar A (2020) Modeling geographical spread of COVID-19 in India using network-based approach.
  15. Kim M, Kang J, Kim D, Song H, Min H, et al. (2020) Hi-COVIDNet: Deep Learning Approach to Predict Inbound COVID-19 Patients and Case Study in South Korea. 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2020: 3466-3473.
  16. Qin L, Sun Q, Wang Y, Wu KF, Chen M, et al. (2020) Prediction of Number of Cases of 2019 Novel Coronavirus (COVID-19) Using Social Media Search Index. Int J Environ Res Public Health 17(7): 2365.
  17. Jahanbin K, Rahmanian V (2020) Using twitter and web news mining to predict COVID-19 outbreak. Asian Pacific Journal of Tropical Medicine 13: 378-380.
  18. Li C, Chen LJ, Chen X, Zhang M, Pang CP, et al. (2020) Retrospective analysis of the possibility of predicting the COVID-19 outbreak from Internet searches and social media data, China, 2020. Euro Surveill 25(10): 2000199.
  19. Prata DN, Rodrigues W, Bermejo PH (2020) Temperature significantly changes COVID-19 transmission in (sub)tropical cities of Brazil. Science of The Total Environment 729: 138862.