Logistic Model of Credit Risk During the COVID-19 Pandemic

In this paper, the Markov Chain Monte Carlo (MCMC) method is used to estimate the parameters of the logistic model, which is then used to classify the credit risk levels of bank customers. OpenBUGS is Bayesian analysis software based on the MCMC method. This paper uses OpenBUGS to obtain Bayesian estimates of the parameters of a binomial logistic regression model together with their corresponding confidence intervals. The data contain the values of 20 variables that may be related to overdue credit for 1000 customers. First, the “Boruta” method is adopted to screen the quantitative indicators that have a significant impact on overdue risk, and the optimal segmentation method is then used to bin them. Next, the three most useful qualitative variables are selected according to their WOE and IV values and encoded as one-hot variables. Finally, 10 variables are selected, and OpenBUGS is used to estimate the parameters of all of them. The results lead to the following conclusion: a customer’s credit history and the existing state of the checking account have the greatest impact on the customer’s delinquency risk, so the bank should pay more attention to these two aspects when evaluating a customer’s risk level during the COVID-19 pandemic.

Logistic regression (LR) is a classification algorithm whose output is always between 0 and 1. First, consider LR for binary classification. The idea is to map the regression value of each point into the interval between 0 and 1 using the sigmoid function shown in Figure 1. As shown in the figure, with $z = w \cdot x + b$, when $z > 0$, the greater $z$ is, the closer the sigmoid value is to 1 (but it never exceeds 1). Conversely, when $z < 0$, the smaller $z$ is, the closer the sigmoid value is to 0 (but it never falls below 0). This means that for a binary classification task (positive examples labeled 1, negative examples labeled 0), a linear regression value $z = w \cdot x + b$ is first computed for each sample in the sample space and then mapped through the sigmoid function, $g = \mathrm{sigmoid}(z)$. Finally, the class label of each sample is output (every value between 0 and 1 that is greater than 0.5 is marked as a positive example), which completes the binary classification. The final output can be regarded as the probability, computed by the model, that the sample point belongs to the positive class [1].
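The mapping described above can be illustrated with a minimal Python sketch. The weights, bias and sample are hypothetical values chosen only for illustration, and the helper names `sigmoid` and `predict_label` are not from the paper.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score z into the interval between 0 and 1."""
    # Numerically stable form: avoids overflow in exp() for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_label(w, x, b, threshold=0.5):
    """Return (probability, label) for one sample x, with z = w . x + b."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = sigmoid(z)
    return p, int(p > threshold)

# Hypothetical weights and sample (illustration only): z = 1.6 - 0.4 - 0.2 = 1.0
p, label = predict_label(w=[0.8, -0.4], x=[2.0, 1.0], b=-0.2)
print(round(p, 4), label)
```

Because $z$ is positive for this sample, the sigmoid value lies above 0.5 and the sample is labeled as a positive example.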
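Thus, we can define the general model of binary LR as follows. For a given input $x$, the conditional probabilities $P(Y=1 \mid x) = \frac{\exp(w \cdot x + b)}{1 + \exp(w \cdot x + b)}$ and $P(Y=0 \mid x) = \frac{1}{1 + \exp(w \cdot x + b)}$ can be obtained, and the instance $x$ is classified into the category with the higher probability. The odds of an event are the ratio between the probability of its occurrence and the probability of its non-occurrence. If the probability of occurrence is $p$, the odds of the event are $p/(1-p)$, and the log odds, or logit function, of the event is $\mathrm{logit}(p) = \log \frac{p}{1-p}$. For logistic regression, $\log \frac{P(Y=1 \mid x)}{1 - P(Y=1 \mid x)} = w \cdot x + b$ can be obtained.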
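That is, the log odds of the output $Y=1$ in the logistic regression model is a linear function of the input $x$. When learning a logistic regression model, for a given data set $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ with $y_i \in \{0, 1\}$, write $\pi(x) = P(Y=1 \mid x)$ and absorb the bias $b$ into $w$ by appending a constant 1 to $x$. The likelihood function is $\prod_{i=1}^{N} [\pi(x_i)]^{y_i} [1 - \pi(x_i)]^{1 - y_i}$. The log-likelihood function is $L(w) = \sum_{i=1}^{N} \left[ y_i (w \cdot x_i) - \log(1 + \exp(w \cdot x_i)) \right]$. The gradient descent algorithm or Newton's method can be used to find the maximum of $L(w)$ and the estimate $\hat{w}$ of $w$. The fitted logistic regression model is then $P(Y=1 \mid x) = \frac{\exp(\hat{w} \cdot x)}{1 + \exp(\hat{w} \cdot x)}$ and $P(Y=0 \mid x) = \frac{1}{1 + \exp(\hat{w} \cdot x)}$.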
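As a rough illustration of the maximum likelihood step, the following Python sketch maximizes the log-likelihood by plain gradient ascent (equivalent to gradient descent on the negative log-likelihood). The toy data, learning rate and iteration count are assumptions made for this example; the bias is absorbed into the weights through a leading constant 1 in each input.

```python
import math

def log_likelihood(w, xs, ys):
    """Sum over samples of y_i * (w . x_i) - log(1 + exp(w . x_i))."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = sum(wj * xj for wj, xj in zip(w, x))
        total += y * z - math.log1p(math.exp(z))
    return total

def fit_gradient_ascent(xs, ys, lr=0.1, n_iter=2000):
    """Maximize the log-likelihood; its gradient is sum_i (y_i - p_i) * x_i."""
    w = [0.0] * len(xs[0])
    for _ in range(n_iter):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            z = sum(wj * xj for wj, xj in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for j, xj in enumerate(x):
                grad[j] += (y - p) * xj
        # Average the gradient so the step size does not depend on sample size.
        w = [wj + lr * gj / len(xs) for wj, gj in zip(w, grad)]
    return w

# Hypothetical, non-separable toy data: each x is [1, v], the 1 absorbing the bias.
vs = [-2, -1, -1, 0, 0, 1, 1, 2]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
xs = [[1.0, float(v)] for v in vs]

w_hat = fit_gradient_ascent(xs, ys)
print(w_hat, log_likelihood(w_hat, xs, ys))
```

Newton's method would typically converge in far fewer iterations; gradient ascent is used here only because it is shorter to write down.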

MCMC
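The Markov property can be written as $P(X_{t+1} = x \mid X_t, X_{t-1}, \ldots) = P(X_{t+1} = x \mid X_t)$. That is, the state transition probability depends only on the current state. Let $P$ be the transition probability matrix, where $P_{ij}$ represents the probability of the transition from state $i$ to state $j$. For an irreducible, aperiodic chain it can then be shown that $\lim_{n \to \infty} P^n$ is a matrix whose rows are all equal to $\pi = (\pi_1, \pi_2, \ldots, \pi_j, \ldots)$, where $\pi$ is the solution to $\pi P = \pi$.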
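The convergence of the transition matrix powers can be checked numerically. The sketch below uses a hypothetical two-state transition matrix (not taken from the paper) and plain Python lists; for this matrix the stationary distribution is (5/6, 1/6).

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[t] * b[t][j] for t in range(inner)) for j in range(cols)]
            for row in a]

def mat_power(p, n):
    """Compute the n-th power of a square matrix by repeated multiplication."""
    size = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, p)
    return result

# Hypothetical transition matrix: each row sums to 1.
P = [[0.9, 0.1],
     [0.5, 0.5]]

Pn = mat_power(P, 100)
pi = Pn[0]                      # every row of P^n approaches pi
pi_next = mat_mul([pi], P)[0]   # pi P, which should reproduce pi
print(Pn)
```

Both rows of the matrix power agree, and multiplying the limiting row by the transition matrix leaves it unchanged, which is the defining property of the stationary distribution.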

Data Description and Preprocessing
The German credit card data set is adopted in this paper. It contains 20 variables, including 7 quantitative variables and 13 qualitative variables; the details are shown in Table 1. Different variables influence credit overdue to different degrees. Adopting too many variables not only increases the cost of collecting data and wastes customers' time, but also increases the complexity of the model and reduces the accuracy of prediction. Therefore, before fitting the model, all the indicators need to be screened to retain those that have a significant effect. The screening is introduced below in two parts: the screening of quantitative indicators and the screening of qualitative indicators.

"Boruta" Screening of Quantitative Indicators
The goal of Boruta is to select all features related to the dependent variable, which helps us understand the factors influencing the dependent variable more comprehensively and thus conduct feature selection in a better and more efficient way.

Algorithm Process
e) A real feature whose Z_score is greater than max Z (the maximum Z_score among the shadow features) is marked as "important"; a real feature whose Z_score is significantly less than max Z is marked as "unimportant" and is permanently removed from the feature set.
f) Delete all shadow features.
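The comparison between real and shadow features can be sketched in Python as follows. This is only an illustration of the shadow-feature idea, under strong simplifying assumptions: the absolute Pearson correlation with the target stands in for the random-forest Z_score that Boruta actually uses, the statistical test on the hit counts is omitted, and the data are simulated. The function names are hypothetical.

```python
import random

def abs_corr(xs, ys):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return abs(sxy) / (sxx * syy) ** 0.5

def shadow_hits(features, target, n_iter=20, seed=0):
    """Count how often each real feature beats the best shadow feature."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_iter):
        # Shadow features: shuffled copies of the real features.
        shadow_scores = []
        for values in features.values():
            shadow = values[:]
            rng.shuffle(shadow)
            shadow_scores.append(abs_corr(shadow, target))
        max_shadow = max(shadow_scores)
        for name, values in features.items():
            if abs_corr(values, target) > max_shadow:
                hits[name] += 1
        # The shadow copies are local to this iteration and discarded here.
    return hits

# Simulated data: "signal" drives the target, "noise" is unrelated to it.
rng = random.Random(42)
signal = [rng.gauss(0, 1) for _ in range(300)]
noise = [rng.gauss(0, 1) for _ in range(300)]
target = [1 if s + rng.gauss(0, 0.5) > 0 else 0 for s in signal]

hits = shadow_hits({"signal": signal, "noise": noise}, target)
print(hits)
```

A feature that beats the best shadow feature in nearly every iteration would be a candidate for "important", while one that rarely does would be a candidate for removal.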
Here, N is the number of groups, and IV can be used to represent the grouping ability of a variable, as shown in Table 2. To make the difference between groups as large as possible, the smbinning package in R is used to segment the continuous variables duration, credit_amount and age with the optimal segmentation method. The segmentation results are shown in Figure 3.
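For reference, WOE and IV can be computed from per-group counts as in the Python sketch below. It assumes the common convention in which the WOE of a group is the logarithm of the ratio between that group's share of good customers and its share of bad customers, and IV is the sum over the N groups of the share difference times the WOE. The paper's exact convention is not recoverable from the text, and the counts are hypothetical.

```python
import math

def woe_iv(groups):
    """groups: list of (n_good, n_bad) counts, one pair per group.

    Returns the per-group WOE values and the total IV.
    Assumes every group has at least one good and one bad customer.
    """
    total_good = sum(good for good, _ in groups)
    total_bad = sum(bad for _, bad in groups)
    woes, iv = [], 0.0
    for good, bad in groups:
        share_good = good / total_good
        share_bad = bad / total_bad
        woe = math.log(share_good / share_bad)
        woes.append(woe)
        iv += (share_good - share_bad) * woe
    return woes, iv

# Hypothetical counts for a variable binned into N = 3 groups.
counts = [(100, 100), (300, 150), (300, 50)]
woes, iv = woe_iv(counts)
print([round(v, 3) for v in woes], round(iv, 3))
```

Each term of the IV sum is non-negative, so a larger IV indicates that the groups separate good and bad customers more strongly.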

Model Training and Prediction
The significance level was set to 0.05. A total of 10 variables were screened to establish the model $\mathrm{logit}(p) = \beta_0 + \sum_{i=1}^{10} \beta_i x_i$, where $p$ is the probability that a customer is overdue, $\beta_0$ is the constant term, and $\beta_i$ ($i = 1, 2, \ldots, 10$) are the partial regression coefficients of the independent variables. The parameters of the model are given independent "non-informative" prior distributions, and OpenBUGS is used for modeling and sampling. The model is specified as a Doodle in OpenBUGS, which fixes the distribution type and logical relationships of the parameters, as shown in Figure 4. Each oval represents a node in the graph and each rectangle a constant node; a single arrow points from a parent node to a stochastic child node, and a hollow double arrow points from a parent node to a logical child node. The outer rectangle is a plate, and the label "for (i IN 1 : N)" in its lower left corner denotes a for loop, which is used to compute the likelihood contributions of all samples and thus obtain the overall likelihood function [3]. The posterior distribution statistics for each parameter, obtained using OpenBUGS, are shown in Table 6.
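To give a feel for how MCMC produces posterior summaries for such a model, here is a small Python sketch of Bayesian logistic regression with one predictor. It is not the OpenBUGS model: it swaps in a random-walk Metropolis sampler instead of the sampler OpenBUGS selects, uses simulated data, and assumes a vague normal prior with standard deviation 100 on each coefficient. All names and tuning values are hypothetical.

```python
import math
import random

def log_posterior(beta, xs, ys, prior_sd=100.0):
    """Log-likelihood of the logistic model plus a vague normal log-prior."""
    lp = -sum(b * b for b in beta) / (2.0 * prior_sd ** 2)
    for x, y in zip(xs, ys):
        z = beta[0] + sum(b * xj for b, xj in zip(beta[1:], x))
        # Stable evaluation of log(1 + exp(z)).
        lp += y * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return lp

def metropolis(xs, ys, n_samples=2000, burn_in=500, step=0.2, seed=1):
    """Random-walk Metropolis: propose beta + noise, accept with prob min(1, ratio)."""
    rng = random.Random(seed)
    beta = [0.0] * (len(xs[0]) + 1)
    current = log_posterior(beta, xs, ys)
    draws = []
    for it in range(n_samples + burn_in):
        proposal = [b + rng.gauss(0.0, step) for b in beta]
        candidate = log_posterior(proposal, xs, ys)
        if rng.random() < math.exp(min(0.0, candidate - current)):
            beta, current = proposal, candidate
        if it >= burn_in:
            draws.append(beta)
    return draws

# Simulated data with assumed true coefficients beta0 = -0.5, beta1 = 1.5.
rng = random.Random(0)
xs = [[rng.gauss(0, 1)] for _ in range(100)]
ys = [1 if rng.random() < 1 / (1 + math.exp(0.5 - 1.5 * x[0])) else 0 for x in xs]

draws = metropolis(xs, ys)
post_mean = [sum(d[j] for d in draws) / len(draws) for j in range(2)]
b1_sorted = sorted(d[1] for d in draws)
b1_low = b1_sorted[int(0.025 * len(draws))]
b1_high = b1_sorted[int(0.975 * len(draws))]
print(post_mean, (b1_low, b1_high))
```

The posterior mean and the 2.5% and 97.5% quantiles of the retained draws play the same role as the posterior statistics that OpenBUGS reports for each parameter.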
When dividing customers by overdue risk level, two kinds of misclassification are possible: classifying "high-quality customers" as high-risk customers, and classifying high-risk customers as "high-quality customers". Generally speaking, the economic costs of these two errors differ. For banks, the cost matrix is shown in Table 7 (0 = Good, 1 = Bad) [6]. The rows represent the actual classification and the columns the predicted classification. It is worse to class a customer as good when they are bad (cost = 5) than to class a customer as bad when they are good (cost = 1).
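The effect of the asymmetric costs can be illustrated with a short Python sketch. The decision rule below is an assumption derived from the cost matrix by minimizing expected cost, not necessarily the rule used in the paper: predicting "good" for a customer with overdue probability p costs 5p in expectation, predicting "bad" costs 1 - p, so "bad" is cheaper whenever p > 1/6. The probabilities are hypothetical.

```python
# Cost matrix keyed by (actual, predicted), with 0 = Good and 1 = Bad.
COST = {(0, 0): 0, (0, 1): 1, (1, 0): 5, (1, 1): 0}

def total_cost(actual, predicted):
    """Sum the misclassification costs over all customers."""
    return sum(COST[(a, p)] for a, p in zip(actual, predicted))

def min_cost_label(p_bad):
    """Predict Bad (1) when the expected cost of predicting Good is higher."""
    cost_if_predict_good = COST[(1, 0)] * p_bad
    cost_if_predict_bad = COST[(0, 1)] * (1.0 - p_bad)
    return int(cost_if_predict_good > cost_if_predict_bad)

# Hypothetical true labels and predicted overdue probabilities.
actual = [0, 0, 1, 1]
p_bad = [0.1, 0.3, 0.2, 0.8]

naive = [int(p > 0.5) for p in p_bad]          # plain 0.5 threshold
cost_aware = [min_cost_label(p) for p in p_bad]

print(total_cost(actual, naive), total_cost(actual, cost_aware))
```

In this toy example the cost-aware rule accepts one extra false alarm in exchange for catching a bad customer that the 0.5 threshold misses, which lowers the total cost.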

Kernel Density Plot
According to the density of the drawn samples, the samples drawn by the Gibbs sampling algorithm are mostly concentrated in a small region, which also indicates convergence of the Markov chain [8].

Autocorrelation Diagram
Autocorrelation plots clearly indicate that the chains are not

Conclusion
This paper constructs a binomial logistic regression model based on bank customer characteristic data. The work consists mainly of two parts. The first part is data preprocessing. The original data contain 20 variables; to make the model more concise, improve classification accuracy, and reduce both the cost of information collection and the time cost to customers, the "Boruta" method is used to screen three quantitative indicators, and the optimal segmentation method is used to bin these continuous variables. Then, three qualitative variables are selected for the model by calculating their IV values and are one-hot encoded.
Binary logistic regression in SPSS was used for variable screening. All the selected variables were then entered into OpenBUGS to obtain Bayesian estimates of the parameters of the binomial logistic regression model. The estimation results show that the customer's historical credit (Credit_history) and current economic status (Checking_account_status) have the greatest impact on credit delinquency. Banks should pay more attention to these two aspects when evaluating a customer's credit risk level [10].

Conflict of Interest
We have no conflict of interests to disclose, and the manuscript has been read and approved by all named authors.