Modeling and Predicting Infectious Diseases Cases with Climatic Factors in Hong Kong

Infectious diseases remain a major concern worldwide, since epidemics may cause up to five million cases of severe illness and 500,000 deaths each year [1]. Many northern countries adopt surveillance and vaccination to prevent outbreaks. However, many countries in the tropics underutilize these prevention strategies despite year-round outbreaks [2]. Recent studies show that new sources of infectious disease mainly come from East and Southeast Asia [3,4]. The various routes of transmission and scarce surveillance data make the prevention of infectious diseases more difficult. In temperate areas, infectious diseases typically appear in cold and dry climates [5,6]. In cold and dry weather, people may prefer to crowd indoors, which can raise the risk of transmission through contact [7]. In addition, cold and dry weather is the most favorable for virus transmission [8,9]. Besides humidity and temperature, solar radiation has also been considered as a factor in virus transmission in temperate climates [10]. However, the role of climate in infectious disease transmission in the tropics has attracted less attention. Several regions, such as India, Vietnam and Brazil, observe high infectious disease transmission in the rainy seasons [11-14], while in areas such as Singapore, Thailand and the Philippines, the annual peaks of infectious diseases do not coincide with the rainy seasons [11-14].

with best performance can be used to predict infectious disease outbreaks, which can help develop vaccination strategies and allow hospitals to distribute treatment resources efficiently.

Methods and Data
This study uses monthly counts of infectious disease cases in Hong Kong. We obtain the case data between January 2003 and December 2018 from the monthly statistics published by the Department of Health, Government of the Hong Kong Special Administrative Region [17]. The climatic parameters are collected from the Hong Kong Observatory with the same frequency and period [18].
We divide the dataset into two parts, one used to fit the models and one used to evaluate their predictions. The infectious case time series that we analyze in this study is characterized by strong autocorrelation, a property that commonly violates the assumptions of ordinary linear regression. Thus, to account for the autocorrelation, we employ a class of time series techniques, ARIMA (autoregressive integrated moving average). We first develop a univariate ARIMA model, where the response series depends only on its past values and some random shocks, followed by a multivariate ARIMA model with the environmental parameters as inputs. ARIMA is based on the assumption that the response series is stationary, that is, the mean and variance of the series are independent of time. Stationarity can be achieved by differencing the series or by transforming the variable so as to stabilize the variance or mean.
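The effect of transforming and differencing on stationarity can be illustrated with a minimal sketch. The series below is synthetic (it is not the Hong Kong data), and the function name `log_difference` is our own hypothetical helper. The sketch only shows that the mean of a growing seasonal series drifts over time, while the mean of its differenced logarithm stays roughly constant.

```python
import numpy as np

def log_difference(series, d=1):
    """Log-transform a strictly positive series, then difference it d times."""
    values = np.log(np.asarray(series, dtype=float))
    return np.diff(values, n=d)

# Synthetic monthly counts (an assumption for illustration only):
# exponential growth with a 12-month cycle, so the mean rises over time.
months = np.arange(192)  # 16 years of monthly observations
counts = 100.0 * np.exp(0.01 * months) * (1.0 + 0.3 * np.sin(2 * np.pi * months / 12))

stationary = log_difference(counts, d=1)

# Compare the mean of the first and second halves, before and after.
half = len(months) // 2
print("raw means:        ", counts[:half].mean(), counts[half:].mean())
print("differenced means:", stationary[:half].mean(), stationary[half:].mean())
```

In practice, a formal unit-root test would normally complement this kind of visual or summary check.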
In our analysis, we take the logarithmic transformation to reduce the variance of the infectious case time series, and subsequently difference the series until it is stationary. Once the response series is stationary, we examine the ACF (autocorrelation function) and PACF (partial autocorrelation function) to determine the initial AR (autoregressive) and MA (moving average) orders. An ARIMA model is denoted ARIMA (p, d, q), where p indicates the AR order, d the differencing order and q the MA order. Based on the ACF and PACF, we fit several ARIMA models with varying AR and MA orders. In the fitting process, the AR and MA coefficients are estimated using the conditional least squares method.
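For reference, the ARIMA (p, d, q) notation corresponds to the standard form below. The symbols are our assumed notation and do not appear in the original text: y_t is the log-transformed monthly case count, B is the backshift operator, w_t is the differenced series and ε_t is the random shock. A constant term is omitted for brevity.

```latex
% Assumed notation: B y_t = y_{t-1}; \varepsilon_t is a white-noise shock.
\phi(B)\,(1 - B)^{d}\, y_t = \theta(B)\,\varepsilon_t,
\qquad
\phi(B) = 1 - \phi_1 B - \dots - \phi_p B^{p},
\qquad
\theta(B) = 1 + \theta_1 B + \dots + \theta_q B^{q}

% Equivalently, with the differenced series w_t = (1 - B)^{d} y_t:
w_t = \sum_{i=1}^{p} \phi_i\, w_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j\, \varepsilon_{t-j}
```

With d = 1, the differenced series is simply w_t = y_t - y_{t-1}, the month-to-month change in the log case count.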

Results
In this paper, we employ the ARIMA time series model to analyze infectious disease transmission in Hong Kong from January 2003 to December 2018. In the first step, we need to make the series of monthly infectious case counts in Hong Kong, shown in Figure 1, stationary. By taking the log transformation of the series to reduce the variance of the infectious cases and then differencing it, we obtain a stationary series. The ACF and PACF are then used to identify the order of the series. Both the ACF and PACF cut off at lag 2. Furthermore, we fit several univariate ARIMA models of different orders and exclude models whose residuals exhibit autocorrelation. The results are shown in Table 1.
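The order identification step can be sketched as follows. This is a minimal illustration under our own assumptions, not the software used in the study: `sample_acf` and `sample_pacf` are hypothetical helper names, the PACF is computed with the Durbin-Levinson recursion, and the input is a simulated AR(2) series rather than the Hong Kong case counts. For such a series the PACF is expected to cut off after lag 2, while the ACF decays gradually.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[: len(x) - k], x[k:]) / denom
                     for k in range(max_lag + 1)])

def sample_pacf(x, max_lag):
    """Sample partial autocorrelation via the Durbin-Levinson recursion."""
    rho = sample_acf(x, max_lag)
    pacf = np.zeros(max_lag + 1)
    pacf[0] = 1.0
    phi_prev = np.zeros(0)  # coefficients of the order-(k-1) AR fit
    for k in range(1, max_lag + 1):
        if k == 1:
            phi_kk = rho[1]
            phi = np.array([phi_kk])
        else:
            num = rho[k] - np.dot(phi_prev, rho[k - 1:0:-1])
            den = 1.0 - np.dot(phi_prev, rho[1:k])
            phi_kk = num / den
            phi = np.append(phi_prev - phi_kk * phi_prev[::-1], phi_kk)
        pacf[k] = phi_kk
        phi_prev = phi
    return pacf

# Simulated AR(2) series (illustrative coefficients, not estimates from the paper).
rng = np.random.default_rng(0)
n = 2000
series = np.zeros(n)
noise = rng.normal(size=n)
for t in range(2, n):
    series[t] = 0.5 * series[t - 1] + 0.3 * series[t - 2] + noise[t]

acf = sample_acf(series, 6)
pacf = sample_pacf(series, 6)
band = 1.96 / np.sqrt(n)  # approximate 95% band under white noise
print("ACF: ", np.round(acf, 3))
print("PACF:", np.round(pacf, 3), "band: +/-", round(band, 3))
```

Lags whose PACF values fall inside the band are treated as zero when choosing the initial AR order; the ACF plays the same role for the MA order.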
As we can see in Table 1, for the fitting dataset, ARIMA (2,1,2) performs best on the fit RMSE criterion, while ARIMA (2,1,1) has the best predictive RMSE and the lowest AIC. Between these two univariate models, the difference in AIC is 5%, in fit RMSE 17%, and in predictive RMSE 9%. Since the fit RMSE shows the largest difference and the other two are relatively small, we choose ARIMA (2,1,2) as the baseline model for further comparison.
In the next step, we add the environmental factors to the model to examine whether the performance can be improved. We first examine the correlations between the infectious cases and the environmental series. The results in Table 2 confirm significant correlations between infectious cases and temperature at lag 2, and rainfall at lag 3. The multivariate ARIMA models are then estimated with one or more environmental factors. The performance of these models is shown in Table 1. Among these multivariate models, the best fit RMSE is obtained from ARIMA (2,1,2) with temperature and rainfall, ARIMA (2,1,1) with temperature has the lowest AIC, and ARIMA (2,1,1) with rainfall has the best prediction RMSE. Comparing these three models with the baseline univariate models, we find that including environmental factors improves the fit RMSE by 8%, the AIC by 14% and the prediction RMSE by 11%. Among the three best multivariate models, ARIMA (2,1,1) with rainfall has the highest AIC, so we exclude this model from our list. Between ARIMA (2,1,2) with both temperature and rainfall and ARIMA (2,1,1) with temperature, the difference in AIC is 3%, whereas the difference in fit RMSE is 19%. We therefore choose the ARIMAX (2,1,2) model with temperature and rainfall as inputs as the best model. We also find that the univariate ARIMA model can forecast one-step-ahead infectious cases relatively well.
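The lag screening between the case series and an environmental series can be sketched as below. This is an illustration under stated assumptions rather than the procedure behind Table 2: `lagged_correlation` is a hypothetical helper, both series are synthetic, and the case series is constructed to respond to the driver two months earlier so that the scan has a known answer. Only plain Pearson correlations are computed here, with no significance test.

```python
import numpy as np

def lagged_correlation(cases, driver, lag):
    """Pearson correlation between cases[t] and driver[t - lag], for lag >= 0."""
    cases = np.asarray(cases, dtype=float)
    driver = np.asarray(driver, dtype=float)
    if lag > 0:
        cases, driver = cases[lag:], driver[:-lag]
    return np.corrcoef(cases, driver)[0, 1]

# Synthetic data (assumption): a deseasonalized temperature anomaly and a
# case series that follows it with a two-month delay plus noise.
rng = np.random.default_rng(1)
n = 192
temperature_anomaly = rng.normal(size=n)
cases = rng.normal(scale=0.5, size=n)
cases[2:] += 0.8 * temperature_anomaly[:-2]

# Scan lags 0..6 months and keep the one with the strongest correlation.
correlations = {lag: lagged_correlation(cases, temperature_anomaly, lag)
                for lag in range(7)}
best_lag = max(correlations, key=lambda k: abs(correlations[k]))

for lag, r in correlations.items():
    print(f"lag {lag}: r = {r:+.3f}")
print("strongest lag:", best_lag)
```

The selected lag determines how far the environmental input is shifted before it enters the multivariate model. With strongly seasonal or autocorrelated series, removing that structure first helps avoid spurious peaks in the scan.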
The best univariate model is ARIMA (2,1,2), in which the infectious cases depend on the cases in the past two months. In the multivariate ARIMA models, we find that temperature and rainfall are significantly related to the infectious cases in Hong Kong. A relationship between rainfall and infectious diseases has been observed in tropical countries such as Singapore, Brazil and Thailand [12][13][14]. However, it is not clear whether rainfall is connected with transmission effectiveness, virus survivorship or host susceptibility. Intuitively, rainfall may cause changes in social activity, which in turn promote the transmission of infectious disease. For example, on rainy days, people may prefer to stay indoors, which increases the chance of contact with other people.
The rainy season in Hong Kong lasts from April to September. Meanwhile, the infectious disease transmission peaks typically occur around March and April, which is considered to be within the rainy season.
Temperature is often associated with infectious diseases, for example in Tokyo [6]. Especially in northern regions, infectious disease peaks typically coincide with winter. The prevailing dry and cold climate during winter seems to enhance infectious disease transmission, though this is not the case in the tropics. Lowens et al. find that low temperature (5°C) and small amount of rainfall
Most of the models developed depend on the infectious cases of the past one to two months. A more common approach is to predict the infectious cases with forecasts more than one step ahead. That is, future forecasts are calculated from previously predicted numbers of cases instead of the actual cases from the surveillance data (as in the one-step-ahead approach). However, one caveat of this approach is that more data is needed, since model selection will be based not only on the RMSE of the fitting dataset but also on that of the prediction dataset.
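The difference between one-step-ahead and multi-step-ahead forecasting can be made concrete with a minimal sketch. As a simplification, a plain AR(2) model fitted by least squares stands in for the ARIMA models of this study; the data are simulated, the 168/24 split is a hypothetical choice, and all function names are our own.

```python
import numpy as np

def fit_ar2(y):
    """Estimate y[t] = c + a1*y[t-1] + a2*y[t-2] by least squares."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y) - 2), y[1:-1], y[:-2]])
    coef, *_ = np.linalg.lstsq(X, y[2:], rcond=None)
    return coef  # [c, a1, a2]

def one_step_forecasts(coef, history, actual):
    """Each forecast uses the two most recent observed values."""
    c, a1, a2 = coef
    full = np.concatenate([history, actual])
    start = len(history)
    return np.array([c + a1 * full[t - 1] + a2 * full[t - 2]
                     for t in range(start, len(full))])

def multi_step_forecasts(coef, history, steps):
    """Each forecast is fed back in as if it were an observation."""
    c, a1, a2 = coef
    path = list(history[-2:])
    for _ in range(steps):
        path.append(c + a1 * path[-1] + a2 * path[-2])
    return np.array(path[2:])

def rmse(pred, actual):
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(actual)) ** 2)))

# Simulated AR(2) series (illustrative only, not the surveillance data).
rng = np.random.default_rng(2)
n = 192
y = np.zeros(n)
shocks = rng.normal(size=n)
for t in range(2, n):
    y[t] = 0.5 * y[t - 1] + 0.3 * y[t - 2] + shocks[t]

fit, holdout = y[:168], y[168:]  # hypothetical fitting/prediction split
coef = fit_ar2(fit)
one_step = one_step_forecasts(coef, fit, holdout)
multi_step = multi_step_forecasts(coef, fit, len(holdout))

print("one-step RMSE:  ", rmse(one_step, holdout))
print("multi-step RMSE:", rmse(multi_step, holdout))
```

The two approaches agree on the first forecast and diverge afterwards, because the multi-step path accumulates its own prediction errors while the one-step forecasts are corrected by each new observation.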

Conclusion
In this study, we combine climatic parameters with infectious disease cases through the ARIMA time series model.
By comparing several different models, we conclude that ARIMA (2,1,2) with temperature and rainfall included outperforms the other models and is the best model for predicting infectious disease cases in the next period. This model can also approximately explain the two peaks of infectious cases each year. Finally, the models in this study are a first step towards developing an early warning system for infectious diseases.