COVID-19 Probabilistic inference with logistic growth
Updated: Sep 30, 2020
Authors: Are Haartveit, Harald Husum and Tom Van de Wiele with equal contributions
Continuation of Part 1: COVID-19 projections.
DISCLAIMER: The modeling effort described in this article aimed to predict the first wave of COVID-19 related deaths and is not updated since May 3, 2020. The graphs and explanatory text will not be updated.
In this part of the blog post we dig deeper into the data sources that were used for our analysis. We also describe our interactive data plots and explain how we make predictions for the evolution of the pandemic in each of the studied countries.
Our aim is to combine the best of the plotting that is out there, make it interactive and update it with the most recent data. Therefore we decided to create a notebook that is hooked up with the most recent data and can be run again as new data comes in. As for the modeling, the reasoning is similar: we want to use the best technical approach for this type of problem (probabilistic inference), make it interactive and update it with the most recent data.
Please note that the interactive plots are optimized for desktop use. We update our plots on a daily basis. Please consult our notebook if you want to generate forecasts with the most recent data.
A sneak peek of the daily death forecasts can be found below. It seems like many of the worst affected European countries such as Spain, Italy, Belgium and France are near or past their respective peaks of daily deaths. The predictions for countries that are in a relatively early stage of the epidemic, such as Australia, India and Indonesia, are subject to high variance.
In this analysis we use data on COVID-19 cases and deaths on a country by country basis. There are quite a few datasets that cover this, but the most accurate one we could find originates from the Johns Hopkins Coronavirus Resource Center, and is published as a github repository. Even though this dataset contains data for regions within certain countries, we decided to aggregate this data in order to do the analysis on a country level. This dataset also does not cover the original outbreak in China, so we use the daily released spreadsheets from the European Centre for Disease Prevention and Control for our China specific data. These combined datasets provide a fairly accurate image of the current state of the pandemic.
For population data, we use the United Nation’s probabilistic estimates for 2020. These are not the most accurate or up-to-date data for any given country, but were easy to combine with our other datasets. They are certainly accurate enough for the purposes covered in this investigation.
When attempting to gain knowledge from data, a good initial step is to perform some kind of exploratory data analysis. In our case, due to the wide interest in the spread of SARS-CoV-2, there are already a lot of good graphical representations of our data out there. Consider for instance the Johns Hopkins interactive case tracking map. We focus on graphical representations that are not as readily available, or that we find particularly interesting.
One feature we felt was lacking in a lot of visualizations is the ability to easily compare countries. It is desirable to be able to know which country is an infection hotspot at any moment in time. Our first plot shows us that the US has surpassed Italy as the global epicenter of confirmed new infections (as of March 22).
The similar plot for deaths indicates that the US has recently become the global epicenter of deaths due to COVID-19 (as of April 1).
The countries in these plots are all at different stages of the epidemic. In order to better compare the trajectories that different countries follow, it can be useful to offset the starting point of each country, so that the relative stages of the epidemic align. We assume that countries that are now at the start of their trajectories will roughly follow the patterns observed in countries that have already seen widespread infection. Given this assumption, plots with some kind of common anchor point can help us form some rough prognosis for how the epidemic might develop in countries who are currently seeing the start of their outbreak. We see that there is quite a bit of variability in the development of both deaths and cases across countries. We can see that, for now, Spain and South Korea represent the pessimistic and optimistic outliers in terms of how the death count can develop.
Absolute numbers of both confirmed cases and fatalities are really only comparable between countries in the early stages of an epidemic. Given similar responses to the epidemic in countries of different size, we expect the amplitude of their graphs to be proportional to the total population. In order to account for this we also plot population relative numbers. The plots that are scaled by population show less variability than the absolute value plots, and might thus give us clearer prognosis for countries.
From a population relative perspective it is obvious how effective China has been at stopping their epidemic. Likewise, it is clear that Spain is going down a trajectory that is more severe than the Italian one. We can also see that Belgium is doing the worst at the time of this writing (April 15). An important side note here is that there is more variation for countries with less inhabitants.
In our attempt to model the future of the outbreak, we decided to focus on the number of reported deaths. The number of reported deaths due to COVID-19 is likely to be approximately accurate and therefore comparable between countries.
The number of confirmed cases, however, suffers from under reporting, availability of sufficient tests and variability of testing procedures. As the epidemic progresses, countries tend to switch from following up close contacts of confirmed cases in combination with extensive testing to focusing on the most urgent cases, resulting in more cases not being confirmed due to the absence or mildness of disease symptoms. As a result, there is significant noise in the timing and total count of confirmed cases, leading to large differences between countries in terms of the estimates for the proportion of symptomatic cases reported.
As opposed to the number of confirmed cases, the number of deaths also seems more clinically relevant as it can serve as a proxy for the load on the healthcare system in a country.
We decided to only model countries that have confirmed at least 1000 cases since it is prohibitively hard to forecast the number of deaths based on a low number of confirmed cases.
Due to the limited resolution of the data (daily) and the limited number of countries, we adopted a probabilistic (also called Bayesian) framework so we can incorporate sensible prior information and obtain reliable uncertainty estimates.
In probabilistic programming, you assume there is a collection of parameters that defines the probability distribution of interest using a graphical model. The inference procedure consists of using the observed data to update the prior estimate of the parameters to the posterior distribution of the parameters. This posterior distribution contains information on the updated distribution of the parameters and can be used to predict new values with an estimated degree of certainty.
The graphical model in our forecasts is intentionally kept simple in order to avoid overfitting and maximize interpretability. Every country is modeled independently, and we only make use of the daily deaths data.
We assume that the number of deaths follows a logistic curve. This function is used to model growth in a constrained environment, where the growth rate will reach a peak, then start to slow down. It is commonly used in biology to model population growth and also appears to be a good fit for the projected number of deaths due to COVID-19.
We define three parameters to describe the derivative of the logistic curve: the central time (halfway point of maximum daily deaths), the peak number of daily deaths and the width of the curve. Given the estimated number of daily deaths on the logistic curve, we assume that the number of reported deaths follows a Poisson distribution.
To estimate the posterior distribution of the parameters for our graphical models, we make use of the Python package PyMC3. The PyMC3 package focuses on advanced Markov chain Monte Carlo and variational fitting algorithms to perform probabilistic inference. We use the standard inference settings and show forecast plots for the mean posterior parameters, together with 100 random samples from the posterior distribution. This combines a “best guess” of the evolution of the epidemic, combined with a degree of uncertainty of the model fit.
The choice of the prior distribution has an important impact on the posterior distribution in cases like this where there is limited data. We decide to select uninformed priors but want to stress that the choice still has a considerable impact on the posterior distributions.
We assume a uniform prior for the central time of the logistic curve in the period between December 31, 2019 (epidemic start date) and the epidemic start date plus twice the number of days between today and the epidemic start date.
The peak number of daily deaths is assumed to follow a log-uniform distribution, expressed as a fraction of the country’s population, with between 10^(-9) and 10^(-4) dying at the worst point.
As for the width constant of the logistic curve, we decide on a uniform prior between 3 and 15, meaning that 90 % of deaths occur in a window of 9 to 44 days.
Interpretation of the forecasts
The forecasts show the logistic curve that is the best fit to the observed data. The best fit is defined as the logistic curve with the mean of the posterior parameters for the central time, peak and width. We also show 100 random samples from the posterior distribution to get an estimate of the uncertainty of the best fit.
The assumption of using the logistic curve proves to be a good fit for countries where the peak of the number of deaths seems to be in the past. Examples of this are China, Iran (to a lesser extent) and South Korea which, at least partly, justifies the choice for the logistic curve.
A limitation and simultaneous strength of our approach is that it is purely data driven and does not take government actions into account. Therefore, the forecasts should only serve as an additional data point in combination with extensive simulation studies and country specific forecasts that take several additional measurements into account such as the ICU capacity and the evolution in the number of confirmed cases.
In countries with few deaths such as Norway and Australia, the degree of uncertainty on what the death curve will look like decreases every day as new data comes in. An interpretation of those forecasts should therefore consider the uncertainty, and the visualization that shows samples from the posterior distribution should therefore be preferred.
Our forecasts indicate that Italy, the most affected country in terms of the total number of deaths, is almost certainly past its peak and that there is hope for the number of deaths to go down further in the coming days. The number of deaths in Spain is also very likely to continue going down at the time of this writing (April 15).
The United Kingdom seems to be close to the likely peak in the number of deaths.
Belgium, one of the worst countries in terms of the cumulative number of deaths per capita since reaching a minimum number of 1 death per 1 million inhabitants, seems to be past its peak number of deaths, with our models suggesting the death toll will likely stabilize or go down in the coming days.
The most likely fit at the time of this writing (April 15) for the United States of America also predicts that the daily deaths is close to its peak. Given the size of the country, a mixture of logistic curves would probably be a better fit for the data so this prediction may very well be too optimistic.
When summing up the daily deaths we obtain the following cumulative forecasts:
How to interact with our analysis
Our technical analysis is coded up in Google Colab and can be found here. Google Colab is a research project by Google that consists of a Jupyter notebook environment that requires no setup to use and runs entirely in the cloud. Feel free to modify and build on our code but please mention the source when doing so.
Our plots are built with the Python library plotly. It is an interactive plotting library that builds on Dash and allows for natural interaction with the data.