
Fit the COVID spike data

In a previous article, we described how to find spikes in the COVID data collected for South Africa. In another article, we also toyed with the idea of building a simple model.

This time we bring some of this work closer together as we look at fitting real data from a spike to a mathematical model.

Overview #

This article starts by taking a closer look at the data we want to fit. It then describes a refined model for growth. After this, we take a closer look at what it means to fit data and how we can decide on the best fit. We then take a look at how the fitting procedure can be automated. Finally, this article shows the results of the best fit we found for the model to the data.

The all-important source data #

For this article, we pick the second spike from the ZA data. The spike started on 22 September 2021 and ran for 103 days until the beginning of January 2022.

When we identified these spikes, we used the counts of new cases. We noticed that there was a date on which the count of new cases started to increase and later, after reaching an apex, the new cases per day started to decline.

The problem we work on here does not concern itself with the spikes; it is all about the growth of the total number of infected individuals. So instead of new cases, we look at the cumulative cases. The cumulative case count is a measure that increases all the time and does not help much to identify spikes, but it is of primary concern if you want to see what the impact of a virus is on a population.

In the data sourced from the HDE[1] we found some days for which a 0 was recorded for new cases. These zeroes create small flat lines in the cumulative values and can add up to cause noticeable deviations from what we expect. We regard these zero values as days that were for some reason not reported, and therefore we consider them to be anomalies.

We iron out these anomalies by replacing every zero value with the average of the surrounding non-zero values. In Fig. 1 you can see the original cumulative cases (grey line) along with the cleaned data (blue line). As you can see, for this spike the difference is negligible, but it is better to be safe, so we continue to use the cleaned data.

Fig 1: The data from the second spike identified for ZA
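If you want to reproduce this cleaning step, a minimal sketch along the following lines should do. It assumes the raw daily new-case counts are in a NumPy array; the function name is ours, not from the original analysis.

```python
import numpy as np

def clean_new_cases(new_cases):
    """Replace zero-valued days (assumed to be unreported) with the average
    of the nearest non-zero neighbours on either side."""
    cases = np.asarray(new_cases, dtype=float)
    nonzero = np.flatnonzero(cases)            # indices of reported days
    for i in np.flatnonzero(cases == 0):       # indices of the anomalies
        left = nonzero[nonzero < i]
        right = nonzero[nonzero > i]
        neighbours = []
        if left.size:
            neighbours.append(cases[left[-1]])
        if right.size:
            neighbours.append(cases[right[0]])
        if neighbours:
            cases[i] = np.mean(neighbours)
    return cases

# The cumulative curve used in the rest of the article would then be
# something like: cumulative = np.cumsum(clean_new_cases(raw_new_cases))
```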

A careful look at the data line reveals that the trend is very linear at the start and then turns later to become more exponential. We know our model is exponential, so what is going on here?

This linear start could relate to how we identified the spike. Recall from the related article that the first point (22 Sept.) is a minimum that was marked as the beginning of a wave. This minimum could instead be the start of a plateau that went on for a week or two. We do not know the exact date when the relative "normality" turned into a higher infection rate that eventually became the second wave. All we know is that this day was some time after 22 September.

Presenting the model #

In the two articles preceding this one, we spent some energy explaining how to get from an idea to a mathematical expression. The idea we use in this article follows similar lines of thought.

The first assumption we make is that there is some constant $\rho$ that ties down the growth rate of new infections. In simple terms, we say that if there were $a$ infected people when the day starts, we will have $\rho \times a$ new cases by the end of the day. Surely things are not that simple? But, for a model that aims to describe what is going on during a spiky trend, this could be a good start.

The next assumption is that we know the day when the exponential growth started. We use $t$ to count days and, on this day, we say $t = 0$.

The third assumption is that we know the day on which the apex was reached. We say that the exponential growth ends when $t = n$ for an exponential growth that lasted $n$ days.

To illustrate these concepts, take the start day as 25 September. Then 22 September is where $t = -3$. Also, you then have the number of days equal to $n = 103 - 3 = 100$.

We define $d_0 \in \mathbb{Z}$ to be the index of the start day. In general, we have $t = -d_0$ for the first entry of the spike data, and $t = 0$ for the $d_0$-th entry of the data.
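As a small illustration (not from the original analysis), the index $d_0$ maps back to a calendar date as follows, using the dates from the example above:

```python
from datetime import date, timedelta

# Illustrative only: the spike data starts on 22 September 2021 (t = -d_0),
# so the start day of the exponential growth (t = 0) is d_0 days later.
SPIKE_DATA_START = date(2021, 9, 22)

def start_day(d0):
    return SPIKE_DATA_START + timedelta(days=d0)

print(start_day(3))   # 2021-09-25, matching the example above
```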

Another assumption we make is that the new cases recorded for the start day ($t = 0$) constitute the crowd that initiated the wave. We define $y_0$, obtained from the data, to be the number of people in this crowd.

Combine the last assumption with the first, and you will see that we expect the initial crowd to infect a constant $y_0 \times \rho$ new individuals per day, for days $1$ to $n$.

With all these assumptions, it is easy to be sceptical about our model. But remember: a model has some purpose to which it is applied. It is not true or false when it stands on its own. Any model is, by definition, a simplification (or gross approximation) of some reality and should never be confused with the reality itself. We do not expect any reality to conform to our current model and we, therefore, have no reason to be sceptical about it. We are, however, interested to see how well this model fits the observations that were gathered from the real world.

Let us now proceed with the incorporation of these assumptions into mathematical expressions that we can work with. (The careful reader should spend some time here to make sure he understands how these equations follow from the assumptions mentioned above.)

The differential equation we have as input is:

$$\frac{dy}{dt} = \rho\, y \tag{1}$$

And, as an initial value, we have:

$$y(0) = y_0 \tag{2}$$

When we solve this ODE[2] we get:

$$y(t) = y_0\, e^{\rho t} \tag{3}$$
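As a quick sanity check (not part of the original article), a computer algebra system reproduces this solution. The snippet below assumes SymPy is available.

```python
import sympy as sp

t, rho, y0 = sp.symbols('t rho y_0', positive=True)
y = sp.Function('y')

# dy/dt = rho * y, with the initial value y(0) = y_0
ode = sp.Eq(y(t).diff(t), rho * y(t))
solution = sp.dsolve(ode, y(t), ics={y(0): y0})

print(solution)   # Eq(y(t), y_0*exp(rho*t))
```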

The meaning of fitting #

To get an idea of how model fitting works, have a look at Fig 2.

Fig 2: An example fit where the data is the blue line and the orange line is the value predicted by the model ($d_0 = 20$ and $\rho = 0.0592$)

Notice how the data (blue line) in Fig. 2, when you compare it with Fig. 1, has not changed shape or location. But other things have changed.

The origin of $t$ and the $y$-axis have shifted. The $t$ axis starts at $-20$, and when $t = -20$ the $y$-value of the data is negative. More importantly, the $y$-axis has been normalised so that $y = 0$ when $t = 0$.

All this shifting is only to bring the data into a context that our model can work with. There is no trickery here - all the adjustments are aligned with the assumptions we made earlier and the equations we have for the model.

This shifting, however, is not fitting. We fit a model by assigning values to the model parameters. In Fig. 2 we assigned specific values to $d_0$ and $\rho$. The orange curve shows the prediction made by the model for this assignment.

We define the function $m: \mathbb{R}^3 \to \mathbb{R}$ to capture the concept of a model prediction. The function $m$ follows directly from Eq. 3:

$$m(y_0, \rho, t) = y_0\, e^{\rho t} \tag{4}$$

Having $m$ is a good start: one of our model parameters, $\rho$, appears explicitly, but the other parameter is $d_0$, while the function uses $y_0$. There must be a way to calculate $y_0$ from $d_0$. What we need for such a calculation is a function that extracts the cumulative cases for a particular day from our data.

We define the function $d: \mathbb{Z}^2 \to \mathbb{R}$ such that $d(k, t)$ is the number of cumulative cases for day $t$ when $k = d_0$. We also define the last element in the data to be at index $t = n - d_0$ for some value $n$ that is constant for all values of $k$.

From this definition, we can obtain a value for $y_0 \in \mathbb{R}$:

$$y_0 = d(d_0, 0) \tag{5}$$

Also note that $d_0 > 0$, and

In Fig. 2 the blue line shows the function $d(20, t)$, and the orange line shows the corresponding model prediction $m(d(20, 0), 0.0592, t)$.

When one uses the parameters to create functions that range over $t$, then $m$ and $d$ become higher-order functions[3]. One can view a model as a family of functions such that each member of the family is generated from a higher-order function. The idea of fitting a model is to choose, from this family, the individual function that fits the data best.
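To make these definitions concrete, here is a minimal sketch in Python. It assumes `cumulative` holds the cleaned (and shifted, as in Fig. 2) cumulative series for the spike; all names below are illustrative rather than taken from the original code.

```python
import numpy as np

def d(k, t, cumulative):
    """Data lookup: cumulative cases on day t, where t = 0 corresponds to
    entry k of the spike data (so the first entry sits at t = -k)."""
    return cumulative[t + k]

def m(y0, rho, t):
    """Model prediction from Eq. 4: y_0 * exp(rho * t)."""
    return y0 * np.exp(rho * t)

def make_prediction(k, rho, cumulative):
    """Higher-order view: fixing the parameters yields a function of t alone."""
    y0 = d(k, 0, cumulative)          # Eq. 5: y_0 = d(d_0, 0)
    return lambda t: m(y0, rho, t)

# For example, the orange curve in Fig. 2 corresponds roughly to:
# curve = make_prediction(20, 0.0592, cumulative)
# values = [curve(t) for t in range(len(cumulative) - 20)]
```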

Deciding which fit is the better one #

Again, using Fig. 2 as an example, it is easy to see that a better fit is found when the orange and the blue lines are closer to each other. The closeness can be measured as the difference between the two functions. If you square this difference, you get rid of the negative values that sometimes occur when functions are subtracted from each other; averaging the squared differences then gives a single number for the overall error.

This is a well-known method to measure all sorts of errors - it is called the mean squared error, or MSE[4] for short. Using MSE, we now define the function $E : \mathbb{R}^2 \to \mathbb{R}$ to determine our error measurement, as

where the definitions of $m$, $d$ and $y_0$ are from the previous section.
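A hedged sketch of this error measurement, assuming the squared differences are averaged over the days from $t = 0$ to the end of the spike data (the exact range used in the article may differ):

```python
import numpy as np

def mse_error(k, rho, cumulative):
    """A sketch of E(d_0, rho): the mean squared difference between the
    model prediction and the data.

    Assumption: the error is averaged over the days from t = 0 up to the
    last day of the spike data."""
    y0 = cumulative[k]                        # y_0 = d(d_0, 0)
    t = np.arange(len(cumulative) - k)        # t = 0, 1, ..., last day
    predicted = y0 * np.exp(rho * t)          # m(y_0, rho, t)
    observed = cumulative[k:]                 # d(d_0, t) for t >= 0
    return np.mean((predicted - observed) ** 2)
```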

The best fit is the parameter pair $(d_0, \rho)$ that has the smallest value for $E(d_0, \rho)$. In principle this is easy to calculate: just go through all 103 days and a range of values for $\rho$, calculate $E$ for each pair, and the smallest number is your answer.

The number of days is finite and (in practice) small enough to step through in a program that automates the required computations. But $\rho$ is a bit more problematic: what range of values should we run over to find the smallest error for this parameter?

An estimate for the growth factor #

For this article, we are interested in a pragmatic solution for an optimal $\rho$, one that will satisfy us that the model works. The road to such a solution starts with getting a rough estimate.

To get this estimate we consider the last point in the data - that is, when $t = n$. We use Eq. 4 to get an expression from the model for that value:

We then consider what happens when we force this last point to be on the model curve. In other words, we equate the data at this point to the right-hand side of Eq. 8:

We can now find an estimated value for $\rho$ from Eq. 9:
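Concretely, the estimate has roughly the following form (a sketch: $y_n$ is our own shorthand, not the article's notation, for the cumulative count at the last data point, assumed to sit at $t = n$):

$$\rho_e = \frac{1}{n}\,\ln\!\left(\frac{y_n}{y_0}\right)$$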

Now take the range $[\frac{\rho_e}{2}; \frac{3\rho_e}{2}]$ and calculate a finite number of equally spaced values in this range. We take as the answer for that day the value of $\rho$ that produced the smallest MSE.

Armed with this $\rho_e$-based range, we can now write a program that steps over all the days and, for each day, over the values in the $\rho$ range, to find the parameter pair with the smallest MSE.
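A direct-search sketch of such a program might look as follows. It reuses the `mse_error` sketch from the earlier listing, and the number of $\rho$ samples (`n_rho`) is an arbitrary choice of ours, not a value from the article.

```python
import numpy as np

def estimate_rho(k, cumulative):
    """Rough estimate rho_e: force the model through the last data point."""
    y0 = cumulative[k]
    days = len(cumulative) - 1 - k            # days of growth after t = 0
    return np.log(cumulative[-1] / y0) / days

def best_fit(cumulative, n_rho=200):
    """Step over every candidate start day d_0 and over n_rho equally spaced
    values of rho in [rho_e / 2, 3 * rho_e / 2], keeping the pair with the
    smallest MSE."""
    best = (None, None, np.inf)               # (d_0, rho, error)
    for k in range(1, len(cumulative) - 1):
        rho_e = estimate_rho(k, cumulative)
        for rho in np.linspace(rho_e / 2, 3 * rho_e / 2, n_rho):
            err = mse_error(k, rho, cumulative)
            if err < best[2]:
                best = (k, rho, err)
    return best

# d0_best, rho_best, err_best = best_fit(cumulative)
```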

A comparison of daily fits #

It is interesting to see what this program spits out as the minimum error it finds for each day. The values can be seen in Fig. 3 below. Note that we show $\ln(E)$, so the growth might look close to linear, but that is only because of the scale.

Fig 3: A plot of the minimum value of $\ln(E)$ for each candidate start day $d_0$.

From the figure, we can see that the minimum is reached for a $d_0$ before day 20. The visible increasing trend in the error measurement for larger values of $d_0$ is expected: as the model curve moves towards the right, more data points will fall below the line.

Conclusion #

The best fit the program found was for $d_0 = 17$ and $\rho = 0.0475$. You can see this optimal curve as the orange line in Fig. 4.

Fig 4: The optimal curve found by the program ($d_0 = 17$, $\rho = 0.0475$), shown as the orange line against the spike data.

Our analysis says that the spike started on 9 October and lasted for 86 days. On that day there were 750 new cases.

There are other strategies to fit a model of this kind, but this direct approach works well. Especially in the latter part of the spike period we see an excellent alignment with the recorded observations.

References #

  1. The source data is provided by the Humanitarian Data Exchange. You can download it yourself from here.
  2. Ordinary differential equations - Wikipedia
  3. Higher-order function - Wikipedia
  4. Mean squared error - Wikipedia