Model definition: estimate_infections()
Source:vignettes/estimate_infections.Rmd
estimate_infections.Rmd
Infection model
estimate_infections()
supports a range of model
formulations. Here we describe the most commonly used and highlight
other options. The two main models for how new infections arise in the
model are a renewal equation model and a non-mechanistic
infection model. The initialisation of both of these models
involves estimating an initial infection trajectory during a
seeding_time
(set to the mean of modelled delays from infection to observation) that
precedes the first observation at time
.
Renewal equation model
This is the default model and is used when rt != NULL
.
New infections are generated at discrete time steps of one day via the
renewal equation[1]. These
infections are then mapped to observations via discrete convolutions
with delay distributions.
Initialisation
The model is initialised before the first observed data point by
assuming constant exponential growth for the mean of modelled delays
from infection to case report (called seeding_time
in the model):
where is the number of latent infections on day , is the estimate of the initial growth rate, and and are estimated from the first week of observed data: as the mean of reported cases in the first 7 days (or the mean of all cases if fewer than 7 days of data are given), divided by the prior mean reporting fraction if less than 1 (see Delays and scaling); and as the point estimate from fitting a linear regression model to the first 7 days of data (or all data if fewer than 7 days of data are given),
where is the number of reported cases on day , an estimated intercept, and the error term.
Infections
For the time window of the observed data and beyond infections are then modelled by weighting previous infections with the generation time and scaling by the instantaneous reproduction number:
where is the distribution of generation times (with discretised gamma or discretised log normal distributions available as options) with mean (or log mean in the case of lognormal distributions) , standard deviation (or log standard deviation in the case of lognormal distributions) and maximum . Generation times can either be specified as coming from a distribution with uncertainty by giving mean and standard deviations of normal priors, weighted by default by the number of observations (although this can be changed by the user) and truncated to be positive where relevant for the given distribution; or they can be specified as the parameters of a fixed distribution, or as fixed values.
The distribution of generation times here represents the probability that somebody who became infectious on day 0 and who infects someone else during their course of infection does so on day , assuming that infection cannot happen on day 0. If not given this defaults to a fixed generation time of 1, in which case represents the exponential of the daily growth rate of infections.
Time-varying reproduction number
Different options are available for setting a prior for , the instantaneous reproduction number at time . The default prior is an approximate zero-mean Gaussian Process (GP) for the first differences in time on the log scale,
More details on the mathematical form of the GP approximation and implementation are given in the Gaussian Process implementation details vignette. Other choices for the prior of are available such as a GP prior for the difference between and its mean value (implying that in the absence of data will revert to its prior mean rather than the last value with robust support from the data).
or, as a specific case of a Gaussian Process, a random walk of arbitrary length .
where indicates interval-valued division (i.e. the floor of the division), such that for example indicates a daily and a weekly random walk.
The choice of prior for the time-varying reproduction number impact run-time, smoothness of the estimates and real-time behaviour and may alter the best use-case for the model.
The initial reproduction number has a log-normal prior with a given log mean and log standard deviation , calculated from a given mean (default: 1) and standard deviation (default: 1).
The simplest possible process model option is to use no time-varying prior and rely on just the intial fixed reproduction number .
Beyond the end of the observation period
Beyond the end of the observation period (), by default, the Gaussian process is assumed to continue. However, here again there are a range of options. These included fixing transmission dynamics (optionally this can also be done before the end of the observation period), and scaling based on the proportion of the population that remain susceptible. This is defined as followed,
where
are cumulative infections by
and
are the unadjusted infections defined above. This adjustment is based on
the one implemented in the epidemia
R package[2].
Non-Mechanistic infection model
This is an alternative model that can be used by setting
rt = NULL
that assumes less epidemiological mechanism by
directly modelling infections on a log scale with a range of process
models. By default, this uses a Gaussian Process prior for the number of
new infections each day (on the log scale) although alternatively
infections can be estimated using a prior based on a fixed backwards
mapping of observed cases. In general, these model options will be more
computationally efficient than the renewal process model but may be less
robust due to the lack of an epidemiological process model (i.e. more
dependence is placed on the assumptions of the Gaussian process
prior).
Initialisation
In order to initialise this model, an initial estimate of the infection trajectory is first created by first shifting observations back in time by and then smoothing the observation data with a moving average of window size (default: ), allocated to the centre of the window:
where is the day of the last observation, is rounded up to the nearest integer in the limits of the sum, and . Any date with cases following this procedure is then allocated 1 case to facilitate further processing in the model.
For any times the number of infections is then estimated by fitting an exponential curve to the final week of data and extrapolating this until the end of the forecast horizon.
Infections
By default, a Gaussian Process prior is used for the number of infections, resulting in smoother estimates of the infection curve. In this case, as in the renewal equation model there are two alternative formulations available. The default uses an approximate zero-mean GP for the differences between modelled infections and the initial estimate,
Alternatively, one can use is an approximate zero-mean Gaussian Process (GP) for the first differences in time on the log scale,
with
More details on the mathematical form of the Gaussian process approximation are given in the Gaussian Process implementation details vignette.
As for the renewal equation model, the Gaussian process can be replaced by a random walk of arbitrary length .
When using a fixed shift from infections to reported cases there is no process model and so is used as the estimated infection curve (potentially scaled to take into account underreporting, see section Delays and scaling).
Delays and scaling
If infections are observed with a delay (for example, the incubation period if based on symptomatic cases, and any delay from onset to report), they are convolved in the model to infections at the time scale of observations using delay distributions (with lognormal and gamma parameterisations available) , scaled by an underreporting factor (which is 1 if all infections are observed). This model can be defined mathematically as follows,
where is the combined discrete distribution of delays (with discretised gamma or discretised log normal distributions available as options) with mean (or log mean in the case of lognormal distributions) , standard deviation (or log standard deviation in the case of lognormal distributions) and maximum .
Delays can either be specified as coming from a distribution with uncertainty by giving mean and standard deviations of normal priors, weighted by the number of observations and truncated to be positive where relevant for the given distribution; or they can be specified as the parameters of a fixed distribution, or as fixed values.
The scaling factor represents the proportion of cases that are ultimately reported, which by default is set to 1 (i.e. no underreporting) but can instead be set to come from a normal prior with given mean and standard deviation, truncated to be between 0 and 1.
Observation model
The modelled counts are related to observations . By default this is assumed to follow a negative binomial distribution with overdispersion (alternatively it can be modelled as a Poisson, in which case is not used):
where is a daily reporting effect of cyclicity . If this corresponds to a day-of-the-week reporting effect.
This model uses the following priors for the observation model,
Truncation
The model supports counts that are right-truncated, i.e. reported with a delay leading to recent counts being subject to future upwards revision. Denoting the final truncated counts with they are obtained form the final modelled cases by applying a given discrete truncation distribution with cumulative mass function :
If truncation is applied, the modelled cases are replaced by the truncated counts before confronting them with observations as described above.