Skip to contents

In the following sections we provide methodological and implementation details for the nowcasting framework implemented in epinowcast. Our approach is an extension of that proposed by Günther et al.[1] which was itself an extension of the model proposed by Höhle and Heiden[2] and implemented in the surveillance R package[3]. Compared to the model proposed in Günther et al.[1], epinowcast adds an optional parametric assumption for the underlying delay from occurrence to report, support for jointly nowcasting multiple related datasets, a flexible formula interface allowing for the specification of a large range of models, and an efficient implementation in stan which makes use of sparse design matrices and within chain parallelisation to reduce runtimes.[4,5]

Decomposition into expected final notifications and report delay components

We follow the approach of Höhle and Heiden[2] and consider the distribution of notifications (\(n_{gtd}\)) by time of occurrence (\(t\)) and reporting delay (\(d\)) conditional on the final observed count \(N_{gt}\) for each dataset (\(g\)) such that,

\[\begin{equation} N_{gt} = \sum_{d=0}^{D} n_{gtd} \end{equation}\]

where \(D\) represents the maximum delay between time of occurrence and time of report which in theory could be infinite but in practice we set to a value in order to make the model identifiable and computationally feasible. This formulation means that \(n_{gtd} \mid N_{gt}\) is multinomial with a probability vector (\(p_{gtd}\)) of length \(D\) for each \(t\) and \(g\) that needs to be estimated at the same time as estimating the expected number of final notifications \(\mathbb{E}[N_{gt}] = \lambda_{gt}\).

An alternative approach, not explored here, is to consider each \(n_{gtd}\) independently at which point the model can be defined as a standard regression with the appropriate observation model and adjustment for reporting delay (i.e it becomes a Poisson or Negative Binomial regression)[6]. An implementation of this approach is available in Bastos et al.[6]. More work needs to be done to evaluate which of these approaches produces more accurate nowcasts for epidemiological count data.

Expected final notifications

Here we follow the approach of Günther et al.[1] and specify the model for expected final notifications as a first order random walk. This model can in principle be any model such as a more complex time-series approach, a gaussian process, or a mechanistic or semi-mechanistic compartmental model. Extending the flexibility of this model is an area of further work as is evaluating the benefits and tradeoffs of more complex approaches.

\[\begin{align} \log (\lambda_{gt}) &\sim \text{Normal}\left(\log (\lambda_{gt-1}) , \sigma^{\lambda}_{g} \right) \\ \log (\lambda_{g0}) &\sim \text{Normal}\left(\log (N_{g0}), 1 \right) \\ \sigma^{\lambda}_{g} &\sim \text{Half-Normal}\left(0, 1\right) \end{align}\]

Delay distribution

Again following the approach of Günther et al.[1] we define the delay distribution (\(p_{gtd}\)) as a discrete time hazard model (\(h_{gtd} =\text{P} \left(\text{delay}=d|\text{delay} \geq d, W_{gtd}\right)\)) but we extend this model to decompose \(W_{gtd}\) into 3 components: hazard derived from a parametric delay distribution (\(\gamma_{gtd}\)) dependent on covariates at the date of occurrence, hazard not derived from a parametric distribution (\(\delta_{gtd}\)) dependent on covariates at the date of occurrence, and hazard dependent on covariates referenced to the date of report (\(\epsilon_{gtd}\)).

For first component (\(\gamma_{gtd}\)) we assume that the probability of reporting \(p^{\prime}_{gtd}\) on a given date given follow a parametric distribution (in the baseline case a discretised log normal distribution) with the log mean and log standard deviation being defined using an intercept and arbitrary shared, reference date indexed, covariates with fixed (\(\alpha_{i}\)) and random (\(\beta_{i}\)) coefficients,

\[\begin{align} p^{\prime}_{gtd} &\sim \text{LogNormal} \left(\mu_{gt}, \upsilon_{gt} \right) \\ \mu_{gt} &= \mu_0 + \alpha_{\mu} X_{\gamma} + \beta_{\mu} Z_{\gamma} \\ \upsilon_{gt} &= \text{exp} \left( \upsilon_0 + \alpha_{\upsilon} X_{\gamma} + \beta_{\upsilon} Z_{\gamma} \right) \end{align}\]

The parametric logit hazard (probability of report on a given date conditional on not already having reported) for this component of the model is then,

\[\begin{equation} \gamma_{gtd} = \text{logit} \left(\frac{p^{\prime}_{gtd}}{\left(1 -\sum^{d-1}_{d^{\prime}=0} p^{\prime}_{gtd^{\prime}} \right)} \right) \end{equation}\]

The non-distributional logit hazarad components for the date of occurrence and report are then again defined using an intercept and arbitrary shared covariates with fixed (\(\alpha_{i}\)) and random (\(\beta_{i}\)) coefficients.

\[\begin{align} \delta_{gtd} &= \mu_0 + \alpha_{\delta} X_{\delta} + \beta_{\delta} Z_{\delta} \\ \epsilon_{gtd} &= \epsilon_0 + \alpha_{\epsilon} X_{\epsilon} + \beta_{\epsilon} Z_{\epsilon} \end{align}\]

The overall hazard for each group, occurrence time, and delay is then,

\[\begin{equation} \text{logit} (h_{gtd}) = \gamma_{gtd} + \delta_{gtd} + \upsilon_{gtd},\ h_{gtD} = 1 \end{equation}\]

where the hazard on the final day has been assumed to be 1 in order to enforce the constraint that all reported observations are reported within the specified maximum delay. The probability of report for a given delay, occurrence date, and group is then as follows,

\[\begin{equation} p_{gt0} = h_{gt0},\ p_{gtd} = \left(1 -\sum^{d-1}_{d^{\prime}=0} p_{gtd^{\prime}} \right) \times h_{gtd} \end{equation}\]

All (\(\alpha_{i}\)) and random (\(\beta_{i}\)) coefficients have standard normal priors by default with standard half-normal priors for pooled standard deviations.

Observation model and nowcast

Expected notifications by time of occurrence (\(t\)) and reporting delay can now be found by multiplying expected final notifications for each \(t\) with the probability of reporting for each day of delay (\(p_{gtd}\)). We then assume a negative binomial observation model with a joint overdispersion parameter (with a 1 over square root standard half normal prior) and produce a nowcast of final observed notifications at each occurrence time by summing posterior estimates for each observed notification for that occurrence time.

\[\begin{align} n_{gtd} \mid \lambda_{gt},p_{gtd} &\sim \text{NB}(\lambda_{gt} \times p_{gtd}, \phi),\ t=1,...,T. \\ \frac{1}{\phi^2} &\sim \text{Half-Normal}(0, 1) \\ N_{gt} &= \sum_{d=0}^{D} n_{gtd} \end{align}\]

Implementation

The model is implemented in stan using cmdstanr with no defaults altered[4,5]. Optional within chain parallelisation is available across dates of occurrence to reduce runtimes. Sparse design matrices have been used for all covariates to limit the number of probability mass functions that need to be calculated. epinowcast incorporates additional functionality written in R[7] to enable plotting nowcasts and posterior predictions, summarising nowcasts, and scoring them using scoringutils[8]. All functionality is modular allowing users to extend and alter the underlying model whilst continuing to use the package framework.

References

1. Günther, F., Bender, A., Katz, K., Küchenhoff, H., & Höhle, M. (2021). Nowcasting the COVID-19 pandemic in Bavaria. Biometrical Journal, 63(3), 490–502. https://doi.org/10.1002/bimj.202000112

2. Höhle, M., & Heiden, M. an der. (2014). Bayesian nowcasting during the STEC O104:H4 outbreak in Germany, 2011. Biometrics, 70(4), 993–1002. https://doi.org/10.1111/biom.12194

3. Meyer, S., Held, L., & Höhle, M. (2017). Spatio-temporal analysis of epidemic phenomena using the R package surveillance. Journal of Statistical Software, 77(11), 1–55. https://doi.org/10.18637/jss.v077.i11

4. Team, S. D. (2021). Stan modeling language users guide and reference manual, 2.28.1.

5. Gabry, J., & Češnovar, R. (2021). Cmdstanr: R interface to ’cmdstan’.

6. Bastos, L. S., Economou, T., Gomes, M. F. C., Villela, D. A. M., Coelho, F. C., Cruz, O. G., Stoner, O., Bailey, T., & Codeço, C. T. (2019). A modelling approach for correcting reporting delays in disease surveillance data. Statistics in Medicine, 38(22), 4363–4377. https://doi.org/10.1002/sim.8303

7. R Core Team. (2019). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/

8. Bosse, N. (2020). Scoringutils: A collection of proper scoring rules and metrics to assess predictions. https://github.com/epiforecasts/scoringutils