This is our evaluation of the forecasting models and individual forecasters that have contributed forecasts of Covid-19 case and death numbers in Germany and Poland. These forecasts were submitted to the German and Polish Forecast Hub each week. Currently, we submit an unweighted mean ensemble to the Forecast Hub, but this is likely to change, and we will transition to submitting forecasts to the European Forecast Hub. The evaluations are our own and are not authorised by the German Forecast Hub team. We cannot rule out mistakes, and the analyses are subject to change.
Note that forecasts have now shifted to the European Forecast Hub. You can find our evaluation for these forecasts here.
If you have questions or want to give feedback, please create an issue on our GitHub repository.
You can register to become a forecaster here.
Forecaster ranking
Here is an overall ranking of all forecasters. The ranking is made according to relative skill. Relative skill is calculated by looking at all pairwise comparisons between forecasters in terms of the weighted interval score (WIS). See below for a more detailed explanation of the scoring metrics used. ‘Overall’ shows the complete ranking, ‘latest’ only spans the last 5-6 weeks of data. ‘Detailed’ represents the full data set that you can download for your own analysis.
The following metrics are used:
- Relative skill is a metric based on the weighted interval score (WIS) that uses a ‘pairwise comparison tournament’. All pairs of forecasters are compared against each other in terms of the weighted interval score. For each pair, the mean score of both models is calculated on the set of common targets for which both models have made a prediction, and the two mean scores are divided to obtain a mean score ratio. The relative skill of a model is the geometric mean of its mean score ratios. Smaller values are better, and a value smaller than one means that the model beats the average forecasting model.
- The weighted interval score is a proper scoring rule (meaning you can’t cheat it) suited to scoring forecasts in an interval format. It has three components: sharpness, underprediction and overprediction. Sharpness is the width of your prediction interval. Over- and underprediction only come into play if the prediction interval does not cover the true value. They are the absolute value of the difference between the true value and the lower bound (if your forecast is too high) or the upper bound (if your forecast is too low) of your prediction interval.
- coverage deviation is the average difference between empirical and nominal interval coverage. Say your 50 percent prediction interval covers only 20 percent of all true values, then your coverage deviation is 0.2 - 0.5 = -0.3. The coverage deviation value in the table is calculated by averaging over the coverage deviation calculated for all possible prediction intervals. If the value is negative, your intervals have covered fewer true values than they should. If it is positive, then your forecasts could be a little more confident.
- bias is a measure between -1 and 1 that expresses your tendency to underpredict (-1) or overpredict (1). In contrast to the over- and underprediction components of the WIS, it is bounded between -1 and 1 and cannot go to infinity. It is therefore less susceptible to outliers.
- aem is the absolute error of your median forecasts. A high aem means your median forecasts tend to be far away from the true values.
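The relative skill and coverage deviation metrics described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used to produce the tables on this page. The function names and input layouts are hypothetical, the geometric mean here leaves out a forecaster's comparison with itself, all mean scores are assumed to be strictly positive, and coverage deviation is taken as empirical minus nominal coverage.

```python
from itertools import combinations
from math import exp, log
from statistics import mean


def relative_skill(scores):
    """Pairwise comparison tournament (illustrative sketch).

    scores: dict mapping forecaster name -> dict mapping target -> WIS.
    Returns a dict mapping forecaster name -> relative skill.
    """
    names = sorted(scores)
    ratios = {name: [] for name in names}
    for a, b in combinations(names, 2):
        # Only targets that both forecasters predicted are compared.
        common = scores[a].keys() & scores[b].keys()
        if not common:
            continue  # no shared targets, so this pair cannot be compared
        mean_a = mean(scores[a][t] for t in common)
        mean_b = mean(scores[b][t] for t in common)
        ratios[a].append(mean_a / mean_b)
        ratios[b].append(mean_b / mean_a)
    # Geometric mean of each forecaster's mean score ratios.
    return {
        name: exp(mean(log(r) for r in rs))
        for name, rs in ratios.items()
        if rs
    }


def coverage_deviation(forecasts):
    """Average of (empirical - nominal) coverage over interval levels.

    forecasts: list of (nominal_coverage, lower, upper, true_value) tuples,
    with nominal_coverage given as a fraction such as 0.5.
    """
    hits_by_level = {}
    for nominal, lower, upper, truth in forecasts:
        covered = 1.0 if lower <= truth <= upper else 0.0
        hits_by_level.setdefault(nominal, []).append(covered)
    return mean(mean(hits) - nominal for nominal, hits in hits_by_level.items())


# Made-up scores: B is twice as bad as A on every shared target.
example_scores = {
    "A": {"t1": 2.0, "t2": 4.0},
    "B": {"t1": 4.0, "t2": 8.0},
    "C": {"t2": 4.0},
}
example_skill = relative_skill(example_scores)

# A 50 percent interval that covers only one of five true values.
example_intervals = [(0.5, 10, 20, y) for y in (15, 25, 30, 5, 40)]
example_deviation = coverage_deviation(example_intervals)
```

In this toy example a forecaster with a relative skill below one scored better than its opponents on average, and the negative coverage deviation signals intervals that are too narrow.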
Forecast visualisation
This is a visualisation of all forecasts made so far.
[Interactive forecast plots: one tab per weekly forecast date from 2020-10-12 to 2021-04-05, each with Germany and Poland panels for case and death forecasts.]
Ranks over time
This table shows you either your rank among all forecasters or the standardised rank. The standardised rank is computed as 100 minus the forecaster's percentile rank among all forecasters for a given target and forecast date. In practice, every forecaster is assigned a rank (1 is the best, and the worst equals the number of available forecasts for that date). This rank is then transformed to a scale from 0 to 100 such that 100 is best and 0 is worst. Ranks are determined based on the weighted interval scores.
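As a rough illustration of the rank transformation, the sketch below ranks forecasters by WIS for a single target and forecast date and maps the ranks linearly onto a scale where 100 is best and 0 is worst. The exact percentile formula used for the table is not spelled out here, so the linear mapping, the function name and the handling of ties (none) are assumptions.

```python
def standardised_ranks(wis_by_forecaster):
    """Rank forecasters by WIS (lower is better) and rescale the ranks.

    wis_by_forecaster: dict mapping forecaster name -> WIS for one target
    and forecast date. Returns name -> (rank, standardised_rank).
    Assumption: ties are not treated specially in this sketch.
    """
    ordered = sorted(wis_by_forecaster, key=wis_by_forecaster.get)
    n = len(ordered)
    result = {}
    for rank, name in enumerate(ordered, start=1):
        # Linear map: rank 1 -> 100, rank n -> 0 (a lone forecaster gets 100).
        standardised = 100.0 if n == 1 else 100.0 * (n - rank) / (n - 1)
        result[name] = (rank, standardised)
    return result


# Made-up scores for three forecasters.
example_ranks = standardised_ranks({"A": 10.0, "B": 30.0, "C": 20.0})
```

Standardising in this way makes ranks comparable across dates with different numbers of submitted forecasts.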
model rank
standardised model rank
Weighted Interval Score Decomposition
The weighted interval score can be decomposed into three parts: sharpness (the amount of uncertainty around the forecast), overprediction and underprediction. This visualisation gives an impression of the distribution between these three forms of penalties for the different forecasters.
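The decomposition can be made concrete with a small sketch. It follows the commonly used definition of the WIS for a median plus a set of central prediction intervals, where an interval with nominal coverage 1 - alpha gets weight alpha / 2 and the median gets weight 1/2. Whether the plots on this page use exactly this weighting is an assumption, and the function name and input layout are hypothetical.

```python
def wis_components(truth, median, intervals):
    """Split the weighted interval score into its three penalties.

    intervals: list of (alpha, lower, upper) tuples, each describing a
    central prediction interval with nominal coverage 1 - alpha.
    """
    sharpness = 0.0
    # The median's absolute error counts as over- or underprediction.
    over = 0.5 * max(median - truth, 0.0)
    under = 0.5 * max(truth - median, 0.0)
    for alpha, lower, upper in intervals:
        weight = alpha / 2
        sharpness += weight * (upper - lower)
        # Penalties apply only when the interval misses the true value.
        over += weight * (2 / alpha) * max(lower - truth, 0.0)
        under += weight * (2 / alpha) * max(truth - upper, 0.0)
    norm = len(intervals) + 0.5
    return {
        "sharpness": sharpness / norm,
        "overprediction": over / norm,
        "underprediction": under / norm,
        "wis": (sharpness + over + under) / norm,
    }


# Made-up forecast that is too high: the 50 percent interval [110, 130]
# lies entirely above the true value of 100.
example_wis = wis_components(truth=100, median=120, intervals=[(0.5, 110, 130)])
```

Because the three components add up to the total score, a stacked bar per forecaster shows directly whether a poor score comes from wide intervals or from forecasts that were systematically too high or too low.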
Models and available forecasts
The following graphic gives an overview of the forecasters and models analysed and the number of forecasts they contributed.
Most of the ‘models’ are human forecasters, but some are not:
- EpiExpert-ensemble is the ensemble that is formed as the mean of all human forecasts submitted
- EpiNow2 is an exponential growth model that uses a time-varying Rt trajectory to predict latent infections, and then convolves these infections with estimated delays to observations, via a negative binomial model coupled with a day of the week effect. It makes limited assumptions and is not tuned to the specifics of Covid in Germany and Poland beyond epidemiological details such as literature estimates of the generation time, incubation period and the population of each area. The method and underlying theory are under active development with more details available [here](https://epiforecasts.io/covid/methods).
- EpiNow2-secondary is an EpiNow2 model that infers deaths from cases by convolving case numbers with a delay distribution
- Crowd-Rt-Forecast is an ensemble that is constructed by taking human crowd forecasts of Rt and calculating implied case numbers from that using EpiNow2
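The idea of inferring deaths from cases by convolving them with a delay distribution can be illustrated with a deliberately simplified sketch. This is not the EpiNow2 implementation, which estimates the delay and scaling within a statistical model. The function name, the delay distribution and the scaling factor below are made up for illustration.

```python
def convolve_with_delay(cases, delay_pmf, scaling=1.0):
    """Expected secondary observations (e.g. deaths) from primary ones.

    cases: list of case counts per time step.
    delay_pmf: delay_pmf[d] is the probability that the secondary event
        is observed d time steps after the case (hypothetical values).
    scaling: fraction of cases that lead to the secondary event
        (hypothetical value).
    """
    expected = []
    for t in range(len(cases)):
        total = 0.0
        for delay, prob in enumerate(delay_pmf):
            if t - delay >= 0:
                # Cases from `delay` steps ago contribute with weight `prob`.
                total += cases[t - delay] * prob
        expected.append(scaling * total)
    return expected


# Made-up inputs: deaths follow cases after 1 or 2 steps with equal
# probability, and 1 percent of cases result in a death.
example_deaths = convolve_with_delay([100, 200, 300], [0.0, 0.5, 0.5], scaling=0.01)
```

Early time steps have fewer past cases to draw on, so the first expected values are truncated and should be read with care.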