
The scoringutils package provides a collection of metrics and proper scoring rules that make it simple to score probabilistic forecasts against observed values. You can find more information in the paper Evaluating Forecasts with scoringutils in R as well as the Metrics-Vignette and the Scoring forecasts directly Vignette.

The scoringutils package offers convenient automated forecast evaluation in a data.table format (using the function score()), but also provides experienced users with a set of reliable lower-level scoring metrics operating on vectors/matrices that they can build upon in other applications. In addition, it implements a wide range of flexible plots covering many use cases.

The goal of this package is to provide a tested and reliable collection of metrics that can be used for scoring probabilistic forecasts (forecasts with a full predictive distribution, rather than point forecasts). It has a much stronger focus on convenience than e.g. the scoringRules package, which provides a comprehensive collection of proper scoring rules (also used in scoringutils). In contrast to other packages, scoringutils offers functionality to automatically evaluate forecasts, to visualise scores and to obtain relative scores between models.

Predictions can be handled in various formats: scoringutils can handle probabilistic forecasts in either a sample based or a quantile based format. For more detail on the expected input formats please see below. True values can be integer, continuous or binary.

Input formats

Most of the time, the score() function will be able to do the entire evaluation for you. All you need to do is to pass in a data.frame with the appropriate columns. Which columns are required depends on the format the forecasts come in. The forecast format can either be based on quantiles (see example_quantile for the expected format), based on predictive samples (see example_sample_continuous and example_sample_discrete for the expected format in each case) or in a binary format. The following table gives an overview (pairwise comparisons will be explained in more detail below):

Format                 Required columns
quantile-based         ‘observed’, ‘predicted’, ‘quantile_level’
sample-based           ‘observed’, ‘predicted’, ‘sample_id’
binary                 ‘observed’, ‘predicted’
pairwise-comparisons   additionally a column ‘model’
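
As a toy illustration (hypothetical values; real data will usually carry further columns such as model, location or date that identify each forecast), a minimal quantile-based input could be constructed by hand like this:

toy_quantile <- data.frame(
  observed = 27,                        # observed value, recycled across quantile levels
  predicted = c(20, 26, 33),            # predicted quantiles
  quantile_level = c(0.25, 0.5, 0.75)   # quantile levels of the predictions
)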

Additional columns may be present to indicate a grouping of forecasts. For example, we could have forecasts made by different models in various locations at different time points, each for several weeks into the future. scoringutils automatically tries to determine the unit of a single forecast, i.e. the combination of existing columns that are able to uniquely identify a single forecast. It uses all existing columns for this, which can sometimes lead to issues. We therefore recommend using the function set_forecast_unit() to determine the forecast unit manually. The function simply drops unneeded columns, while making sure that some necessary, ‘protected columns’ like “predicted” or “observed” are retained.

colnames(example_quantile)
#>  [1] "location"        "target_end_date" "target_type"     "observed"       
#>  [5] "location_name"   "forecast_date"   "quantile_level"  "predicted"      
#>  [9] "model"           "horizon"

set_forecast_unit(
  example_quantile,
  c("location", "target_end_date", "target_type", "horizon", "model")
) %>%
  colnames()
#> [1] "observed"        "quantile_level"  "predicted"       "location"       
#> [5] "target_end_date" "target_type"     "horizon"         "model"

Constructing a forecast object

The function as_forecast() is used to construct a forecast object and check the input data. It determines the forecast type, creates an object of the appropriate class (forecast_binary, forecast_quantile or forecast_sample), and validates the input data. Objects of class forecast_* have a print method that provides additional information.

head(example_quantile)
#> Key: <location, target_end_date, target_type>
#>    location target_end_date target_type observed location_name forecast_date
#>      <char>          <Date>      <char>    <num>        <char>        <Date>
#> 1:       DE      2021-01-02       Cases   127300       Germany          <NA>
#> 2:       DE      2021-01-02      Deaths     4534       Germany          <NA>
#> 3:       DE      2021-01-09       Cases   154922       Germany          <NA>
#> 4:       DE      2021-01-09      Deaths     6117       Germany          <NA>
#> 5:       DE      2021-01-16       Cases   110183       Germany          <NA>
#> 6:       DE      2021-01-16      Deaths     5867       Germany          <NA>
#>    quantile_level predicted  model horizon
#>             <num>     <int> <char>   <num>
#> 1:             NA        NA   <NA>      NA
#> 2:             NA        NA   <NA>      NA
#> 3:             NA        NA   <NA>      NA
#> 4:             NA        NA   <NA>      NA
#> 5:             NA        NA   <NA>      NA
#> 6:             NA        NA   <NA>      NA
forecast_quantile <- example_quantile %>%
  set_forecast_unit(
    c("model", "location", "target_end_date", "forecast_date",
      "target_type", "horizon")
  ) %>%
  as_forecast()
#>  Some rows containing NA values may be removed. This is fine if not
#>   unexpected.

forecast_quantile
#> Forecast type: quantile
#> Forecast unit:
#> model, location, target_end_date, forecast_date, target_type, and horizon
#> 
#> Key: <location, target_end_date, target_type>
#>        observed quantile_level predicted                model location
#>           <num>          <num>     <int>               <char>   <char>
#>     1:   127300             NA        NA                 <NA>       DE
#>     2:     4534             NA        NA                 <NA>       DE
#>     3:   154922             NA        NA                 <NA>       DE
#>     4:     6117             NA        NA                 <NA>       DE
#>     5:   110183             NA        NA                 <NA>       DE
#>    ---                                                                
#> 20541:       78          0.850       352 epiforecasts-EpiNow2       IT
#> 20542:       78          0.900       397 epiforecasts-EpiNow2       IT
#> 20543:       78          0.950       499 epiforecasts-EpiNow2       IT
#> 20544:       78          0.975       611 epiforecasts-EpiNow2       IT
#> 20545:       78          0.990       719 epiforecasts-EpiNow2       IT
#>        target_end_date forecast_date target_type horizon
#>                 <Date>        <Date>      <char>   <num>
#>     1:      2021-01-02          <NA>       Cases      NA
#>     2:      2021-01-02          <NA>      Deaths      NA
#>     3:      2021-01-09          <NA>       Cases      NA
#>     4:      2021-01-09          <NA>      Deaths      NA
#>     5:      2021-01-16          <NA>       Cases      NA
#>    ---                                                  
#> 20541:      2021-07-24    2021-07-12      Deaths       2
#> 20542:      2021-07-24    2021-07-12      Deaths       2
#> 20543:      2021-07-24    2021-07-12      Deaths       2
#> 20544:      2021-07-24    2021-07-12      Deaths       2
#> 20545:      2021-07-24    2021-07-12      Deaths       2

If you are unsure what your input data should look like, have a look at the example_quantile, example_sample_discrete, example_sample_continuous and example_binary data sets provided in the package.

The output of as_forecast() can later be directly used as input to score(). This is the recommended workflow (however, score() will run as_forecast() internally if this hasn’t happened before).

Note that in the above example some columns contain information that is duplicated with regard to the forecast unit, e.g. “location” and “location_name”, and can be dropped. If we drop essential information, for example the “target_type” column, we get an error informing us that the forecasts are no longer uniquely identified.

example_quantile %>%
  set_forecast_unit(
    c("location", "target_end_date",
      "forecast_date", "model", "horizon")
  ) %>%
  as_forecast()
#> Error: Assertion on 'data' failed: There are instances with more than one forecast for the same target. This can't be right and needs to be resolved. Maybe you need to check the unit of a single forecast and add missing columns? Use the function get_duplicate_forecasts() to identify duplicate rows.

The function get_duplicate_forecasts() may help to investigate the issue. When filtering for only a single quantile of the EuroCOVIDhub-ensemble, we can see that there are indeed two forecasts for every date, location and horizon.

duplicates <- example_quantile %>%
  set_forecast_unit(
    c("location", "target_end_date",
      "forecast_date", "model", "horizon")
  ) %>%
  get_duplicate_forecasts()

duplicates[quantile_level == 0.5 & model == "EuroCOVIDhub-ensemble", ] %>%
  head()
#> Key: <location, target_end_date>
#>    observed quantile_level predicted location target_end_date forecast_date
#>       <num>          <num>     <int>   <char>          <Date>        <Date>
#> 1:   106987            0.5    119258       DE      2021-05-08    2021-05-03
#> 2:     1582            0.5      1568       DE      2021-05-08    2021-05-03
#> 3:    64985            0.5    110716       DE      2021-05-15    2021-05-03
#> 4:    64985            0.5     92649       DE      2021-05-15    2021-05-10
#> 5:     1311            0.5      1521       DE      2021-05-15    2021-05-03
#> 6:     1311            0.5      1422       DE      2021-05-15    2021-05-10
#>                    model horizon
#>                   <char>   <num>
#> 1: EuroCOVIDhub-ensemble       1
#> 2: EuroCOVIDhub-ensemble       1
#> 3: EuroCOVIDhub-ensemble       2
#> 4: EuroCOVIDhub-ensemble       1
#> 5: EuroCOVIDhub-ensemble       2
#> 6: EuroCOVIDhub-ensemble       1

Showing available forecasts

The function get_forecast_counts() may also be helpful to determine where forecasts are available. Using the by argument you can specify the level of summary. For example, to see how many forecasts there are per model and target_type, we can run

get_forecast_counts(forecast_quantile, by = c("model", "target_type"))
#> Key: <model, target_type>
#>                    model target_type count
#>                   <char>      <char> <int>
#> 1: EuroCOVIDhub-baseline       Cases   128
#> 2: EuroCOVIDhub-baseline      Deaths   128
#> 3: EuroCOVIDhub-ensemble       Cases   128
#> 4: EuroCOVIDhub-ensemble      Deaths   128
#> 5:       UMass-MechBayes       Cases     0
#> 6:       UMass-MechBayes      Deaths   128
#> 7:  epiforecasts-EpiNow2       Cases   128
#> 8:  epiforecasts-EpiNow2      Deaths   119

We see that ‘epiforecasts-EpiNow2’ has some missing forecasts for the deaths forecast target and that UMass-MechBayes has no case forecasts.

This information can also be visualised using plot_forecast_counts():

forecast_quantile %>%
  get_forecast_counts(by = c("model", "forecast_date", "target_type")) %>%
  plot_forecast_counts(x = "forecast_date") +
  facet_wrap(~ target_type)

Scoring and summarising forecasts

Forecasts can easily be scored using the score() function.

For clarity, we suggest setting the forecast unit explicitly and calling as_forecast() before scoring.

scores <- forecast_quantile %>%
  score()

print(scores, 2)
#>                      model location target_end_date forecast_date target_type
#>                     <char>   <char>          <Date>        <Date>      <char>
#>   1: EuroCOVIDhub-ensemble       DE      2021-05-08    2021-05-03       Cases
#>   2: EuroCOVIDhub-baseline       DE      2021-05-08    2021-05-03       Cases
#>  ---                                                                         
#> 886:  epiforecasts-EpiNow2       IT      2021-07-24    2021-07-05      Deaths
#> 887:  epiforecasts-EpiNow2       IT      2021-07-24    2021-07-12      Deaths
#>      horizon         wis overprediction underprediction dispersion  bias
#>        <num>       <num>          <num>           <num>      <num> <num>
#>   1:       1  7990.85478    2549.869565               0 5440.98522  0.50
#>   2:       1 16925.04696   15275.826087               0 1649.22087  0.95
#>  ---                                                                    
#> 886:       3    19.76261       5.478261               0   14.28435  0.50
#> 887:       2    66.16174      40.608696               0   25.55304  0.90
#>      interval_coverage_50 interval_coverage_90 interval_coverage_deviation
#>                    <lgcl>               <lgcl>                       <num>
#>   1:                 TRUE                 TRUE                  0.05181818
#>   2:                FALSE                FALSE                 -0.40272727
#>  ---                                                                      
#> 886:                 TRUE                 TRUE                  0.05181818
#> 887:                FALSE                 TRUE                 -0.31181818
#>      ae_median
#>          <num>
#>   1:     12271
#>   2:     25620
#>  ---          
#> 886:        26
#> 887:       108

The function score() returns unsummarised scores, which in most cases is not what the user wants. It returns a single score per forecast (as determined by the forecast unit).

A second function, summarise_scores(), takes care of summarising these scores to the level specified by the user. The by argument defines the level of summary; by default, by = "model" and one score per model is returned. We can also call summarise_scores() multiple times, for example to round the outputs by specifying signif() as a summary function.

scores %>%
  summarise_scores(by = c("model", "target_type")) %>%
  summarise_scores(fun = signif, digits = 2) %>%
  kable()
model target_type wis overprediction underprediction dispersion bias interval_coverage_50 interval_coverage_90 interval_coverage_deviation ae_median
EuroCOVIDhub-ensemble Cases 18000 10000.0 4200.0 3700 -0.0560 0.39 0.80 -0.100 24000
EuroCOVIDhub-ensemble Deaths 41 7.1 4.1 30 0.0730 0.88 1.00 0.200 53
EuroCOVIDhub-baseline Cases 28000 14000.0 10000.0 4100 0.0980 0.33 0.82 -0.120 38000
EuroCOVIDhub-baseline Deaths 160 66.0 2.1 91 0.3400 0.66 1.00 0.120 230
epiforecasts-EpiNow2 Cases 21000 12000.0 3300.0 5700 -0.0790 0.47 0.79 -0.070 28000
epiforecasts-EpiNow2 Deaths 67 19.0 16.0 32 -0.0051 0.42 0.91 -0.045 100
UMass-MechBayes Deaths 53 9.0 17.0 27 -0.0220 0.46 0.88 -0.025 78

Scoring point forecasts

Point forecasts come in a separate input format (see the example_point data set) and can be scored in the same way.

suppressMessages(score(as_forecast(example_point))) %>%
  summarise_scores(by = "model", na.rm = TRUE)
#>                    model    ae_point     se_point       ape
#>                   <char>       <num>        <num>     <num>
#> 1: EuroCOVIDhub-ensemble 12077.10156 1.945118e+09 0.2996696
#> 2: EuroCOVIDhub-baseline 19353.42969 2.883446e+09 0.7383936
#> 3:  epiforecasts-EpiNow2 14521.10526 2.680928e+09 0.3715552
#> 4:       UMass-MechBayes    78.47656 1.170976e+04 0.2823206

Adding empirical coverage

For quantile-based forecasts we are often interested in specific coverage levels, for example what proportion of observed values fell within the 50% or the 90% prediction intervals. We can add this information using the function get_coverage(). This function requires a by argument which defines the level of grouping for which the proportion of observed values covered by certain prediction intervals is computed.
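
As a minimal sketch (output omitted), get_coverage() can be applied to a validated forecast object such as the forecast_quantile object created above:

forecast_quantile %>%
  get_coverage(by = c("model", "target_type"))

Note that interval coverage at the 50% and 90% levels also appears among the default scores shown below.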

score(as_forecast(example_quantile)) %>%
  summarise_scores(by = c("model", "target_type")) %>%
  summarise_scores(fun = signif, digits = 2)
#>  Some rows containing NA values may be removed. This is fine if not
#>   unexpected.
#>                    model   wis overprediction underprediction dispersion
#>                   <char> <num>          <num>           <num>      <num>
#> 1: EuroCOVIDhub-ensemble 18000        10000.0          4200.0       3700
#> 2: EuroCOVIDhub-ensemble    41            7.1             4.1         30
#> 3: EuroCOVIDhub-baseline 28000        14000.0         10000.0       4100
#> 4: EuroCOVIDhub-baseline   160           66.0             2.1         91
#> 5:  epiforecasts-EpiNow2 21000        12000.0          3300.0       5700
#> 6:  epiforecasts-EpiNow2    67           19.0            16.0         32
#> 7:       UMass-MechBayes    53            9.0            17.0         27
#>       bias interval_coverage_50 interval_coverage_90
#>      <num>                <num>                <num>
#> 1: -0.0560                 0.39                 0.80
#> 2:  0.0730                 0.88                 1.00
#> 3:  0.0980                 0.33                 0.82
#> 4:  0.3400                 0.66                 1.00
#> 5: -0.0790                 0.47                 0.79
#> 6: -0.0051                 0.42                 0.91
#> 7: -0.0220                 0.46                 0.88
#>    interval_coverage_deviation ae_median
#>                          <num>     <num>
#> 1:                      -0.100     24000
#> 2:                       0.200        53
#> 3:                      -0.120     38000
#> 4:                       0.120       230
#> 5:                      -0.070     28000
#> 6:                      -0.045       100
#> 7:                      -0.025        78

Adding relative scores

In order to better compare models against each other we can use relative scores, which are computed based on pairwise comparisons (see details below). Relative scores can be added to the evaluation using the function add_relative_skill(). This requires a column called ‘model’ to be present. Pairwise comparisons are computed according to the grouping specified in by: essentially, the data.frame with all scores gets split into different data.frames according to the values specified in by and relative scores are computed for every individual group separately. The baseline argument allows us to specify a baseline that can be used to scale relative scores (all relative scores are divided by the relative score of the baseline model). For example, to obtain relative scores separately for different forecast targets, we can run

score(as_forecast(example_quantile)) %>%
  add_relative_skill(
    by = c("model", "target_type"),
    baseline = "EuroCOVIDhub-ensemble"
  ) %>%
  summarise_scores(by = c("model", "target_type"))
#>  Some rows containing NA values may be removed. This is fine if not
#>   unexpected.
#> Key: <model, target_type>
#>                    model target_type         wis overprediction underprediction
#>                   <char>      <char>       <num>          <num>           <num>
#> 1: EuroCOVIDhub-baseline       Cases 28483.57465   14096.100883    10284.972826
#> 2: EuroCOVIDhub-baseline      Deaths   159.40387      65.899117        2.098505
#> 3: EuroCOVIDhub-ensemble       Cases 17943.82383   10043.121943     4237.177310
#> 4: EuroCOVIDhub-ensemble      Deaths    41.42249       7.138247        4.103261
#> 5:       UMass-MechBayes      Deaths    52.65195       8.978601       16.800951
#> 6:  epiforecasts-EpiNow2       Cases 20831.55662   11906.823030     3260.355639
#> 7:  epiforecasts-EpiNow2      Deaths    66.64282      18.892583       15.893314
#>    dispersion        bias interval_coverage_50 interval_coverage_90
#>         <num>       <num>                <num>                <num>
#> 1: 4102.50094  0.09796875            0.3281250            0.8203125
#> 2:   91.40625  0.33906250            0.6640625            1.0000000
#> 3: 3663.52458 -0.05640625            0.3906250            0.8046875
#> 4:   30.18099  0.07265625            0.8750000            1.0000000
#> 5:   26.87239 -0.02234375            0.4609375            0.8750000
#> 6: 5664.37795 -0.07890625            0.4687500            0.7890625
#> 7:   31.85692 -0.00512605            0.4201681            0.9075630
#>    interval_coverage_deviation   ae_median wis_relative_skill
#>                          <num>       <num>              <num>
#> 1:                 -0.11721591 38473.60156          1.2947445
#> 2:                  0.12142045   233.25781          2.2958723
#> 3:                 -0.10230114 24101.07031          0.8156514
#> 4:                  0.20380682    53.13281          0.5966310
#> 5:                 -0.02488636    78.47656          0.7475873
#> 6:                 -0.06963068 27923.81250          0.9469157
#> 7:                 -0.04520244   104.74790          0.9765276
#>    wis_scaled_relative_skill
#>                        <num>
#> 1:                  1.587375
#> 2:                  3.848060
#> 3:                  1.000000
#> 4:                  1.000000
#> 5:                  1.253014
#> 6:                  1.160932
#> 7:                  1.636736

Visualising scores

Score heatmap

We can also summarise one particular metric across different categories using a simple heatmap:

score(as_forecast(example_sample_continuous)) %>%
  summarise_scores(by = c("model", "location", "target_type")) %>%
  plot_heatmap(x = "location", metric = "bias") +
  facet_wrap(~ target_type)
#>  Some rows containing NA values may be removed. This is fine if not
#>   unexpected.

Weighted interval score components

The weighted interval score can be split up into three components: over-prediction, under-prediction and dispersion. These can be visualised separately in the following way:

score(as_forecast(example_quantile)) %>%
  summarise_scores(by = c("target_type", "model")) %>%
  plot_wis() +
  facet_wrap(~ target_type, scales = "free")
#>  Some rows containing NA values may be removed. This is fine if not
#>   unexpected.

Calibration

Calibration is a measure of statistical consistency between the forecasts and observed values. The most common way of assessing calibration (more precisely: probabilistic calibration) is the PIT histogram. The probability integral transform (PIT) is equal to the cumulative distribution function of a forecast evaluated at the observed value. Ideally, PIT values should be uniformly distributed.
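
As an illustration of the underlying idea (a toy sketch, not package code; for discrete forecasts a randomised PIT is typically used instead):

observed <- 15
samples <- rnorm(1000, mean = 12, sd = 4)  # predictive samples from a hypothetical forecast
pit_value <- mean(samples <= observed)     # empirical CDF of the forecast evaluated at the observation
pit_value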

We can compute PIT values like this:

example_sample_continuous %>%
  as_forecast() %>%
  get_pit(by = "model")
#>  Some rows containing NA values may be removed. This is fine if not
#>   unexpected.
#>                      model pit_value
#>                     <char>     <num>
#>   1: EuroCOVIDhub-baseline     0.025
#>   2: EuroCOVIDhub-baseline     0.525
#>   3: EuroCOVIDhub-baseline     0.000
#>   4: EuroCOVIDhub-baseline     0.000
#>   5: EuroCOVIDhub-baseline     0.200
#>  ---                                
#> 883:       UMass-MechBayes     0.950
#> 884:       UMass-MechBayes     0.500
#> 885:       UMass-MechBayes     0.100
#> 886:       UMass-MechBayes     0.450
#> 887:       UMass-MechBayes     0.100

And visualise the results like this:

example_sample_continuous %>%
  as_forecast() %>%
  get_pit(by = c("model", "target_type")) %>%
  plot_pit() +
  facet_grid(model ~ target_type)
#>  Some rows containing NA values may be removed. This is fine if not
#>   unexpected.

Similarly for quantile-based forecasts:

example_quantile[quantile_level %in% seq(0.1, 0.9, 0.1), ] %>%
  as_forecast() %>%
  get_pit(by = c("model", "target_type")) %>%
  plot_pit() +
  facet_grid(model ~ target_type)

Pairwise comparisons

Relative scores for different models can be computed using pairwise comparisons, a sort of pairwise tournament where all combinations of two models are compared against each other based on the overlapping set of available forecasts common to both models. Internally, a ratio of the mean scores of both models is computed. The relative score of a model is then the geometric mean of all mean score ratios which involve that model. When a baseline is provided, then that baseline is excluded from the relative scores for individual models (which therefore differ slightly from relative scores without a baseline) and all relative scores are scaled by (i.e. divided by) the relative score of the baseline model.
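
To illustrate the core computation with toy numbers (a hand-rolled sketch only; get_pairwise_comparisons() additionally restricts each comparison to the forecasts available for both models and computes p-values):

# hypothetical mean scores of three models, all evaluated on the same set of forecasts
mean_scores <- c(A = 10, B = 20, C = 40)
models <- names(mean_scores)

# mean score ratios for every ordered pair of models, e.g. ratio_AB = mean_score_A / mean_score_B
ratios <- outer(mean_scores, mean_scores, "/")

# relative score of a model: geometric mean of its ratios against all other models
relative_skill <- sapply(models, function(m) exp(mean(log(ratios[m, setdiff(models, m)]))))
relative_skill

# with a baseline, all relative scores are additionally divided by the baseline's relative score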

In scoringutils, pairwise comparisons can be made in two ways: through the standalone function get_pairwise_comparisons(), or with add_relative_skill(), which simply adds relative skill scores to an existing set of scores.

forecast_quantile %>%
  score() %>%
  get_pairwise_comparisons(by = "model", baseline = "EuroCOVIDhub-baseline")
#>                     model       compare_against mean_scores_ratio         pval
#>                    <char>                <char>             <num>        <num>
#>  1: EuroCOVIDhub-baseline  epiforecasts-EpiNow2         1.3703452 9.164893e-18
#>  2: EuroCOVIDhub-baseline       UMass-MechBayes         3.0275019 2.627464e-20
#>  3: EuroCOVIDhub-baseline EuroCOVIDhub-ensemble         1.5925819 2.608666e-32
#>  4: EuroCOVIDhub-baseline EuroCOVIDhub-baseline         1.0000000 1.000000e+00
#>  5: EuroCOVIDhub-ensemble EuroCOVIDhub-baseline         0.6279112 2.608666e-32
#>  6: EuroCOVIDhub-ensemble       UMass-MechBayes         0.7867229 1.244731e-04
#>  7: EuroCOVIDhub-ensemble  epiforecasts-EpiNow2         0.8606607 1.881520e-02
#>  8: EuroCOVIDhub-ensemble EuroCOVIDhub-ensemble         1.0000000 1.000000e+00
#>  9:       UMass-MechBayes EuroCOVIDhub-ensemble         1.2710955 1.244731e-04
#> 10:       UMass-MechBayes  epiforecasts-EpiNow2         0.7439673 7.253878e-03
#> 11:       UMass-MechBayes EuroCOVIDhub-baseline         0.3303053 2.627464e-20
#> 12:       UMass-MechBayes       UMass-MechBayes         1.0000000 1.000000e+00
#> 13:  epiforecasts-EpiNow2       UMass-MechBayes         1.3441452 7.253878e-03
#> 14:  epiforecasts-EpiNow2 EuroCOVIDhub-ensemble         1.1618981 1.881520e-02
#> 15:  epiforecasts-EpiNow2 EuroCOVIDhub-baseline         0.7297431 9.164893e-18
#> 16:  epiforecasts-EpiNow2  epiforecasts-EpiNow2         1.0000000 1.000000e+00
#>         adj_pval wis_relative_skill wis_scaled_relative_skill
#>            <num>              <num>                     <num>
#>  1: 3.665957e-17          1.6032604                 1.0000000
#>  2: 1.313732e-19          1.6032604                 1.0000000
#>  3: 1.565200e-31          1.6032604                 1.0000000
#>  4: 1.000000e+00          1.6032604                 1.0000000
#>  5: 1.565200e-31          0.8074916                 0.5036559
#>  6: 3.734192e-04          0.8074916                 0.5036559
#>  7: 1.881520e-02          0.8074916                 0.5036559
#>  8: 1.000000e+00          0.8074916                 0.5036559
#>  9: 3.734192e-04          0.7475873                 0.4662919
#> 10: 1.450776e-02          0.7475873                 0.4662919
#> 11: 1.313732e-19          0.7475873                 0.4662919
#> 12: 1.000000e+00          0.7475873                 0.4662919
#> 13: 1.450776e-02          1.0332277                 0.6444541
#> 14: 1.881520e-02          1.0332277                 0.6444541
#> 15: 3.665957e-17          1.0332277                 0.6444541
#> 16: 1.000000e+00          1.0332277                 0.6444541
forecast_quantile %>%
  score() %>%
  summarise_scores(by = "model") %>%
  add_relative_skill(baseline = "EuroCOVIDhub-baseline")
#> Warning: ! `by` is set to 'model', which is also the unit of a single forecast. This
#>   doesn't look right.
#>  All relative skill scores will be equal to 1.
#> Key: <model>
#>                    model         wis overprediction underprediction dispersion
#>                   <char>       <num>          <num>           <num>      <num>
#> 1: EuroCOVIDhub-baseline 14321.48926    7081.000000      5143.53567 2096.95360
#> 2: EuroCOVIDhub-ensemble  8992.62316    5025.130095      2120.64029 1846.85278
#> 3:       UMass-MechBayes    52.65195       8.978601        16.80095   26.87239
#> 4:  epiforecasts-EpiNow2 10827.40786    6179.439535      1697.23411 2950.73422
#>           bias interval_coverage_50 interval_coverage_90
#>          <num>                <num>                <num>
#> 1:  0.21851562            0.4960938            0.9101562
#> 2:  0.00812500            0.6328125            0.9023438
#> 3: -0.02234375            0.4609375            0.8750000
#> 4: -0.04336032            0.4453441            0.8461538
#>    interval_coverage_deviation   ae_median wis_relative_skill
#>                          <num>       <num>              <num>
#> 1:                 0.002102273 19353.42969                  1
#> 2:                 0.050752841 12077.10156                  1
#> 3:                -0.024886364    78.47656                  1
#> 4:                -0.057861612 14521.10526                  1
#>    wis_scaled_relative_skill
#>                        <num>
#> 1:                         1
#> 2:                         1
#> 3:                         1
#> 4:                         1

If using the get_pairwise_comparisons() function, we can also visualise pairwise comparisons by showing the mean score ratios between models. By default, smaller values are better; the model of interest is shown on the y-axis, and the model it is compared against on the x-axis. In the example above, the EuroCOVIDhub-ensemble performs best (all its ratios are smaller than 1), while the EuroCOVIDhub-baseline performs worst (all its ratios are larger than 1). For cases, the UMass-MechBayes model is excluded, as it made no case forecasts and the set of overlapping forecasts is therefore empty.

forecast_quantile %>%
  score() %>%
  get_pairwise_comparisons(by = c("model", "target_type")) %>%
  plot_pairwise_comparisons() +
  facet_wrap(~ target_type)

Additional analyses and visualisations

Correlation between scores

It may sometimes be interesting to see how different scores correlate with each other. We can examine this using the function get_correlations(). When dealing with quantile-based forecasts, it is important to call summarise_scores() before get_correlations() to summarise over quantiles before computing correlations.

forecast_quantile %>%
  score() %>%
  summarise_scores() %>%
  get_correlations()
#>          wis overprediction underprediction  dispersion       bias
#>        <num>          <num>           <num>       <num>      <num>
#> 1: 1.0000000      0.9917884       0.8821058  0.86564717 0.61587946
#> 2: 0.9917884      1.0000000       0.8158534  0.91900505 0.51068021
#> 3: 0.8821058      0.8158534       1.0000000  0.52982259 0.91280769
#> 4: 0.8656472      0.9190050       0.5298226  1.00000000 0.14860643
#> 5: 0.6158795      0.5106802       0.9128077  0.14860643 1.00000000
#> 6: 0.1565567      0.1740382       0.1630289  0.04407105 0.06639491
#> 7: 0.2818348      0.1968480       0.6141229 -0.17841559 0.74137210
#> 8: 0.1364221      0.1123720       0.2954741 -0.12711127 0.30908292
#> 9: 0.9999780      0.9909177       0.8851559  0.86247754 0.62106954
#>    interval_coverage_50 interval_coverage_90 interval_coverage_deviation
#>                   <num>                <num>                       <num>
#> 1:           0.15655670            0.2818348                   0.1364221
#> 2:           0.17403818            0.1968480                   0.1123720
#> 3:           0.16302890            0.6141229                   0.2954741
#> 4:           0.04407105           -0.1784156                  -0.1271113
#> 5:           0.06639491            0.7413721                   0.3090829
#> 6:           1.00000000            0.6417542                   0.9466440
#> 7:           0.64175425            1.0000000                   0.8451292
#> 8:           0.94664396            0.8451292                   1.0000000
#> 9:           0.15565388            0.2861731                   0.1376834
#>    ae_median                      metric
#>        <num>                      <char>
#> 1: 0.9999780                         wis
#> 2: 0.9909177              overprediction
#> 3: 0.8851559             underprediction
#> 4: 0.8624775                  dispersion
#> 5: 0.6210695                        bias
#> 6: 0.1556539        interval_coverage_50
#> 7: 0.2861731        interval_coverage_90
#> 8: 0.1376834 interval_coverage_deviation
#> 9: 1.0000000                   ae_median

Visualising correlations:
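
The correlations computed above can be passed to a plotting function; the sketch below assumes plot_correlations(), as available in recent versions of the package:

forecast_quantile %>%
  score() %>%
  summarise_scores() %>%
  get_correlations() %>%
  plot_correlations()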

Converting to quantile-based forecasts

Different metrics are available for different forecasting formats. In some cases, you may for example have forecasts in a sample-based format, but wish to make use of some of the functionality only available to quantile-based forecasts. For example, you may want to use the decomposition of the weighted interval score, or may like to compute interval coverage values.

You can convert your sample-based forecasts into a quantile-based format using the function sample_to_quantile(). There is, however, one caveat: Quantiles will be calculated based on the predictive samples, which may introduce a bias if the number of available samples is small.

example_sample_discrete %>%
  as_forecast() %>%
  sample_to_quantile(
    quantile_level = c(0.01, 0.025, seq(0.05, 0.95, 0.05), 0.975, 0.99)
  ) %>%
  as_forecast() %>%
  score()
#>  Some rows containing NA values may be removed. This is fine if not
#>   unexpected.
#>  Some rows containing NA values may be removed. This is fine if not
#>   unexpected.
#>  Some rows containing NA values may be removed. This is fine if not
#>   unexpected.
#>      location location_name target_end_date target_type forecast_date
#>        <char>        <char>          <Date>      <char>        <Date>
#>   1:       DE       Germany      2021-05-08       Cases    2021-05-03
#>   2:       DE       Germany      2021-05-08       Cases    2021-05-03
#>   3:       DE       Germany      2021-05-08       Cases    2021-05-03
#>   4:       DE       Germany      2021-05-08      Deaths    2021-05-03
#>   5:       DE       Germany      2021-05-08      Deaths    2021-05-03
#>  ---                                                                 
#> 883:       IT         Italy      2021-07-24      Deaths    2021-07-12
#> 884:       IT         Italy      2021-07-24      Deaths    2021-07-05
#> 885:       IT         Italy      2021-07-24      Deaths    2021-07-12
#> 886:       IT         Italy      2021-07-24      Deaths    2021-07-05
#> 887:       IT         Italy      2021-07-24      Deaths    2021-07-12
#>                      model horizon         wis overprediction underprediction
#>                     <char>   <num>       <num>          <num>           <num>
#>   1: EuroCOVIDhub-ensemble       1  6671.68207   2.110043e+03        0.000000
#>   2: EuroCOVIDhub-baseline       1 18441.13061   1.446498e+04        0.000000
#>   3:  epiforecasts-EpiNow2       1 22042.77309   1.548071e+04        0.000000
#>   4: EuroCOVIDhub-ensemble       1    59.46028   9.682609e+00        0.000000
#>   5: EuroCOVIDhub-baseline       1    75.93215   0.000000e+00        1.195652
#>  ---                                                                         
#> 883: EuroCOVIDhub-baseline       2    60.71862   1.323913e+01        0.000000
#> 884:       UMass-MechBayes       3     5.79650   3.695652e-01        0.000000
#> 885:       UMass-MechBayes       2    26.18112   1.891304e+01        0.000000
#> 886:  epiforecasts-EpiNow2       3    25.13032   3.478261e-01        0.000000
#> 887:  epiforecasts-EpiNow2       2    44.07931   2.410435e+01        0.000000
#>       dispersion  bias interval_coverage_50 interval_coverage_90
#>            <num> <num>               <lgcl>               <lgcl>
#>   1: 4561.638591  0.60                FALSE                 TRUE
#>   2: 3976.145828  0.98                FALSE                FALSE
#>   3: 6562.060048  0.90                FALSE                 TRUE
#>   4:   49.777674  0.30                 TRUE                 TRUE
#>   5:   74.736502 -0.10                 TRUE                 TRUE
#>  ---                                                            
#> 883:   47.479491  0.50                 TRUE                 TRUE
#> 884:    5.426935  0.20                 TRUE                 TRUE
#> 885:    7.268076  0.90                FALSE                 TRUE
#> 886:   24.782489  0.10                 TRUE                 TRUE
#> 887:   19.974959  0.70                FALSE                 TRUE
#>      interval_coverage_deviation ae_median
#>                            <num>     <num>
#>   1:                 -0.03909091    9414.0
#>   2:                 -0.49363636   29379.0
#>   3:                 -0.31181818   36513.0
#>   4:                  0.23363636      77.0
#>   5:                  0.41545455      27.5
#>  ---                                      
#> 883:                  0.05181818      58.5
#> 884:                  0.32454545       5.0
#> 885:                 -0.31181818      42.5
#> 886:                  0.41545455       8.0
#> 887:                 -0.13000000      73.5

Lower level functions

Most of these metrics are also available as lower-level functions and are extensively documented. Have a look at the help files to understand them better.
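
For example, the CRPS for sample-based forecasts can be computed directly from vectors and matrices (a minimal sketch; the argument names observed/predicted follow recent package versions, and crps_sample() wraps the corresponding scoringRules function):

observed <- c(150, 2000)        # two observed values
predicted <- rbind(
  rpois(1000, lambda = 150),    # 1,000 predictive samples for the first observation
  rpois(1000, lambda = 2000)    # 1,000 predictive samples for the second observation
)
crps_sample(observed, predicted)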