Scoring forecasts directly
Nikos Bosse
2024-03-29
A variety of metrics and scoring rules can also be accessed directly
through the scoringutils
package.
The following gives an overview of (most of) the implemented metrics.
Bias
The function bias_sample()
determines bias from predictive
Monte-Carlo samples, automatically recognising whether forecasts are
continuous or integer valued.
For continuous forecasts, Bias is measured as \[B_t (P_t, x_t) = 1 - 2 \cdot (P_t (x_t))\]
where \(P_t\) is the empirical cumulative distribution function of the prediction for the observed value \(x_t\). Computationally, \(P_t (x_t)\) is just calculated as the fraction of predictive samples for \(x_t\) that are smaller than \(x_t\).
For integer valued forecasts, Bias is measured as
\[B_t (P_t, x_t) = 1 - (P_t (x_t) + P_t (x_t - 1))\]
to adjust for the integer nature of the forecasts. In both cases, Bias can assume values between -1 and 1 and is 0 ideally.
library(scoringutils)

## integer valued forecasts
observed <- rpois(30, lambda = 1:30)
predicted <- replicate(200, rpois(n = 30, lambda = 1:30))
bias_sample(observed, predicted)
#> [1] 0.665 -0.480 -0.135 0.645 0.970 -0.125 -0.060 0.555 -0.500 -0.640
#> [11] 0.770 0.130 0.205 0.460 0.615 0.700 0.965 -0.395 0.620 0.200
#> [21] -0.295 0.735 0.040 -0.225 -0.485 -0.980 0.215 0.395 0.190 0.645
## continuous forecasts
observed <- rnorm(30, mean = 1:30)
predicted <- replicate(200, rnorm(30, mean = 1:30))
bias_sample(observed, predicted)
#> [1] 0.61 0.06 0.14 -0.33 -0.40 -0.99 0.67 -0.95 0.20 0.80 -0.28 0.51
#> [13] -0.87 -0.72 0.73 -0.89 -0.94 -0.25 0.22 0.64 -0.11 0.68 0.61 0.49
#> [25] 0.94 0.80 -0.13 -0.01 0.34 0.29
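The continuous-forecast formula above can be reproduced in a few lines of base R. This is only a sketch of the formula, not the package implementation; `bias_by_hand()` is a hypothetical helper introduced here for illustration, and it covers the continuous case only.

```r
# Hypothetical helper: bias of one continuous forecast from its samples
bias_by_hand <- function(observed, samples) {
  # Empirical CDF at the observed value: share of samples below it
  p_x <- mean(samples < observed)
  1 - 2 * p_x
}

# Three of four samples lie above the observation, so bias is positive
single <- bias_by_hand(observed = 2, samples = c(1, 3, 4, 5))
single # 0.5

# The same computation for a 30 x 200 matrix (one row per forecast)
set.seed(123)
observed <- rnorm(30, mean = 1:30)
predicted <- replicate(200, rnorm(30, mean = 1:30))
vectorised <- 1 - 2 * rowMeans(predicted < observed)
```

Positive values indicate that most of the predictive mass lies above the observed value, negative values that most of it lies below.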
Sharpness
Sharpness is the ability of the model to generate predictions within a narrow range. It is a data-independent measure, and is purely a feature of the forecasts themselves.
The sharpness (dispersion) of the predictive samples corresponding to a
single observed value is measured as the normalised median of the
absolute deviation from the median of the predictive samples. For
details, see ?stats::mad.
predicted <- replicate(200, rpois(n = 30, lambda = 1:30))
mad_sample(predicted = predicted)
#> [1] 1.4826 1.4826 1.4826 1.4826 2.9652 2.9652 2.9652 2.9652 2.9652 2.9652
#> [11] 2.9652 4.4478 2.9652 2.9652 4.4478 4.4478 3.7065 4.4478 4.4478 4.4478
#> [21] 4.4478 4.4478 4.4478 4.4478 5.9304 4.4478 4.4478 5.9304 5.9304 4.4478
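As a rough check of the definition, the same quantity can be computed with base R alone. The sketch assumes that one normalised MAD is computed per row of the sample matrix (one row per forecast); the names `per_row_mad` and `written_out` are introduced here for illustration.

```r
set.seed(1)
predicted <- replicate(200, rpois(n = 30, lambda = 1:30))

# One normalised MAD per forecast (row of the sample matrix)
per_row_mad <- apply(predicted, 1, stats::mad)

# The same quantity written out: 1.4826 is the default scaling
# constant of stats::mad()
row_medians <- apply(predicted, 1, median)
written_out <- 1.4826 * apply(abs(predicted - row_medians), 1, median)

all.equal(per_row_mad, written_out)
```

Because only the predictive samples enter the computation, the result does not change when the observed values change.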
Calibration
Calibration or reliability of forecasts is the ability of a model to correctly identify its own uncertainty in making predictions. In a model with perfect calibration, the observed data at each time point look as if they came from the predictive probability distribution at that time.
Equivalently, one can inspect the probability integral transform of the predictive distribution at time \(t\),
\[u_t = F_t (x_t)\]
where \(x_t\) is the observed data point at time \(t \in \{t_1, \ldots, t_n\}\), \(n\) being the number of forecasts, and \(F_t\) is the (continuous) predictive cumulative probability distribution at time \(t\). If the true probability distribution of outcomes at time \(t\) is \(G_t\), then the forecasts \(F_t\) are said to be ideal if \(F_t = G_t\) at all times \(t\). In that case, the probabilities \(u_t\) are distributed uniformly.
In the case of discrete outcomes such as incidence counts, the PIT is no longer uniform even when forecasts are ideal. In that case a randomised PIT can be used instead:
\[u_t = P_t(k_t - 1) + v \cdot (P_t(k_t) - P_t(k_t - 1) )\]
where \(k_t\) is the observed count, \(P_t(k)\) is the predictive cumulative probability of observing an incidence of at most \(k\) at time \(t\), \(P_t (-1) = 0\) by definition and \(v\) is standard uniform and independent of \(k_t\). If \(P_t\) is the true cumulative probability distribution, then \(u_t\) is standard uniform.
The function checks whether integer or continuous forecasts were provided. It then applies the (randomised) probability integral transform and tests the values \(u_t\) for uniformity using the Anderson-Darling test.
As a rule of thumb, there is no evidence that a forecasting model is miscalibrated if the p-value is \(p \geq 0.1\), some evidence that it is miscalibrated if \(0.01 < p < 0.1\), and good evidence that it is miscalibrated if \(p \leq 0.01\). Note, though, that uniformity of the PIT is a necessary but not sufficient condition for calibration. Note also that the test only works given sufficient samples; otherwise the null hypothesis will often be rejected outright.
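The randomised PIT for count data can be sketched in base R. The example assumes an ideal forecaster (observations and forecasts share the same Poisson distribution) and uses the form \(u_t = P_t(k_t - 1) + v \cdot (P_t(k_t) - P_t(k_t - 1))\), which keeps \(u_t\) between 0 and 1. A Kolmogorov-Smirnov test from base R stands in for the Anderson-Darling test here, purely to avoid an extra dependency; all variable names are illustrative.

```r
set.seed(42)
n <- 1000
lambda <- 10

# Observed counts from the same distribution as the forecast
k <- rpois(n, lambda)

# Predictive CDF at k and at k - 1; ppois() returns 0 for -1
p_k <- ppois(k, lambda)
p_km1 <- ppois(k - 1, lambda)

# Randomised PIT: a uniform draw between P(k - 1) and P(k)
v <- runif(n)
u <- p_km1 + v * (p_k - p_km1)

# Uniformity check (Kolmogorov-Smirnov as a stand-in)
uniformity_test <- ks.test(u, "punif")
uniformity_test$p.value
```

For an ideal forecaster such as this one, a histogram of `u` should look roughly flat.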
Continuous Ranked Probability Score (CRPS)
Wrapper around the crps_sample()
function from the
scoringRules
package; see the scoringRules
documentation for details. The function can be
used for continuous as well as integer valued forecasts. Smaller values
are better.
observed <- rpois(30, lambda = 1:30)
predicted <- replicate(200, rpois(n = 30, lambda = 1:30))
crps_sample(observed, predicted)
#> [1] 0.237475 0.623275 0.538750 1.548750 0.825975 1.981925 1.587925 0.799075
#> [9] 0.739325 1.915850 0.968400 3.732725 0.915500 1.277550 2.820400 1.387775
#> [17] 1.149350 1.115050 6.234575 1.331500 1.926875 2.619450 2.003875 2.269400
#> [25] 6.307900 4.429525 1.944325 3.766775 1.774575 1.636525
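To make the score less of a black box, the sketch below computes a sample-based CRPS from the identity \(\text{CRPS}(F, y) = E|X - y| - \frac{1}{2} E|X - X'|\), where \(X\) and \(X'\) are independent draws from the forecast. `crps_by_hand()` is a hypothetical helper; that it matches the estimator scoringRules uses by default is an assumption, not something stated in this vignette.

```r
# Hypothetical helper: empirical CRPS for one observation
crps_by_hand <- function(observed, samples) {
  # Mean absolute distance between samples and the observation
  term_obs <- mean(abs(samples - observed))
  # Mean absolute distance between all pairs of samples
  term_spread <- mean(abs(outer(samples, samples, "-")))
  term_obs - 0.5 * term_spread
}

small_example <- crps_by_hand(observed = 2, samples = c(1, 2, 3))
small_example # 2/3 - 0.5 * 8/9 = 2/9

# A forecast concentrated on the observed value scores 0
point_example <- crps_by_hand(observed = 2, samples = c(2, 2, 2))
```

The first term rewards samples close to the observation, while the second term gives credit for the spread of the forecast.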
Dawid-Sebastiani Score (DSS)
Wrapper around the dss_sample()
function from the
scoringRules
package; see the scoringRules
documentation for details. The function can be
used for continuous as well as integer valued forecasts. Smaller values
are better.
observed <- rpois(30, lambda = 1:30)
predicted <- replicate(200, rpois(n = 30, lambda = 1:30))
dss_sample(observed, predicted)
#> [1] 1.019946 1.293652 1.516522 1.390396 1.507416 2.538176 3.992445 2.547948
#> [9] 3.257033 7.200052 4.456814 2.437308 4.305648 2.833647 3.002493 6.538905
#> [17] 2.769782 4.278974 2.723932 5.798373 3.052567 6.490496 3.247760 3.677784
#> [25] 3.343406 4.784797 7.608433 3.952039 3.349231 3.645527
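The Dawid-Sebastiani score depends on the forecast only through its mean \(\mu\) and variance \(\sigma^2\): \(\text{DSS} = (y - \mu)^2 / \sigma^2 + \log \sigma^2\). A minimal sketch follows; `dss_by_hand()` is a hypothetical helper, and the choice of variance estimator (denominator \(n\) rather than \(n - 1\)) is an assumption that may differ from the package.

```r
# Hypothetical helper: DSS for one observation from predictive samples
dss_by_hand <- function(observed, samples) {
  mu <- mean(samples)
  # Variance with denominator n (an assumption, see the text above)
  sigma2 <- mean((samples - mu)^2)
  (observed - mu)^2 / sigma2 + log(sigma2)
}

# Observation equal to the sample mean: only the log-variance term remains
centred <- dss_by_hand(observed = 3, samples = c(1, 2, 3, 4, 5))
centred # log(2)

# An observation further from the mean is penalised
off_centre <- dss_by_hand(observed = 6, samples = c(1, 2, 3, 4, 5))
```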
Log Score
Wrapper around the logs_sample()
function from the
scoringRules
package; see the scoringRules
documentation for details. The function should
not be used for integer valued forecasts. While Log Scores are in
principle possible for integer valued forecasts, they require a kernel
density estimate, which is not well defined for discrete values. Smaller
values are better.
observed <- rnorm(30, mean = 1:30)
predicted <- replicate(200, rnorm(n = 30, mean = 1:30))
logs_sample(observed, predicted)
#> [1] 1.7241535 1.0337895 4.1769926 2.4739542 1.0613186 1.3341513 1.1130753
#> [8] 1.0838730 2.6091516 1.0126545 1.3364958 1.3442711 2.2570649 2.0989020
#> [15] 2.8240922 1.4994136 1.0730204 0.9406536 1.5261497 1.0757350 2.7026433
#> [22] 1.0973817 0.9912128 1.1101199 1.3870615 1.5363828 0.9535589 0.9904320
#> [29] 0.8964837 1.1009145
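The role of the kernel density estimate can be illustrated with a short base R sketch: the predictive density at the observed value is approximated by averaging Gaussian kernels centred on the samples, and the score is the negative log of that density. `logs_by_hand()` is a hypothetical helper, and the bandwidth rule (`stats::bw.nrd()`) is an assumption, so values need not match the package exactly.

```r
# Hypothetical helper: log score via a Gaussian kernel density estimate
logs_by_hand <- function(observed, samples) {
  # Rule-of-thumb bandwidth (an assumption, see the text above)
  bw <- stats::bw.nrd(samples)
  # Density estimate at the observed value: average of Gaussian kernels
  dens <- mean(stats::dnorm(observed, mean = samples, sd = bw))
  -log(dens)
}

set.seed(1)
samples <- rnorm(200, mean = 5)

# An observation in the centre of the forecast scores better (lower)
score_centre <- logs_by_hand(observed = 5, samples = samples)
# than an observation far out in the tail
score_tail <- logs_by_hand(observed = 9, samples = samples)
```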
Brier Score
The Brier score is a proper scoring rule that assesses the accuracy of probabilistic binary predictions. The outcomes can be either 0 or 1, and each prediction must be the probability that the observed outcome will be 1.
The Brier Score is then computed as the mean squared error between the probabilistic prediction and the observed outcome:
\[\text{Brier Score} = \frac{1}{n} \sum_{t = 1}^{n} (\text{prediction}_t - \text{outcome}_t)^2\]
As the output below shows, brier_score() returns the squared error of each forecast separately; averaging these values gives the score above.
observed <- factor(sample(c(0, 1), size = 30, replace = TRUE))
predicted <- runif(n = 30, min = 0, max = 1)
brier_score(observed, predicted)
#> [1] 0.7488537306 0.3699527389 0.4808164408 0.0272735812 0.0190253058
#> [6] 0.5301718446 0.0000891646 0.9227080386 0.0075959985 0.1566985012
#> [11] 0.1821985142 0.2848574174 0.0309986160 0.0138080537 0.8318239570
#> [16] 0.7515761280 0.6591139830 0.1081313976 0.6602473950 0.2140810210
#> [21] 0.0003576128 0.0423122677 0.1436688756 0.0721284649 0.3144916990
#> [26] 0.3184441517 0.0355875371 0.3061250136 0.7575954108 0.5632575581
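The formula can be checked by hand on a small example. The following base R sketch uses made-up outcomes and probabilities, chosen only for illustration.

```r
# Illustrative data: four binary outcomes and predicted probabilities
observed <- c(1, 0, 1, 0)
predicted <- c(0.9, 0.2, 0.4, 0.5)

# Squared error of each probabilistic prediction
squared_errors <- (predicted - observed)^2
squared_errors # 0.01 0.04 0.36 0.25

# Averaging over all forecasts gives the Brier score
brier <- mean(squared_errors)
brier # 0.165
```

A confident and correct forecast (0.9 for an outcome of 1) contributes almost nothing, while a hesitant or wrong one dominates the average.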
Interval Score
The Interval Score is a proper scoring rule to score quantile predictions, following Gneiting and Raftery (2007). Smaller values are better.
The score is computed as
\[
\begin{aligned}
\text{score} = {} & (\text{upper} - \text{lower}) \\
& + \frac{2}{\alpha} \cdot (\text{lower} - \text{observed}) \cdot 1(\text{observed} < \text{lower}) \\
& + \frac{2}{\alpha} \cdot (\text{observed} - \text{upper}) \cdot 1(\text{observed} > \text{upper})
\end{aligned}
\]
where \(1()\) is the indicator
function and \(\alpha\) is the decimal
value that indicates how much probability mass lies outside the prediction interval. To
improve usability, the user is asked to provide an interval range in
percentage terms, i.e. interval_range = 90 (percent) for a 90 percent
prediction interval. Correspondingly, the user would have to provide the
5% and 95% quantiles (the corresponding alpha would then be 0.1). No
specific distribution is assumed, but the range has to be symmetric (i.e.
you cannot use the 0.1 quantile as the lower bound and the 0.7 quantile
as the upper bound). Setting weigh = TRUE
will weigh the score by
\(\frac{\alpha}{2}\) such that the
Interval Score converges to the CRPS for an increasing number of
quantiles.
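A minimal base R sketch of the formula, including the optional weighting, is given below. `interval_score_by_hand()` is a hypothetical helper written for illustration; its argument names mirror those used in the text, but it is not the package function.

```r
# Hypothetical helper: interval score for central prediction intervals
interval_score_by_hand <- function(observed, lower, upper,
                                   interval_range, weigh = TRUE) {
  # Share of probability mass outside the interval, e.g. 0.1 for 90%
  alpha <- (100 - interval_range) / 100
  # Width of the interval
  width <- upper - lower
  # Penalties apply only when the observation falls outside the interval
  below <- 2 / alpha * (lower - observed) * (observed < lower)
  above <- 2 / alpha * (observed - upper) * (observed > upper)
  score <- width + below + above
  if (weigh) {
    score <- score * alpha / 2
  }
  score
}

# First observation lies inside the 90% interval, second lies above it
unweighted <- interval_score_by_hand(
  observed = c(5, 12), lower = c(2, 2), upper = c(10, 10),
  interval_range = 90, weigh = FALSE
)
unweighted # 8 48

weighted <- interval_score_by_hand(
  observed = c(5, 12), lower = c(2, 2), upper = c(10, 10),
  interval_range = 90, weigh = TRUE
)
weighted # 0.4 2.4
```

The second observation misses the interval by 2 units, which at \(\alpha = 0.1\) adds a penalty of \(2 / 0.1 \cdot 2 = 40\) to the interval width of 8.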