Function to check the input data before running score().
The data should come in one of three formats:
- A format for binary predictions (see example_binary)
- A sample-based format for discrete or continuous predictions (see example_continuous and example_integer)
- A quantile-based format (see example_quantile)
Arguments
- data
A data.frame or data.table with the predictions and observations. For scoring using score(), the following columns need to be present:
  - true_value: the true observed values
  - prediction: predictions or predictive samples for one true value. (The prediction column can be omitted only if you want to score quantile forecasts in a wide range format.)
For scoring integer and continuous forecasts, a sample column is needed:
  - sample: an index to identify the predictive samples in the prediction column generated by one model for one true value. Only necessary for continuous and integer forecasts, not for binary predictions.
For scoring forecasts in a quantile format, you should provide a column called quantile:
  - quantile: the quantile to which the prediction corresponds
In addition, a model column is suggested. If it is not present, this will be flagged and the column will be added to the input data with all forecasts assigned to an "unspecified model". You can check the format of your data using check_forecasts(), and there are examples for each format (example_quantile, example_continuous, example_integer, and example_binary).
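The column requirements above can be illustrated with small hand-built inputs. This is a minimal base R sketch: the object names (`binary_forecast`, `sample_forecast`, `quantile_forecast`), the `location` column, and all values are made up for illustration and are not taken from the package's example data.

```r
# Toy inputs in the three formats described above (all values are invented).

# Binary format: true_value is 0 or 1, prediction is a probability.
binary_forecast <- data.frame(
  model      = "model_a",
  location   = c("DE", "IT"),
  true_value = c(0, 1),
  prediction = c(0.2, 0.7)
)

# Sample-based format: one row per predictive sample, indexed by `sample`.
sample_forecast <- data.frame(
  model      = "model_a",
  location   = "DE",
  true_value = 10,
  sample     = 1:3,
  prediction = c(9, 11, 10)
)

# Quantile-based format: one row per quantile level.
quantile_forecast <- data.frame(
  model      = "model_a",
  location   = "DE",
  true_value = 10,
  quantile   = c(0.25, 0.5, 0.75),
  prediction = c(8, 10, 13)
)

str(quantile_forecast)
```

Objects shaped like these are what one would then pass to check_forecasts() before scoring.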
Value
A list with elements that give information about what scoringutils thinks you are trying to do and potential issues.
target_type
the type of the prediction target as inferred from the input: 'binary' if all values in true_value are either 0 or 1 and values in prediction are between 0 and 1, 'discrete' if all true values are integers (printed as "integer" in the Examples output), and 'continuous' if not.
prediction_type
inferred type of the prediction: 'quantile' if there is a column called 'quantile', else 'discrete' if all values in prediction are integer, else 'continuous'.
forecast_unit
unit of a single forecast, i.e. the grouping that uniquely defines a single forecast. This is assumed to be all present columns apart from the following protected columns: c("prediction", "true_value", "sample", "quantile", "range", "boundary"). It is important that you remove all unnecessary columns before scoring.
rows_per_forecast
a data.frame that shows how many rows (usually quantiles or samples) are available per forecast. If a forecast model has several entries, then there are forecasts with differing numbers of quantiles / samples.
unique_values
A data.frame that shows how many unique values are present per model and column in the data. This doesn't directly show missing values, but rather the maximum number of unique values across the whole data.
warnings
A vector with warnings. These can be ignored if you know what you are doing.
errors
A vector with issues that will cause an error when running score().
messages
A verbal explanation of the information provided above.
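The forecast_unit and rows_per_forecast entries can be approximated by hand, which helps when deciding which columns to remove before scoring. The following is a base R sketch of the rule stated above (all columns except the protected ones), not the package's own implementation; the data frame `forecasts` and its values are hypothetical.

```r
# Hypothetical quantile-format input (values are invented).
forecasts <- data.frame(
  model      = "model_a",
  location   = "DE",
  true_value = 10,
  quantile   = c(0.25, 0.5, 0.75),
  prediction = c(8, 10, 13)
)

# Protected columns as listed in the description of forecast_unit.
protected <- c("prediction", "true_value", "sample",
               "quantile", "range", "boundary")

# Every remaining column is treated as part of the forecast unit.
forecast_unit <- setdiff(names(forecasts), protected)
forecast_unit

# Count rows (here: quantiles) per forecast unit.
rows_per_forecast <- aggregate(
  prediction ~ model + location,
  data = forecasts,
  FUN  = length
)
rows_per_forecast
```

If a column that does not help identify a forecast is left in the data (for example a free-text comment), it would end up in `forecast_unit` under this rule, which is why unnecessary columns should be dropped first.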
See also
Function to move from sample-based to quantile format: sample_to_quantile()
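To make the relationship between the two formats concrete, here is a rough base R illustration of what moving from samples to quantiles means for a single forecast. It uses stats::quantile() directly and is only a sketch under that assumption, not the implementation of sample_to_quantile(); the object names and values are invented.

```r
# Hypothetical sample-based forecast for one true value (values are invented).
samples <- data.frame(
  model      = "model_a",
  true_value = 10,
  sample     = 1:5,
  prediction = c(7, 9, 10, 12, 15)
)

# Quantile levels to summarise the predictive samples at.
probs <- c(0.25, 0.5, 0.75)

# One row per quantile level; the `sample` column is replaced by `quantile`.
quantile_format <- data.frame(
  model      = "model_a",
  true_value = 10,
  quantile   = probs,
  prediction = unname(stats::quantile(samples$prediction, probs = probs))
)
quantile_format
```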
Author
Nikos Bosse nikosbosse@gmail.com
Examples
check <- check_forecasts(example_quantile)
#> The following messages were produced when checking inputs:
#> 1. 144 values for `prediction` are NA in the data provided and the corresponding rows were removed. This may indicate a problem if unexpected.
print(check)
#> Your forecasts seem to be for a target of the following type:
#> $target_type
#> [1] "integer"
#>
#> and in the following format:
#> $prediction_type
#> [1] "quantile"
#>
#> The unit of a single forecast is defined by:
#> $forecast_unit
#> [1] "location" "target_end_date" "target_type" "location_name"
#> [5] "forecast_date" "model" "horizon"
#>
#> Cleaned data, rows with NA values in prediction or true_value removed:
#> $cleaned_data
#> location target_end_date target_type true_value location_name
#> 1: DE 2021-05-08 Cases 106987 Germany
#> 2: DE 2021-05-08 Cases 106987 Germany
#> 3: DE 2021-05-08 Cases 106987 Germany
#> 4: DE 2021-05-08 Cases 106987 Germany
#> 5: DE 2021-05-08 Cases 106987 Germany
#> ---
#> 20397: IT 2021-07-24 Deaths 78 Italy
#> 20398: IT 2021-07-24 Deaths 78 Italy
#> 20399: IT 2021-07-24 Deaths 78 Italy
#> 20400: IT 2021-07-24 Deaths 78 Italy
#> 20401: IT 2021-07-24 Deaths 78 Italy
#> forecast_date quantile prediction model horizon
#> 1: 2021-05-03 0.010 82466 EuroCOVIDhub-ensemble 1
#> 2: 2021-05-03 0.025 86669 EuroCOVIDhub-ensemble 1
#> 3: 2021-05-03 0.050 90285 EuroCOVIDhub-ensemble 1
#> 4: 2021-05-03 0.100 95341 EuroCOVIDhub-ensemble 1
#> 5: 2021-05-03 0.150 99171 EuroCOVIDhub-ensemble 1
#> ---
#> 20397: 2021-07-12 0.850 352 epiforecasts-EpiNow2 2
#> 20398: 2021-07-12 0.900 397 epiforecasts-EpiNow2 2
#> 20399: 2021-07-12 0.950 499 epiforecasts-EpiNow2 2
#> 20400: 2021-07-12 0.975 611 epiforecasts-EpiNow2 2
#> 20401: 2021-07-12 0.990 719 epiforecasts-EpiNow2 2
#>
#> Number of unique values per column per model:
#> $unique_values
#> model location target_end_date target_type true_value
#> 1: EuroCOVIDhub-ensemble 4 12 2 96
#> 2: EuroCOVIDhub-baseline 4 12 2 96
#> 3: epiforecasts-EpiNow2 4 12 2 95
#> 4: UMass-MechBayes 4 12 1 48
#> location_name forecast_date quantile prediction horizon
#> 1: 4 11 23 3969 3
#> 2: 4 11 23 3733 3
#> 3: 4 11 23 3903 3
#> 4: 4 11 23 1058 3
#>
#> $messages
#> [1] "144 values for `prediction` are NA in the data provided and the corresponding rows were removed. This may indicate a problem if unexpected."
#>
check_forecasts(example_binary)
#> The following messages were produced when checking inputs:
#> 1. 144 values for `true_value` are NA in the data provided and the corresponding rows were removed. This may indicate a problem if unexpected.
#> 2. 144 values for `prediction` are NA in the data provided and the corresponding rows were removed. This may indicate a problem if unexpected.
#> Your forecasts seem to be for a target of the following type:
#> $target_type
#> [1] "binary"
#>
#> and in the following format:
#> $prediction_type
#> [1] "continuous"
#>
#> The unit of a single forecast is defined by:
#> $forecast_unit
#> [1] "location" "location_name" "target_end_date" "target_type"
#> [5] "forecast_date" "model" "horizon"
#>
#> Cleaned data, rows with NA values in prediction or true_value removed:
#> $cleaned_data
#> location location_name target_end_date target_type forecast_date
#> 1: DE Germany 2021-05-08 Cases 2021-05-03
#> 2: DE Germany 2021-05-08 Cases 2021-05-03
#> 3: DE Germany 2021-05-08 Cases 2021-05-03
#> 4: DE Germany 2021-05-08 Deaths 2021-05-03
#> 5: DE Germany 2021-05-08 Deaths 2021-05-03
#> ---
#> 883: IT Italy 2021-07-24 Deaths 2021-07-12
#> 884: IT Italy 2021-07-24 Deaths 2021-07-05
#> 885: IT Italy 2021-07-24 Deaths 2021-07-12
#> 886: IT Italy 2021-07-24 Deaths 2021-07-05
#> 887: IT Italy 2021-07-24 Deaths 2021-07-12
#> model horizon prediction true_value
#> 1: EuroCOVIDhub-ensemble 1 0.375 0
#> 2: EuroCOVIDhub-baseline 1 0.475 0
#> 3: epiforecasts-EpiNow2 1 0.425 0
#> 4: EuroCOVIDhub-ensemble 1 0.425 0
#> 5: EuroCOVIDhub-baseline 1 0.500 1
#> ---
#> 883: EuroCOVIDhub-baseline 2 0.250 0
#> 884: UMass-MechBayes 3 0.475 0
#> 885: UMass-MechBayes 2 0.450 0
#> 886: epiforecasts-EpiNow2 3 0.375 0
#> 887: epiforecasts-EpiNow2 2 0.300 0
#>
#> Number of unique values per column per model:
#> $unique_values
#> model location location_name target_end_date target_type
#> 1: EuroCOVIDhub-ensemble 4 4 12 2
#> 2: EuroCOVIDhub-baseline 4 4 12 2
#> 3: epiforecasts-EpiNow2 4 4 12 2
#> 4: UMass-MechBayes 4 4 12 1
#> forecast_date horizon prediction true_value
#> 1: 11 3 13 2
#> 2: 11 3 20 2
#> 3: 11 3 13 2
#> 4: 11 3 10 2
#>
#> $messages
#> [1] "144 values for `true_value` are NA in the data provided and the corresponding rows were removed. This may indicate a problem if unexpected."
#> [2] "144 values for `prediction` are NA in the data provided and the corresponding rows were removed. This may indicate a problem if unexpected."
#>