Fill missing data in a data set to prepare it for use within the package
Source: R/preprocessing.R
fill_missing.Rd
This function ensures that all days between the first and last date in the
data are present. It adds an accumulate column that indicates whether
modelled observations should be accumulated onto a later data point. This is
useful for modelling data that is reported less frequently than daily, e.g.
weekly incidence data, as well as other reporting artifacts such as delayed
weekend reporting. The function can also be used to fill in missing
observations with zeros.
Arguments
- data
Data frame with a date column. The other columns depend on the model that the data are to be used with, e.g. estimate_infections() or estimate_secondary(). See the documentation there for the expected format. The data must not already have an accumulate column, otherwise the function will fail with an error.
- missing_dates
Character. Options are "ignore" (the default), "accumulate" and "zero". This determines how missing dates in the data are interpreted. If set to "ignore", any missing dates in the observation data will be interpreted as missing and skipped in the likelihood. If set to "accumulate", modelled observations on dates that are missing in the data will be accumulated and added to the next non-missing data point. This can be used to model incidence data that is reported less frequently than daily. In that case, the first data point is not included in the likelihood (unless initial_accumulate is set to a value greater than one) but used only to reset modelled observations to zero. If set to "zero", all observations on missing dates will be assumed to be of value 0.
- missing_obs
Character. How to process dates that exist in the data but have observations with NA values. The options available are the same ones as for the missing_dates argument.
- initial_accumulate
Integer. The number of initial dates to accumulate if missing_dates or missing_obs is set to "accumulate". This argument must be at least 1. If it is set to 1, no accumulation happens on the first data point. If it is greater than 1, dates are added to the beginning of the data set so that a sufficient number of modelled observations can be accumulated onto the first data point. For modelling weekly incidence data this should be set to 7. If accumulating and the first data point is not NA and this argument is not set, then that data point will be removed with a warning.
- obs_column
Character (default: "confirm"). If given, only the column specified here will be used for checking missingness. This is useful if using a data set that has multiple columns, one of which corresponds to the observations that are to be processed here.
- by
Character vector. Name(s) of any additional column(s) where data processing should be done separately for each value in the column. This is useful when using data representing e.g. multiple geographies. If NULL (default) no such grouping is done.
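To illustrate what completing the dates and flagging rows for accumulation means, here is a minimal base R sketch. It is not the package's implementation: the obs data frame is made-up weekly data, and the sketch deliberately ignores initial_accumulate, the special handling of the first data point, missing_obs and by. It only shows the idea of inserting every missing day and marking those days with accumulate = TRUE.

```r
# Hypothetical weekly observations (illustrative values, not package data)
obs <- data.frame(
  date = as.Date(c("2020-03-07", "2020-03-14", "2020-03-21")),
  confirm = c(10, 20, 30)
)

# Every day between the first and last date in the data
all_days <- data.frame(
  date = seq(min(obs$date), max(obs$date), by = "day")
)

# Left join: days without a report get confirm = NA
filled <- merge(all_days, obs, by = "date", all.x = TRUE)

# Assumption for this sketch: a day that was absent from the data is one
# whose modelled observation is accumulated onto the next reported day
filled$accumulate <- is.na(filled$confirm)

filled
```

In this sketch only the three reported dates end up with accumulate = FALSE; all inserted days carry accumulate = TRUE, which mirrors the pattern visible in the output of the package example in the Examples section.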
Value
A data.table with an accumulate column that indicates whether values are accumulated (see the documentation of the data argument in estimate_infections()).
Examples
cases <- data.table::copy(example_confirmed)
## calculate weekly sum
cases[, confirm := data.table::frollsum(confirm, 7)]
#> date confirm
#> <Date> <num>
#> 1: 2020-02-22 NA
#> 2: 2020-02-23 NA
#> 3: 2020-02-24 NA
#> 4: 2020-02-25 NA
#> 5: 2020-02-26 NA
#> ---
#> 126: 2020-06-26 1695
#> 127: 2020-06-27 1950
#> 128: 2020-06-28 1861
#> 129: 2020-06-29 1811
#> 130: 2020-06-30 1716
## limit to dates once a week
cases <- cases[seq(7, nrow(cases), 7)]
## set the second observation to missing
cases[2, confirm := NA]
#> date confirm
#> <Date> <num>
#> 1: 2020-02-28 647
#> 2: 2020-03-06 NA
#> 3: 2020-03-13 11255
#> 4: 2020-03-20 25922
#> 5: 2020-03-27 39504
#> 6: 2020-04-03 34703
#> 7: 2020-04-10 28384
#> 8: 2020-04-17 25315
#> 9: 2020-04-24 21032
#> 10: 2020-05-01 15490
#> 11: 2020-05-08 10395
#> 12: 2020-05-15 7238
#> 13: 2020-05-22 4910
#> 14: 2020-05-29 3726
#> 15: 2020-06-05 2281
#> 16: 2020-06-12 2129
#> 17: 2020-06-19 2017
#> 18: 2020-06-26 1695
## fill missing data
fill_missing(cases, missing_dates = "accumulate", initial_accumulate = 7)
#> Key: <date>
#> date confirm accumulate
#> <Date> <num> <lgcl>
#> 1: 2020-02-22 NA TRUE
#> 2: 2020-02-23 NA TRUE
#> 3: 2020-02-24 NA TRUE
#> 4: 2020-02-25 NA TRUE
#> 5: 2020-02-26 NA TRUE
#> ---
#> 122: 2020-06-22 NA TRUE
#> 123: 2020-06-23 NA TRUE
#> 124: 2020-06-24 NA TRUE
#> 125: 2020-06-25 NA TRUE
#> 126: 2020-06-26 1695 FALSE