These functions validate, clean, and convert raw user-supplied data structures
(locations, observations, and populations) into the canonical forms required
by the [sampling()] sampler and the underlying Stan models.
Usage
canonicalize_locations(locations)
canonicalize_observations(observations, drop_extra = TRUE)
canonicalize_populations(
populations,
observations,
locations,
max_cohort,
max_age,
max_dose = 2L
)
Arguments
- locations
a [data.frame()], with columns loc_id and parent_id, of the same type. See Details for restrictions.
- observations
a [data.frame()], the observed data, with at least three columns:
an obs_id column; any type, as long as unique, non-NA
a positive column; non-negative integers, the observed number of vaccinated individuals
a sample_n column; positive integers, the number of individuals sampled, must be greater than or equal to "positive"
optionally, a censored column; numeric, NA (uncensored) or 1 (right-censored); if not present, will be assumed NA
- drop_extra
a logical scalar; drop extraneous columns? (default: yes)
- populations
a [data.frame()], the observation meta-data, with columns:
obs_id, any type; the observation the row concerns (i.e. id shared with an observations data object)
loc_id, any type; the location the row concerns (i.e. id shared with a locations data object)
dose, a positive integer (1, 2, ...); the dose the row concerns
cohort, a positive integer; the cohort at the location the row concerns
age, a positive integer; the age of that cohort the row concerns
weight, a numeric in (0, 1]; the relative contribution of this row to an observation. Optional if each population row has a unique obs_id.
- max_cohort
if present, what is the maximum cohort that should be present?
- max_age
if present, what is the maximum age that should be present?
- max_dose
maximum dose number to allow (default: 2L)
Value
canonicalize_locations returns a data.table, with:
loc_id, parent_id columns as originally supplied, possibly reordered
loc_c_id, loc_cp_id columns, canonicalized id/parent_id columns, representing the order that will be used in the sampler
layer column, an integer from 1 (root), 2 (root children), 3 (grandchildren), etc.
layer_bound column, an integer starting from 1 by layer. This provides index slice information used in the Stan model.
canonicalize_observations returns a canonical observation object,
a [data.table()] with:
an obs_c_id column, an integer sequence from 1; the order observations will be passed to estimation
the original obs_id column, possibly reordered
positive and sample_n columns, possibly reordered
a "censored" column; all NA, if not present in original observations argument
canonicalize_populations returns a canonical populations object,
mirroring the input populations,
with the following updates:
obs_c_id, the observation id the row concerns, canonicalized to match the canonical observation ids
loc_c_id, the location id the row concerns, canonicalized to match the canonical location ids
reordered to obs_c_id order
Details
The imuGAP hierarchical modeling framework requires data structures to adhere to
specific relational and format constraints. The three canonicalize functions
process and validate these inputs as described below:
Locations (canonicalize_locations)
The [sampling()] sampler works on a hierarchical model of locations,
and must be provided that structure. This method checks location structure
validity, and returns a canonical version including the layer membership.
A valid structure has:
a unique root,
no cycles, and
no duplicate loc_ids
Users may explicitly identify the root loc_id by providing a row with
parent_id equal to NA. Otherwise, any parent_id that does not appear
in loc_id is treated as the root.
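The validity rules above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's implementation: check_location_tree is a hypothetical helper name, and the error messages are invented for the example.

```r
# Hypothetical helper (not part of the package): checks the three validity
# rules on a data.frame with loc_id and parent_id columns.
check_location_tree <- function(locations) {
  ids <- locations$loc_id
  parents <- locations$parent_id

  # Rule: no duplicate loc_ids
  if (anyDuplicated(ids) > 0L) stop("duplicate loc_id values")

  # Rule: a unique root. Explicit roots have an NA parent_id; implicit roots
  # are parent_ids that never appear as a loc_id.
  explicit <- ids[is.na(parents)]
  implicit <- setdiff(parents[!is.na(parents)], ids)
  roots <- unique(c(explicit, implicit))
  if (length(roots) != 1L) {
    stop("expected exactly one root, found ", length(roots))
  }

  # Rule: no cycles. Walk outward from the root one layer at a time;
  # any loc_id that is never reached must sit on a cycle.
  reached <- roots
  frontier <- roots
  while (length(frontier) > 0L) {
    frontier <- setdiff(ids[parents %in% frontier], reached)
    reached <- c(reached, frontier)
  }
  if (!all(ids %in% reached)) stop("cycle detected among loc_id values")

  roots
}

# A small tree with an explicit root row
toy <- data.frame(
  loc_id = c("State", "A", "B", "A1"),
  parent_id = c(NA, "State", "State", "A"),
  stringsAsFactors = FALSE
)
check_location_tree(toy)
# The same tree with the root row dropped, so the root is implicit
check_location_tree(toy[-1, ])
```

Both calls identify "State" as the root; the second relies on the implicit-root rule, since "State" then appears only as a parent_id.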
If the input is valid, this method will create the canonicalized version.
In that version, all ids run from 1:N, where N is the number of distinct
ids. That order is determined by layer order, then position of parent
within its layer, then "natural" order (i.e., whatever base R sort()
yields).
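As a rough illustration of that ordering rule (layer, then parent position within its layer, then sort() order), the sketch below assigns canonical ids with a breadth-first walk. It assumes an already valid structure with an explicit root row; canonical_order_sketch is a hypothetical name, and the real function may be implemented quite differently.

```r
# Hypothetical sketch (not the package's code): derive layer, loc_c_id and
# loc_cp_id for a valid location table that has an explicit root row.
canonical_order_sketch <- function(locations) {
  ids <- locations$loc_id
  parents <- locations$parent_id

  ordered <- ids[is.na(parents)]   # the root comes first
  layers <- 1L
  frontier <- ordered
  layer <- 1L

  while (length(frontier) > 0L) {
    layer <- layer + 1L
    nxt <- ids[0]                  # empty vector of the same type as ids
    # Visit parents in their canonical order; sort children within a parent
    for (p in frontier) {
      nxt <- c(nxt, sort(ids[!is.na(parents) & parents == p]))
    }
    ordered <- c(ordered, nxt)
    layers <- c(layers, rep(layer, length(nxt)))
    frontier <- nxt
  }

  data.frame(
    loc_id = ordered,
    layer = layers,
    loc_c_id = seq_along(ordered),
    # canonical id of each row's parent; NA for the root
    loc_cp_id = match(parents[match(ordered, ids)], ordered),
    stringsAsFactors = FALSE
  )
}

shuffled <- data.frame(
  loc_id = c("B1", "A2", "State", "B", "A1", "A"),
  parent_id = c("B", "A", NA, "State", "A", "State"),
  stringsAsFactors = FALSE
)
canonical_order_sketch(shuffled)
```

For this toy input the rows come back in the order State, A, B, A1, A2, B1, regardless of the order in which they were supplied.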
Observations (canonicalize_observations)
The observations object documents observations used to fit the
model. Conceptually, each row represents an observation of vaccination status
within a population. That population need not be uniform
(see [canonicalize_populations()]), nor concern a single cohort or time:
each observation should generally be at the best resolution available, and that
resolution can vary across rows. The sampler uses information
about the resolutions to determine how to compare the latent
process model to those different observations.
For the optional censored column: the model supports vaccination status
indicators which are vaccine specific as well as those which represent an
individual having all of a set of vaccines (including the target vaccine).
The specific coverage for the target vaccine is right-censored in the latter
case: full-set-coverage is the minimum coverage for the target.
When at least some of the data are censored, you must supply the censored
column to correctly estimate coverage. Mark any uncensored observations with
NA, and any right-censored observations with 1. Note that 0 is not a
valid value at this time; we are reserving that for potential future support
of left-censoring.
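One way to prepare such a column before calling canonicalize_observations is sketched below. The full_series flag is a hypothetical column standing in for however your data records that an indicator covers a whole set of vaccines; the checks simply restate the documented constraints and are not the package's own validation code.

```r
# Toy observations; full_series is a hypothetical flag marking rows whose
# indicator means "has all of a set of vaccines" rather than the target only.
obs <- data.frame(
  obs_id = 1:4,
  positive = c(16L, 14L, 204L, 198L),
  sample_n = c(19L, 20L, 230L, 215L),
  full_series = c(FALSE, FALSE, TRUE, TRUE)
)

# NA marks uncensored rows, 1 marks right-censored rows; 0 is never used
obs$censored <- ifelse(obs$full_series, 1, NA_real_)

# Restate the documented constraints as assertions
stopifnot(
  !anyNA(obs$obs_id),
  anyDuplicated(obs$obs_id) == 0L,
  all(obs$positive >= 0L),
  all(obs$sample_n > 0L),
  all(obs$positive <= obs$sample_n),
  all(is.na(obs$censored) | obs$censored == 1)
)
```

Keeping the flag as a separate column and deriving censored from it avoids accidentally writing 0 for uncensored rows.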
Populations (canonicalize_populations)
This method validates the meta-data associated with the observations, and converts that meta-data to use the canonical id formats.
Regarding "cohorts" and "ages": these are counted from 1, by 1 "unit". You can imagine the units are whatever resolution is appropriate for your data: months, quarters, years, etc. As long as these are used consistently, estimation will work, and take on the unit meaning you used for input.
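To make the row structure concrete, here is a hedged sketch of building the population rows for a single observation that pools several ages. expand_population is a hypothetical helper, and the equal weights are an assumption made for the illustration; your data may call for weights proportional to group size instead.

```r
# Hypothetical helper (not part of the package): one population row per age
# covered by an observation, with equal weights that sum to 1.
expand_population <- function(obs_id, loc_id, ages, newest_cohort, dose) {
  n <- length(ages)
  data.frame(
    obs_id = obs_id,
    loc_id = loc_id,
    # the youngest age belongs to the newest cohort; each step up in age
    # moves one cohort back
    cohort = newest_cohort - (ages - min(ages)),
    age = ages,
    dose = dose,
    weight = rep(1 / n, n),
    stringsAsFactors = FALSE
  )
}

# An observation of ages 14 to 18 at the State level, second dose
pop <- expand_population(698L, "State", ages = 14:18, newest_cohort = 19L, dose = 2L)
pop
```

With these inputs the helper yields five rows, cohorts 19 down to 15 paired with ages 14 up to 18, each with weight 0.2. Cohort and age units are whatever you chose for your data, as long as both columns use the same unit.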
Examples
# --- canonicalize_locations ---
data("locations_sim")
locations_sim
#> loc_id parent_id
#> <char> <char>
#> 1: State <NA>
#> 2: Scruggs State
#> 3: Simone State
#> 4: Watson State
#> 5: Chickadee Elementary Scruggs
#> 6: Nuthatch Academy Scruggs
#> 7: Blue Heron School Scruggs
#> 8: Flycatcher Elementary Scruggs
#> 9: Bluebird Learning Center Scruggs
#> 10: Catbird Academy Scruggs
#> 11: Finch Elementary Scruggs
#> 12: Sparrow School Scruggs
#> 13: Towhee Children's Academy Scruggs
#> 14: Warbler Elementary Scruggs
#> 15: Egret Elementary Simone
#> 16: Cardinal Academy Simone
#> 17: Bunting School Simone
#> 18: Tanager Academy Simone
#> 19: Oriole Youth Academy Simone
#> 20: Grosbeak Learning Center Simone
#> 21: Junco Elementary Simone
#> 22: Meadowlark School Watson
#> 23: Goldfinch Elementary Watson
#> 24: Mockingbird Academy Watson
#> 25: Kinglet Learning Center Watson
#> 26: Vireo School Watson
#> 27: Kingfisher Academy Watson
#> 28: Cormorant Elementary Watson
#> loc_id parent_id
#> <char> <char>
canonicalize_locations(locations_sim)
#> Key: <layer, parent_id, loc_id>
#> loc_id parent_id layer loc_c_id loc_cp_id layer_bound
#> <char> <char> <int> <int> <int> <int>
#> 1: State <NA> 1 1 NA 1
#> 2: Scruggs State 2 2 1 1
#> 3: Simone State 2 3 1 1
#> 4: Watson State 2 4 1 1
#> 5: Blue Heron School Scruggs 3 5 2 1
#> 6: Bluebird Learning Center Scruggs 3 6 2 1
#> 7: Catbird Academy Scruggs 3 7 2 1
#> 8: Chickadee Elementary Scruggs 3 8 2 1
#> 9: Finch Elementary Scruggs 3 9 2 1
#> 10: Flycatcher Elementary Scruggs 3 10 2 1
#> 11: Nuthatch Academy Scruggs 3 11 2 1
#> 12: Sparrow School Scruggs 3 12 2 1
#> 13: Towhee Children's Academy Scruggs 3 13 2 1
#> 14: Warbler Elementary Scruggs 3 14 2 1
#> 15: Bunting School Simone 3 15 3 11
#> 16: Cardinal Academy Simone 3 16 3 11
#> 17: Egret Elementary Simone 3 17 3 11
#> 18: Grosbeak Learning Center Simone 3 18 3 11
#> 19: Junco Elementary Simone 3 19 3 11
#> 20: Oriole Youth Academy Simone 3 20 3 11
#> 21: Tanager Academy Simone 3 21 3 11
#> 22: Cormorant Elementary Watson 3 22 4 18
#> 23: Goldfinch Elementary Watson 3 23 4 18
#> 24: Kingfisher Academy Watson 3 24 4 18
#> 25: Kinglet Learning Center Watson 3 25 4 18
#> 26: Meadowlark School Watson 3 26 4 18
#> 27: Mockingbird Academy Watson 3 27 4 18
#> 28: Vireo School Watson 3 28 4 18
#> loc_id parent_id layer loc_c_id loc_cp_id layer_bound
#> <char> <char> <int> <int> <int> <int>
# can also be provided in non-canonical order, and with an implicit root
weird_locations <- subset(locations_sim, !is.na(parent_id))[
sample(nrow(locations_sim) - 1L)
]
canonicalize_locations(weird_locations)
#> Key: <layer, parent_id, loc_id>
#> loc_id parent_id layer loc_c_id loc_cp_id layer_bound
#> <char> <char> <int> <int> <int> <int>
#> 1: State <NA> 1 1 NA 1
#> 2: Scruggs State 2 2 1 1
#> 3: Simone State 2 3 1 1
#> 4: Watson State 2 4 1 1
#> 5: Blue Heron School Scruggs 3 5 2 1
#> 6: Bluebird Learning Center Scruggs 3 6 2 1
#> 7: Catbird Academy Scruggs 3 7 2 1
#> 8: Chickadee Elementary Scruggs 3 8 2 1
#> 9: Finch Elementary Scruggs 3 9 2 1
#> 10: Flycatcher Elementary Scruggs 3 10 2 1
#> 11: Nuthatch Academy Scruggs 3 11 2 1
#> 12: Sparrow School Scruggs 3 12 2 1
#> 13: Towhee Children's Academy Scruggs 3 13 2 1
#> 14: Warbler Elementary Scruggs 3 14 2 1
#> 15: Bunting School Simone 3 15 3 11
#> 16: Cardinal Academy Simone 3 16 3 11
#> 17: Egret Elementary Simone 3 17 3 11
#> 18: Grosbeak Learning Center Simone 3 18 3 11
#> 19: Junco Elementary Simone 3 19 3 11
#> 20: Oriole Youth Academy Simone 3 20 3 11
#> 21: Tanager Academy Simone 3 21 3 11
#> 22: Cormorant Elementary Watson 3 22 4 18
#> 23: Goldfinch Elementary Watson 3 23 4 18
#> 24: Kingfisher Academy Watson 3 24 4 18
#> 25: Kinglet Learning Center Watson 3 25 4 18
#> 26: Meadowlark School Watson 3 26 4 18
#> 27: Mockingbird Academy Watson 3 27 4 18
#> 28: Vireo School Watson 3 28 4 18
#> loc_id parent_id layer loc_c_id loc_cp_id layer_bound
#> <char> <char> <int> <int> <int> <int>
# --- canonicalize_observations ---
data("observations_sim")
observations_sim
#> loc_id parent_id year enc_unit_id unit_id positive sample_n
#> <char> <char> <num> <num> <num> <num> <num>
#> 1: Chickadee Elementary Scruggs 2001 2 5 16 19
#> 2: Chickadee Elementary Scruggs 2002 2 5 14 20
#> 3: Chickadee Elementary Scruggs 2003 2 5 14 16
#> 4: Chickadee Elementary Scruggs 2004 2 5 10 13
#> 5: Chickadee Elementary Scruggs 2005 2 5 8 13
#> ---
#> 694: State <NA> 2021 NA 1 204 230
#> 695: State <NA> 2022 NA 1 198 215
#> 696: State <NA> 2023 NA 1 305 340
#> 697: State <NA> 2024 NA 1 327 345
#> 698: State <NA> 2025 NA 1 284 310
#> ly_min ly_max dose weight vaxview_type age censored cohort_min
#> <num> <num> <num> <num> <char> <char> <num> <num>
#> 1: 5 5 2 1.0 <NA> <NA> NA 4
#> 2: 5 5 2 1.0 <NA> <NA> NA 5
#> 3: 5 5 2 1.0 <NA> <NA> NA 6
#> 4: 5 5 2 1.0 <NA> <NA> NA 7
#> 5: 5 5 2 1.0 <NA> <NA> NA 8
#> ---
#> 694: 14 18 2 0.2 teen <NA> NA 11
#> 695: 14 18 2 0.2 teen <NA> NA 12
#> 696: 14 18 2 0.2 teen <NA> NA 13
#> 697: 14 18 2 0.2 teen <NA> NA 14
#> 698: 14 18 2 0.2 teen <NA> NA 15
#> cohort_max obs_id
#> <num> <int>
#> 1: 4 1
#> 2: 5 2
#> 3: 6 3
#> 4: 7 4
#> 5: 8 5
#> ---
#> 694: 15 694
#> 695: 16 695
#> 696: 17 696
#> 697: 18 697
#> 698: 19 698
canonicalize_observations(observations_sim)
#> Key: <censored, obs_id>
#> obs_c_id positive sample_n censored obs_id
#> <int> <int> <int> <num> <int>
#> 1: 1 16 19 NA 1
#> 2: 2 14 20 NA 2
#> 3: 3 14 16 NA 3
#> 4: 4 10 13 NA 4
#> 5: 5 8 13 NA 5
#> ---
#> 694: 694 340 385 1 656
#> 695: 695 268 292 1 657
#> 696: 696 289 325 1 658
#> 697: 697 330 374 1 659
#> 698: 698 250 301 1 660
# --- canonicalize_populations ---
data("populations_sim"); data("locations_sim"); data("observations_sim")
populations_sim
#> obs_id loc_id cohort age dose weight
#> <num> <char> <num> <num> <num> <num>
#> 1: 1 Chickadee Elementary 4 5 2 1.0
#> 2: 2 Chickadee Elementary 5 5 2 1.0
#> 3: 3 Chickadee Elementary 6 5 2 1.0
#> 4: 4 Chickadee Elementary 7 5 2 1.0
#> 5: 5 Chickadee Elementary 8 5 2 1.0
#> ---
#> 746: 698 State 19 14 2 0.2
#> 747: 698 State 18 15 2 0.2
#> 748: 698 State 17 16 2 0.2
#> 749: 698 State 16 17 2 0.2
#> 750: 698 State 15 18 2 0.2
canonicalize_populations(populations_sim, observations_sim, locations_sim)
#> Key: <obs_c_id, loc_c_id, cohort, age, dose>
#> obs_id loc_id cohort age dose weight obs_c_id loc_c_id
#> <num> <char> <int> <int> <num> <num> <int> <int>
#> 1: 1 Chickadee Elementary 4 5 2 1 1 8
#> 2: 2 Chickadee Elementary 5 5 2 1 2 8
#> 3: 3 Chickadee Elementary 6 5 2 1 3 8
#> 4: 4 Chickadee Elementary 7 5 2 1 4 8
#> 5: 5 Chickadee Elementary 8 5 2 1 5 8
#> ---
#> 746: 656 State 26 3 1 1 694 1
#> 747: 657 State 27 3 1 1 695 1
#> 748: 658 State 28 3 1 1 696 1
#> 749: 659 State 29 3 1 1 697 1
#> 750: 660 State 30 3 1 1 698 1
#> range_start
#> <int>
#> 1: 1
#> 2: 2
#> 3: 3
#> 4: 4
#> 5: 5
#> ---
#> 746: 746
#> 747: 747
#> 748: 748
#> 749: 749
#> 750: 750
