Overview
SeverityEstimate fits a Bayesian model for the
symptomatic infection rate (SIR) and infection fatality ratio (IFR)
using case line list data observed through active and passive
surveillance. The package is based on Lessler et al. (2016) doi:10.1093/aje/kwv452,
and the supplementary material distributed with that paper gives the
core likelihood derivations.
This vignette is about the model itself rather than the workflow. It has two goals:
- Restate the key probability model in the notation used by the package.
- Call out the implementation details that matter in
inst/stan/estimate_severity.stan, especially strata handling and the latent hazard process.
For a package walkthrough, see
vignette("getting-started").
old <- options(mc.cores = 1L)
on.exit(options(old))
library(SeverityEstimate)
#> Loading required package: checkmate
#> Loading required package: rstan
#> Loading required package: StanHeaders
#>
#> rstan version 2.32.7 (Stan version 2.32.2)
#> For execution on a local, multicore CPU with excess RAM we recommend calling
#> options(mc.cores = parallel::detectCores()).
#> To avoid recompilation of unchanged Stan programs, we recommend calling
#> rstan_options(auto_write = TRUE)
#> For within-chain threading using `reduce_sum()` or `map_rect()` Stan functions,
#> change `threads_per_chain` option:
#> rstan_options(threads_per_chain = 1)Notation
The Stan program indexes time by and strata cells by .
For each stratum , the model estimates:
-
xi[g]or : The symptomatic infection rate (SIR), that is the probability an infection is symptomatic. -
mortality[g]or : The infection fatality ratio (IFR), that is the probability an infection dies.
The surveillance parameters are shared across strata:
-
active_detectionor : Probability an infection is detected by active surveillance. -
passive_asymptomatic_detectionor : Probability an asymptomatic infection is eventually detected by passive surveillance. -
passive_symptomatic_detectionor : Probability a symptomatic infection is eventually detected by passive surveillance.
Occurrence of Symptoms Among Confirmed Infections
As in the paper, this model assumes that the SIR/IFR detected through active surveillance is the same rate as all infections, regardless of detection status:
Then for convenience, the stan model defines an intermediate quantity as:
Then, to account for the fact that the vast majority of cases detected through passive surveillance will be more severe cases, the SIR/IFR conditional on being detected through passive surveillance is given by:
The presentation of individual cases, by severity, stratum, and surveillance method can be estimated via a bernoulli likelihood with the probabilities above.
Latent Infections and Observed Incidence
-
population[g]or : The total number of persons in stratum which is used to initilize the model populations. -
S[t, g]or : The number of susceptibles at time in stratum . -
C[t, g]or : The number of latent infections at time in stratum . -
logit_hzd[t, g]or : The force of infection at time in stratum on a logit scale.
The susceptible population, latent infections, and force of infection at are intialized via:
and for ,
Observed counts are linked to these latent infections by Poisson observation models:
Strata Effects
The paper formulation is age-specific, but the package generalizes this to an arbitrary cross-product of strata variables. The key change is that the Stan model does not estimate a completely free parameter for every strata cell. Instead, it builds additive linear predictors and then transforms them via a logistic link function:
Where X_strata or
is a design matrix.
Unsmoothed strata
If a strata variable is declared with
degrees_of_freedom = 0L, the package uses a sum-to-zero
categorical contrast basis with K - 1 columns for
K levels. This gives a standard categorical effect while
keeping the intercept interpretable as a grand mean on the logit
scale.
Ordered, smoothed strata
If degrees_of_freedom > 0L, the strata levels must be
supplied explicitly and are treated as ordered. The package then uses
orthogonal polynomial basis terms via stats::poly() and
standardizes the resulting columns before passing them to Stan.
If there are K ordered levels and d
requested degrees of freedom, the basis has d columns with
d <= K - 2. For example, a five-level age variable with
degrees_of_freedom = 2L produces two basis columns
corresponding roughly to a linear and quadratic trend across the ordered
levels.
Priors and generated quantities
The remaining priors are straightforward:
- and are drawn from a prior.
- The strata coefficients and also have priors when any strata basis columns are present.
- The three detection probabilities have user-configurable Beta priors which default to if not explicitly set by the user.
The generated-quantities block simulates additional unobserved cases through active and passive surveillance. Those draws are useful when downstream code wants posterior samples of total infections rather than only the observed case counts.
Compact worked example
The code below mirrors the workflow from
vignette("getting-started"), but the main point here is how
the fitted object maps back to the model quantities just described. This
example also includes an additional health_occupation
strata to highlight the ability to fit unordered strata as well.
To encode extra infection risk for health care professionals, the
synthetic data uses a higher force of infection for the
health_occupation = "yes" strata. That change belongs in
the hazard process, not in sir, because sir
controls symptom risk conditional on infection rather than exposure
risk.
strata <- do.call(
rbind,
lapply(
list(
list(
age = "youth",
health_occupation = "no",
ifr = 0.25,
sir = 0.40,
population = 10000L
),
list(
age = "adult",
health_occupation = "no",
ifr = 0.30,
sir = 0.55,
population = 20000L
),
list(
age = "senior",
health_occupation = "no",
ifr = 0.35,
sir = 0.70,
population = 5000L
),
list(
age = "youth",
health_occupation = "yes",
ifr = 0.25,
sir = 0.50,
population = 0L
),
list(
age = "adult",
health_occupation = "yes",
ifr = 0.30,
sir = 0.65,
population = 2000L
),
list(
age = "senior",
health_occupation = "yes",
ifr = 0.35,
sir = 0.80,
population = 250L
)
),
as.data.frame
)
)
times <- seq(as.Date("2024-01-01"), as.Date("2024-03-31"), "+7 days")
baseline_force_of_infection <- ifelse(
strata$health_occupation == "yes",
0.010,
0.005
)
force_of_infection <- matrix(
data = rep(baseline_force_of_infection, each = length(times)),
nrow = length(times),
ncol = nrow(strata)
)
linelist <- create_sample_linelist(
strata,
times,
0.15,
0.05,
0.95,
force_of_infection = force_of_infection,
seed = 123L
)
population <- strata[, c("age", "health_occupation", "population")]
model <- SeverityEstimateModel(linelist, population) |>
set_active_prior(alpha = 1.0, beta = 1.0) |>
set_passive_asymptomatic_prior(alpha = 1.0, beta = 3.0) |>
set_passive_symptomatic_prior(alpha = 3.0, beta = 1.0) |>
set_strata(
"age",
levels = c("youth", "adult", "senior"),
degrees_of_freedom = 1L
) |>
set_strata("health_occupation") |>
set_timesteps("time") |>
set_detection(
"detection",
map = c("Active" = "active", "Passive" = "passive")
) |>
set_outcome(
"outcome",
map = c(
"Asymptomatic" = "asymptomatic",
"Symptomatic" = "symptomatic",
"Death" = "severe"
)
)
model
#> Severity Estimate Model:
#>
#> Data:
#> dataset rows columns
#> line_list 1548 6
#> population 6 3
#>
#> Detection Probability Priors:
#> active prior: beta(1.0, 1.0)
#> passive_asymptomatic prior: beta(1.0, 3.0)
#> passive_symptomatic prior: beta(3.0, 1.0)
#>
#> Timesteps:
#> time: 2024-01-01 to 2024-03-25 (13 timesteps)
#>
#> Detection:
#> column: detection
#> active: 422 cases (values: Active)
#> passive: 1126 cases (values: Passive)
#>
#> Outcome:
#> column: outcome
#> asymptomatic: 224 cases (values: Asymptomatic)
#> symptomatic: 705 cases (values: Symptomatic)
#> severe: 619 cases (values: Death)
#>
#> Strata:
#> age: 3 levels, df = 1 (youth, adult, senior)
#> health_occupation: 2 levels, df = 0 (no, yes)
estimate <- fit(
model,
chains = 1L,
iter = 250L,
seed = 123L,
refresh = 0
)
#> Warning: The largest R-hat is NA, indicating chains have not mixed.
#> Running the chains for more iterations may help. See
#> https://mc-stan.org/misc/warnings.html#r-hat
#> Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
#> Running the chains for more iterations may help. See
#> https://mc-stan.org/misc/warnings.html#bulk-ess
#> Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
#> Running the chains for more iterations may help. See
#> https://mc-stan.org/misc/warnings.html#tail-esscalculate_parameter_estimates() summarizes the three
surveillance detection parameters, while
calculate_fatality_ratio() summarizes
mortality[g] as IFR and xi[g] as SIR for each
stratum.
calculate_parameter_estimates(estimate, alpha = 0.05)
#> parameter parameter_description
#> 1 active_detection active detection rate
#> 2 passive_asymptomatic_detection mildly/asymptomatic passive detection rate
#> 3 passive_symptomatic_detection severe symptoms passive detection rate
#> mean_estimate median_estimate lower_05 upper_05
#> 1 0.16855577 0.16852061 0.14415790 0.19173526
#> 2 0.06160089 0.06066584 0.04187331 0.08682612
#> 3 0.93367191 0.94876350 0.78580276 0.99976938
calculate_fatality_ratio(
estimate,
median_estimate = TRUE,
mean_estimate = FALSE,
naive_estimate = TRUE
)
#> age health_occupation ifr_median_estimate ifr_lower_05 ifr_upper_05
#> 1 youth no 0.2433681 0.2076605 0.2882321
#> 2 adult no 0.2980740 0.2753510 0.3290826
#> 3 adult yes 0.3067657 0.2469669 0.3662032
#> 4 senior no 0.3607254 0.3124203 0.4029461
#> 5 senior yes 0.3700131 0.2967118 0.4532138
#> sir_median_estimate sir_lower_05 sir_upper_05 naive_ifr naive_sir
#> 1 0.3808350 0.3123373 0.4404289 0.3890675 0.7749196
#> 2 0.5444398 0.5014853 0.5975940 0.3928571 0.8418367
#> 3 0.7163066 0.5862772 0.8051473 0.3926702 0.9214660
#> 4 0.7022471 0.6459089 0.7778427 0.4576271 0.9406780
#> 5 0.8338584 0.7252161 0.8999805 0.2692308 0.9615385For model-diagnostics work, calculate_hazard() exposes
the latent infection hazard on the probability scale for each
time-period and strata cell. This is the direct summary of
inv_logit(logit_hzd[t, g]), so it is the most useful
high-level extractor when you want to inspect whether the fitted
exposure process is behaving sensibly.
hazard_summary <- calculate_hazard(
estimate,
median_estimate = TRUE,
mean_estimate = FALSE,
alpha = 0.05
)
head(hazard_summary, 8L)
#> time age health_occupation median_estimate lower_05 upper_05
#> 1 2024-01-01 youth no 4.397768e-03 2.995876e-03 6.845800e-03
#> 2 2024-01-01 adult no 5.177858e-03 4.027286e-03 6.760096e-03
#> 3 2024-01-01 adult yes 1.223843e-02 7.412754e-03 1.868032e-02
#> 4 2024-01-01 senior no 6.943112e-03 5.033608e-03 1.025725e-02
#> 5 2024-01-01 senior yes 3.355682e-14 2.112869e-16 8.676132e-12
#> 6 2024-01-08 youth no 5.799071e-03 3.638674e-03 8.191569e-03
#> 7 2024-01-08 adult no 4.794799e-03 3.876673e-03 6.364966e-03
#> 8 2024-01-08 adult yes 8.997502e-03 5.405247e-03 1.422111e-02The returned data.frame is long-form, with one row per
time_period and strata combination. In
practice, this is the easiest way to identify cells where the fitted
hazard is implausibly high, implausibly low, or moving differently than
the observed passive-incidence signal would suggest.
What to keep in mind when reading fitted results
- IFR and SIR are stratum-specific but are linked through shared detection parameters.
- Passive cases are not sampled from the same outcome distribution as active cases, they are outcome biased..
- The current strata formulation is additive across strata dimensions unless the user constructs interaction strata explicitly.
- The current hazard prior is a per-cell normal prior on the logit hazard.
Those details are the main bridge between the original paper supplement and the package’s current Stan implementation.