Use prognostic covariate adjustment when fitting an rctglm
Source:R/rctglm_with_prognosticscore.R
rctglm_with_prognosticscore.Rd
The procedure uses fit_best_learner to fit a prognostic model to historical data and uses the model to produce counterfactual predictions as a prognostic score that is then adjusted for as a covariate in the rctglm procedure. See Powering RCTs for marginal effects with GLMs using prognostic score adjustment by Højbjerre-Frandsen et. al (2025) for more details.
Usage
rctglm_with_prognosticscore(
formula,
exposure_indicator,
exposure_prob,
data,
family = gaussian,
estimand_fun = "ate",
estimand_fun_deriv0 = NULL,
estimand_fun_deriv1 = NULL,
cv_variance = FALSE,
cv_variance_folds = 10,
...,
data_hist,
prog_formula = NULL,
cv_prog_folds = 5,
learners = default_learners(),
verbose = options::opt("verbose")
)
Arguments
- formula
an object of class "formula" (or one that can be coerced to that class): a symbolic description of the model to be fitted. The details of model specification are given under ‘Details’ in the glm documentation.
- exposure_indicator
(name of) the binary variable in
data
that identifies randomisation groups. The variable is required to be binary to make the "orientation" of theestimand_fun
clear.- exposure_prob
a
numeric
with the probability of being in "group 1" (rather than group 0) in groups defined byexposure_indicator
.- data
an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which the function is called.
- family
a description of the error distribution and link function to be used in the model. For
glm
this can be a character string naming a family function, a family function or the result of a call to a family function. Forglm.fit
only the third option is supported. (Seefamily
for details of family functions.)- estimand_fun
a
function
with argumentspsi1
andpsi0
specifying the estimand. Alternative, specify "ate" or "rate_ratio" as acharacter
to use one of the default estimand functions. See more details in the "Estimand" section of rctglm.- estimand_fun_deriv0
a
function
specifying the derivative ofestimand_fun
wrt.psi0
. As a default the algorithm will use symbolic differentiation to automatically find the derivative fromestimand_fun
- estimand_fun_deriv1
a
function
specifying the derivative ofestimand_fun
wrt.psi1
. As a default the algorithm will use symbolic differentiation to automatically find the derivative fromestimand_fun
- cv_variance
a
logical
determining whether to estimate the variance using cross-validation (see details of rctglm).- cv_variance_folds
a
numeric
with the number of folds to use for cross validation ifcv_variance
isTRUE
.- ...
Additional arguments passed to
stats::glm()
- data_hist
a
data.frame
with historical data on which to fit a prognostic model- prog_formula
an object of class "formula" (or one that can be coerced to that class): a symbolic description of the prognostic model to be fitted to
data_hist
. Passed in a list as thepreproc
argument infit_best_learner()
. As a default, the formula is created by modelling the response (assumed to have the same name as informula
) using all columns indata_hist
.- cv_prog_folds
a
numeric
with the number of cross-validation folds used when fitting and evaluating models- learners
a
list
(preferably named) containing named lists of elementsmodel
and optionallygrid
. Themodel
element should be aparsnip
model specification, which is passed to workflowsets::workflow_set as themodel
argument, while thegrid
element is passed as thegrid
argument of workflowsets::option_add- verbose
numeric
verbosity level. Higher values means more information is printed in console. A value of 0 means nothing is printed to console during execution (Defaults to2
, overwritable using option 'postcard.verbose' or environment variable 'R_POSTCARD_VERBOSE')
Value
rctglm_with_prognosticscore
returns an object of class rctglm_prog
,
which inherits from rctglm.
An rctglm_prog
object is a list with the same components as an rctglm object
(see the Value
section of rctglm for a breakdown of the structure),
but with an additional list element of:
prognostic_info
: List with information about the fitted prognostic model on historical data. It has components:formula
: Theformula
with symbolic description of how the response is modelled as function of covariates in the modelsmodel_fit
: A trainedworkflow
- the result of fit_best_learnerlearners
: Alist
of learners used for the discrete super learnercv_folds
: The amount of folds used for cross validationdata
: The historical data used for cross validation when fitting and testing models
Details
Prognostic covariate adjustment involves training a prognostic model (using
fit_best_learner) on historical data (data_hist
) to predict the response
in that data.
Assuming that the
historical data is representative of the comparator group in a “new” data
set (group 0 of the binary exposure_indicator
in data
), we can use the
prognostic model to predict the counterfactual
outcome of all observations (including the ones in the comparator group
for which the prediction of counterfactual outcome coincides with a
prediction of actual outcome).
This prediction, which is called the prognostic score, is then used as an
adjustment covariate in the GLM (the prognostic score is added to formula
before calling rctglm with the modified formula).
See much more details in the reference in the description.
See also
Method to extract information of the prognostic model in prog. Function
used to fit the prognostic model is fit_best_learner()
.
See rctglm()
for the function and class this inherits from.
Examples
# Generate some data
n <- 100
b0 <- 1
b1 <- 1.5
b2 <- 2
W1 <- runif(n, min = -2, max = 2)
exp_prob <- .5
dat_treat <- glm_data(
Y ~ b0+b1*abs(sin(W1))+b2*A,
W1 = W1,
A = rbinom (n, 1, exp_prob)
)
dat_notreat <- glm_data(
Y ~ b0+b1*abs(sin(W1)),
W1 = W1
)
learners <- list(
mars = list(
model = parsnip::set_engine(
parsnip::mars(
mode = "regression", prod_degree = 3
),
"earth"
)
),
lm = list(
model = parsnip::set_engine(
parsnip::linear_reg(),
"lm"
)
)
)
ate <- rctglm_with_prognosticscore(
formula = Y ~ .,
exposure_indicator = A,
exposure_prob = exp_prob,
data = dat_treat,
family = gaussian(),
estimand_fun = "ate",
data_hist = dat_notreat,
learners = learners)
#>
#> ── Fitting prognostic model ──
#>
#> ℹ Created formula for fitting prognostic model as: Y ~ .
#> ℹ Fitting learners
#> • mod_mars
#> • mod_lm
#> i No tuning parameters. `fit_resamples()` will be attempted
#> i 1 of 2 resampling: mod_mars
#> ✔ 1 of 2 resampling: mod_mars (125ms)
#> i No tuning parameters. `fit_resamples()` will be attempted
#> i 2 of 2 resampling: mod_lm
#> ✔ 2 of 2 resampling: mod_lm (92ms)
#> ℹ Model with lowest RMSE: mod_mars
#> ℹ Investigate trained learners and fitted model in `prognostic_info` list element
#>
#> ── Symbolic differentiation of estimand function ──
#>
#> ℹ Symbolically deriving partial derivative of the function 'psi1 - psi0' with respect to 'psi0' as: '-1'.
#> • Alternatively, specify the derivative through the argument
#> `estimand_fun_deriv0`
#> ℹ Symbolically deriving partial derivative of the function 'psi1 - psi0' with respect to 'psi1' as: '1'.
#> • Alternatively, specify the derivative through the argument
#> `estimand_fun_deriv1`
# Pull information on estimand
estimand(ate)
#> Estimate Std. Error
#> 1 2.043516 0.1911481