--- title: "Bayesian simulation pipelines with stantargets" output: rmarkdown::html_vignette bibliography: simulation.bib vignette: > %\VignetteIndexEntry{Bayesian simulation pipelines with stantargets} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} # With the root.dir option below, # this vignette runs the R code in a temporary directory # so new files are written to temporary storage # and not the user's file space. knitr::opts_knit$set(root.dir = fs::dir_create(tempfile())) knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 6 ) if (identical(Sys.getenv("NOT_CRAN", unset = "false"), "false")) { knitr::opts_chunk$set(eval = FALSE) } library(cmdstanr) library(dplyr) library(ggplot2) library(posterior) library(purrr) library(targets) library(tidyr) library(stantargets) if (identical(Sys.getenv("IN_PKGDOWN"), "true")) { cmdstanr::install_cmdstan() } ``` ## Background The [introductory vignette](https://docs.ropensci.org/stantargets/articles/introduction.html) vignette caters to Bayesian data analysis workflows with few datasets to analyze. However, it is sometimes desirable to run one or more Bayesian models repeatedly across multiple simulated datasets. Examples: 1. Validate the implementation of a Bayesian model using simulation. 2. Simulate a randomized controlled experiment to explore frequentist properties such as power and Type I error. This vignette focuses on (1). ## Example project Visit for an example project based on this vignette. The example has an [RStudio Cloud workspace](https://rstudio.cloud/project/2466069) which allows you to run the project in a web browser. ## Interval-based model validation pipeline This particular example uses the concept of calibration that Bob Carpenter [explains here](https://statmodeling.stat.columbia.edu/2017/04/12/bayesian-posteriors-calibrated/) [@carpenter2017]. The goal is to simulate multiple datasets from the model below, analyze each dataset, and assess how often the estimated posterior intervals cover the true parameters from the prior predictive simulations. If coverage is no systematically different from nominal, this is evidence that the model was implemented correctly. The quantile method by @cook2006 generalizes this concept, and simulation-based calibration [@talts2020] generalizes further. The interval-based technique featured in this vignette is not as robust as SBC, but it may be more expedient for large models because it does not require visual inspection of multiple histograms. See a later section in this vignette for an example of simulation-based calibration on this same model. ```{r, eval = TRUE} lines <- "data { int n; vector[n] x; vector[n] y; } parameters { vector[2] beta; } model { y ~ normal(beta[1] + x * beta[2], 1); beta ~ normal(0, 1); }" writeLines(lines, "model.stan") ``` Next, we define a pipeline to simulate multiple datasets and fit each dataset with the model. In our data-generating function, we put the true parameter values of each simulation in a special `.join_data` list. `stantargets` will automatically join the elements of `.join_data` to the correspondingly named variables in the summary output. This will make it super easy to check how often our posterior intervals capture the truth. 
As for scale, generate 10 datasets (5 batches with 2 replications each) and run the model on each of the 10 datasets.^[Internally, each batch is a [dynamic branch target](https://books.ropensci.org/targets/dynamic.html), and the number of replications determines the amount of work done within a branch. In the general case, [batching](https://books.ropensci.org/targets/dynamic.html#batching) is a way to find the right compromise between target-specific overhead and the horizontal scale of the pipeline.] By default, each of the 10 model runs computes 4 MCMC chains with 2000 MCMC iterations each (including burn-in), which you can adjust with the `chains`, `iter_sampling`, and `iter_warmup` arguments of `tar_stan_mcmc_rep_summary()`.

```{r, echo = FALSE, eval = TRUE}
library(targets)
tar_script({
  library(stantargets)
  options(crayon.enabled = FALSE)
  # Use computer memory more sparingly:
  tar_option_set(memory = "transient", garbage_collection = TRUE)
  simulate_data <- function(n = 10L) {
    beta <- rnorm(n = 2, mean = 0, sd = 1)
    x <- seq(from = -1, to = 1, length.out = n)
    y <- rnorm(n, beta[1] + x * beta[2], 1)
    list(
      n = n,
      x = x,
      y = y,
      .join_data = list(beta = beta)
    )
  }
  list(
    tar_stan_mcmc_rep_summary(
      model,
      "model.stan",
      simulate_data(),
      batches = 5, # Number of branch targets.
      reps = 2, # Number of model reps per branch target.
      variables = "beta",
      summaries = list(
        ~posterior::quantile2(.x, probs = c(0.025, 0.975))
      ),
      stdout = R.utils::nullfile(),
      stderr = R.utils::nullfile()
    )
  )
})
tar_load_globals()
```

```{r, eval = FALSE}
# _targets.R
library(targets)
library(stantargets)
options(crayon.enabled = FALSE)
# Use computer memory more sparingly:
tar_option_set(memory = "transient", garbage_collection = TRUE)
simulate_data <- function(n = 10L) {
  beta <- rnorm(n = 2, mean = 0, sd = 1)
  x <- seq(from = -1, to = 1, length.out = n)
  y <- rnorm(n, beta[1] + x * beta[2], 1)
  list(
    n = n,
    x = x,
    y = y,
    .join_data = list(beta = beta)
  )
}
list(
  tar_stan_mcmc_rep_summary(
    model,
    "model.stan",
    simulate_data(), # Runs once per rep.
    batches = 5, # Number of branch targets.
    reps = 2, # Number of model reps per branch target.
    variables = "beta",
    summaries = list(
      ~posterior::quantile2(.x, probs = c(0.025, 0.975))
    ),
    stdout = R.utils::nullfile(),
    stderr = R.utils::nullfile()
  )
)
```

We now have a pipeline that runs the model 10 times: 5 batches (branch targets) with 2 replications per batch.

```{r, eval = TRUE}
tar_visnetwork()
```

Run the computation with `tar_make()`.

```{r, output = FALSE, warning = FALSE, eval = TRUE}
tar_make()
```

The result is an aggregated data frame of summary statistics, where the `.rep` column distinguishes among individual replicates. We have the posterior intervals for `beta` in columns `q2.5` and `q97.5`. And thanks to `.join_data` in `simulate_data()`, there is a special `.join_data` column in the output to indicate the true value of each parameter from the simulation.

```{r, eval = TRUE}
tar_load(model)
model
```

Now, let's assess how often the estimated 95% posterior intervals capture the true values of `beta`. If the model is implemented correctly, the coverage statistic below should be close to 95%. (Ordinarily, we would [increase the number of batches and reps per batch](https://books.ropensci.org/targets/dynamic.html#batching) and [run the batches in parallel](https://books.ropensci.org/targets/hpc.html).)
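The [targets manual](https://books.ropensci.org/targets/hpc.html) covers distributed computing in depth. As a hypothetical sketch (the exact setup depends on your version of `targets` and your computing environment), one way to parallelize the batches is to register a `crew` controller in `_targets.R`:

```{r, eval = FALSE}
# Hypothetical addition to _targets.R: with a controller registered,
# tar_make() runs branch targets on parallel local worker processes.
tar_option_set(
  controller = crew::crew_controller_local(workers = 4)
)
```

No other changes to the pipeline would be needed: the batch targets are already independent of one another, so they can run concurrently.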
```{r, eval = TRUE}
library(dplyr)
model %>%
  group_by(variable) %>%
  summarize(coverage = mean(q2.5 < .join_data & .join_data < q97.5))
```

For maximum reproducibility, we should express the coverage assessment as a custom function and a target in the pipeline.

```{r, echo = FALSE, eval = TRUE}
library(targets)
tar_script({
  library(stantargets)
  options(crayon.enabled = FALSE)
  tar_option_set(
    packages = "dplyr",
    memory = "transient",
    garbage_collection = TRUE
  )
  simulate_data <- function(n = 10L) {
    beta <- rnorm(n = 2, mean = 0, sd = 1)
    x <- seq(from = -1, to = 1, length.out = n)
    y <- rnorm(n, beta[1] + x * beta[2], 1)
    list(
      n = n,
      x = x,
      y = y,
      .join_data = list(beta = beta)
    )
  }
  list(
    tar_stan_mcmc_rep_summary(
      model,
      "model.stan",
      simulate_data(),
      batches = 5, # Number of branch targets.
      reps = 2, # Number of model reps per branch target.
      variables = "beta",
      summaries = list(
        ~posterior::quantile2(.x, probs = c(0.025, 0.975))
      ),
      stdout = R.utils::nullfile(),
      stderr = R.utils::nullfile()
    ),
    tar_target(
      coverage,
      model %>%
        group_by(variable) %>%
        summarize(coverage = mean(q2.5 < .join_data & .join_data < q97.5))
    )
  )
})
```

```{r, eval = FALSE}
# _targets.R
library(targets)
library(stantargets)
tar_option_set(packages = "dplyr") # The coverage target uses group_by().
simulate_data <- function(n = 10L) {
  beta <- rnorm(n = 2, mean = 0, sd = 1)
  x <- seq(from = -1, to = 1, length.out = n)
  y <- rnorm(n, beta[1] + x * beta[2], 1)
  list(
    n = n,
    x = x,
    y = y,
    .join_data = list(beta = beta)
  )
}
list(
  tar_stan_mcmc_rep_summary(
    model,
    "model.stan",
    simulate_data(),
    batches = 5, # Number of branch targets.
    reps = 2, # Number of model reps per branch target.
    variables = "beta",
    summaries = list(
      ~posterior::quantile2(.x, probs = c(0.025, 0.975))
    ),
    stdout = R.utils::nullfile(),
    stderr = R.utils::nullfile()
  ),
  tar_target(
    coverage,
    model %>%
      group_by(variable) %>%
      summarize(coverage = mean(q2.5 < .join_data & .join_data < q97.5))
  )
)
```

The new `coverage` target should be the only outdated target, and it should be connected to the upstream `model` target.

```{r, eval = TRUE}
tar_visnetwork()
```

When we run the pipeline, only the coverage assessment should run. That way, we skip all the expensive computation of simulating datasets and running MCMC multiple times.

```{r, output = FALSE, warning = FALSE, eval = TRUE}
tar_make()
```

```{r, eval = TRUE}
tar_read(coverage)
```

## Multiple models

`tar_stan_mcmc_rep_summary()` and similar functions allow you to supply multiple Stan models. If you do, each model will share the same collection of datasets, and the `.dataset_id` column of the model target output allows for custom analyses that compare different models against each other. Suppose we have a new model, `model2.stan`.

```{r, eval = TRUE}
lines <- "data {
  int n;
  vector[n] x;
  vector[n] y;
}
parameters {
  vector[2] beta;
}
model {
  y ~ normal(beta[1] + x * x * beta[2], 1); // Regress on x^2 instead of x.
  beta ~ normal(0, 1);
}"
writeLines(lines, "model2.stan")
```

To set up the simulation workflow to run on both models, we add `model2.stan` to the `stan_files` argument of `tar_stan_mcmc_rep_summary()`. And in the coverage summary below, we group by `.name` to compute a coverage statistic for each model.
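Because both models see exactly the same simulated datasets, downstream targets can also compare the models head to head. For instance, after the pipeline below has run, a hypothetical analysis like this sketch (it assumes the `dplyr` and `tidyr` packages) could contrast the 95% interval widths of the two models on matched datasets, joining on `.dataset_id`:

```{r, eval = FALSE}
# Hypothetical comparison: one row per dataset and parameter, with one
# column of interval widths per model, matched by .dataset_id.
model %>%
  mutate(width = q97.5 - q2.5) %>%
  select(.name, .dataset_id, variable, width) %>%
  tidyr::pivot_wider(names_from = .name, values_from = width)
```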
```{r, echo = FALSE, eval = TRUE}
library(targets)
tar_script({
  library(stantargets)
  options(crayon.enabled = FALSE)
  tar_option_set(
    packages = "dplyr",
    memory = "transient",
    garbage_collection = TRUE
  )
  simulate_data <- function(n = 10L) {
    beta <- rnorm(n = 2, mean = 0, sd = 1)
    x <- seq(from = -1, to = 1, length.out = n)
    y <- rnorm(n, beta[1] + x * beta[2], 1)
    list(
      n = n,
      x = x,
      y = y,
      .join_data = list(beta = beta)
    )
  }
  list(
    tar_stan_mcmc_rep_summary(
      model,
      c("model.stan", "model2.stan"), # another model
      simulate_data(),
      batches = 5,
      reps = 2,
      variables = "beta",
      summaries = list(
        ~posterior::quantile2(.x, probs = c(0.025, 0.975))
      ),
      stdout = R.utils::nullfile(),
      stderr = R.utils::nullfile()
    ),
    tar_target(
      coverage,
      model %>%
        group_by(.name, variable) %>%
        summarize(coverage = mean(q2.5 < .join_data & .join_data < q97.5))
    )
  )
})
```

```{r, eval = FALSE}
# _targets.R
library(targets)
library(stantargets)
tar_option_set(packages = "dplyr") # The coverage target uses group_by().
simulate_data <- function(n = 10L) {
  beta <- rnorm(n = 2, mean = 0, sd = 1)
  x <- seq(from = -1, to = 1, length.out = n)
  y <- rnorm(n, beta[1] + x * beta[2], 1)
  list(
    n = n,
    x = x,
    y = y,
    .join_data = list(beta = beta)
  )
}
list(
  tar_stan_mcmc_rep_summary(
    model,
    c("model.stan", "model2.stan"), # another model
    simulate_data(),
    batches = 5,
    reps = 2,
    variables = "beta",
    summaries = list(
      ~posterior::quantile2(.x, probs = c(0.025, 0.975))
    ),
    stdout = R.utils::nullfile(),
    stderr = R.utils::nullfile()
  ),
  tar_target(
    coverage,
    model %>%
      group_by(.name, variable) %>%
      summarize(coverage = mean(q2.5 < .join_data & .join_data < q97.5))
  )
)
```

In the graph below, notice how targets `model_model` and `model_model2` are both connected to `model_data` upstream. Downstream, `model` is equivalent to `dplyr::bind_rows(model_model, model_model2)`, and it will have special columns `.name` and `.file` to distinguish among all the models.

```{r, eval = TRUE}
tar_visnetwork()
```

## Simulation-based calibration

This section explores a more rigorous validation study that adopts the full simulation-based calibration (SBC) method [@talts2020]. To use this method, we need a function that generates rank statistics from a simulated dataset and a data frame of posterior draws. If the model is implemented correctly, these rank statistics will be uniformly distributed for each model parameter. Our function will use the [`calculate_ranks_draws_matrix()`](https://hyunjimoon.github.io/SBC/reference/calculate_ranks_draws_matrix.html) function from the [`SBC`](https://hyunjimoon.github.io/SBC/) R package [@sbc].

```{r, eval = TRUE}
get_ranks <- function(data, draws) {
  draws <- select(draws, starts_with(names(data$.join_data)))
  truth <- map_dbl(
    names(draws),
    ~eval(parse(text = .x), envir = data$.join_data)
  )
  out <- SBC::calculate_ranks_draws_matrix(truth, as_draws_matrix(draws))
  as_tibble(as.list(out))
}
```

To demonstrate this function, we simulate a dataset,

```{r, eval = TRUE}
simulate_data <- function(n = 10L) {
  beta <- rnorm(n = 2, mean = 0, sd = 1)
  x <- seq(from = -1, to = 1, length.out = n)
  y <- rnorm(n, beta[1] + x * beta[2], 1)
  list(
    n = n,
    x = x,
    y = y,
    .join_data = list(beta = beta)
  )
}

data <- simulate_data()
str(data)
```

we make up a hypothetical set of posterior draws,

```{r, eval = TRUE}
draws <- tibble(`beta[1]` = rnorm(100), `beta[2]` = rnorm(100))
draws
```

and we call `get_ranks()` to get the SBC rank statistics for each model parameter.
```{r, eval = TRUE}
library(dplyr)
library(posterior)
library(purrr)
get_ranks(data = data, draws = draws)
```

To put this into practice in a pipeline, we supply the symbol `get_ranks` to the `transform` argument of `tar_stan_mcmc_rep_draws()`. That way, instead of a full set of draws, each replication will return only the output of `get_ranks()` on those draws (plus a few helper columns). If supplied, the `transform` argument of `tar_stan_mcmc_rep_draws()` must be the name of a function in the pipeline. This function must accept arguments `data` and `draws`, and it must return a data frame.

```{r, echo = FALSE, eval = TRUE}
library(targets)
tar_script({
  library(stantargets)
  options(crayon.enabled = FALSE)
  tar_option_set(
    packages = c("dplyr", "posterior", "purrr", "tibble"),
    memory = "transient",
    garbage_collection = TRUE
  )
  simulate_data <- function(n = 10L) {
    beta <- rnorm(n = 2, mean = 0, sd = 1)
    x <- seq(from = -1, to = 1, length.out = n)
    y <- rnorm(n, beta[1] + x * beta[2], 1)
    list(
      n = n,
      x = x,
      y = y,
      .join_data = list(beta = beta)
    )
  }
  get_ranks <- function(data, draws) {
    draws <- select(draws, starts_with(names(data$.join_data)))
    truth <- map_dbl(
      names(draws),
      ~eval(parse(text = .x), envir = data$.join_data)
    )
    out <- SBC::calculate_ranks_draws_matrix(truth, as_draws_matrix(draws))
    as_tibble(as.list(out))
  }
  list(
    tar_stan_mcmc_rep_draws(
      model,
      "model.stan",
      simulate_data(),
      batches = 5,
      reps = 2,
      variables = "beta",
      stdout = R.utils::nullfile(),
      stderr = R.utils::nullfile(),
      transform = get_ranks
    )
  )
})
```

```{r, eval = FALSE}
# _targets.R
library(targets)
library(stantargets)
tar_option_set(packages = c("dplyr", "posterior", "purrr", "tibble"))
simulate_data <- function(n = 10L) {
  beta <- rnorm(n = 2, mean = 0, sd = 1)
  x <- seq(from = -1, to = 1, length.out = n)
  y <- rnorm(n, beta[1] + x * beta[2], 1)
  list(
    n = n,
    x = x,
    y = y,
    .join_data = list(beta = beta)
  )
}
get_ranks <- function(data, draws) {
  draws <- select(draws, starts_with(names(data$.join_data)))
  truth <- map_dbl(
    names(draws),
    ~eval(parse(text = .x), envir = data$.join_data)
  )
  out <- SBC::calculate_ranks_draws_matrix(truth, as_draws_matrix(draws))
  as_tibble(as.list(out))
}
list(
  tar_stan_mcmc_rep_draws(
    model,
    c("model.stan"),
    simulate_data(),
    batches = 5,
    reps = 2,
    variables = "beta",
    stdout = R.utils::nullfile(),
    stderr = R.utils::nullfile(),
    transform = get_ranks # Supply the transform to get SBC ranks.
  )
)
```

Our new function `get_ranks()` is a dependency of one of our downstream targets, so any changes to `get_ranks()` will force the results to refresh in the next run of the pipeline.

```{r, eval = TRUE}
tar_visnetwork()
```

Let's run the pipeline to compute a set of rank statistics for each simulated dataset.

```{r, eval = TRUE, output = FALSE, warning = FALSE}
tar_make()
```

We have a data frame of rank statistics with one row per simulation rep and one column per model parameter.

```{r, eval = TRUE}
tar_load(model_model)
model_model
```

If the model is implemented correctly, then the rank statistics of each model parameter should be uniformly distributed. In practice, you may need thousands of simulation reps to make a judgment.

```{r, eval = TRUE}
library(ggplot2)
library(tidyr)
model_model %>%
  pivot_longer(
    starts_with("beta"),
    names_to = "parameter",
    values_to = "ranks"
  ) %>%
  ggplot(.) +
    geom_histogram(aes(x = ranks), bins = 10) +
    facet_wrap(~parameter) +
    theme_gray(12)
```

## References