--- title: "How does CmdStanR work?" author: "Jonah Gabry and Rok Češnovar" output: rmarkdown::html_vignette: toc: true toc_depth: 4 params: EVAL: !r identical(Sys.getenv("NOT_CRAN"), "true") vignette: > %\VignetteIndexEntry{How does CmdStanR work?} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r child="children/_settings-knitr.Rmd"} ``` ## Introduction This vignette is intended to be read after the [_Getting started with CmdStanR_](http://mc-stan.org/cmdstanr/articles/cmdstanr.html) vignette. Please read that first for important background. In this document we provide additional details about compiling models, passing in data, and how CmdStan output is saved and read back into R. We will only use the `$sample()` method in examples, but all model fitting methods work in a similar way under the hood. ```{r setup, message=FALSE} library(cmdstanr) check_cmdstan_toolchain(fix = TRUE, quiet = TRUE) ``` ## Compilation ### Immediate compilation The `cmdstan_model()` function creates a new `CmdStanModel` object. The `CmdStanModel` object stores the path to a Stan program as well as the path to a compiled executable. ```{r start-clean, include=FALSE} exe <- file.path(cmdstan_path(), "examples", "bernoulli", "bernoulli") unlink(exe) ``` ```{r compile} stan_file <- file.path(cmdstan_path(), "examples", "bernoulli", "bernoulli.stan") mod <- cmdstan_model(stan_file) mod$print() mod$stan_file() mod$exe_file() ``` Subsequently, if you create a `CmdStanModel` object from the same Stan file then compilation will be skipped (assuming the file hasn't changed). ```{r already-compiled} mod <- cmdstan_model(stan_file) ``` Internally, `cmdstan_model()` first creates the `CmdStanModel` object from just the Stan file and then calls its [`$compile()`](http://mc-stan.org/cmdstanr/reference/model-method-compile.html) method. Optional arguments to the `$compile()` method can be passed via `...`. ```{r compile-options, eval=FALSE} mod <- cmdstan_model( stan_file, force_recompile = TRUE, include_paths = "paths/to/directories/with/included/files", cpp_options = list(stan_threads = TRUE, STANC2 = TRUE) ) ``` ### Delayed compilation It is also possible to delay compilation when creating the `CmdStanModel` object by specifying `compile=FALSE` and then later calling the `$compile()` method directly. ```{r compile-method} unlink(mod$exe_file()) mod <- cmdstan_model(stan_file, compile = FALSE) mod$exe_file() # not yet created mod$compile() mod$exe_file() ``` ### Pedantic check If you are using CmdStan version 2.24 or later and CmdStanR version 0.2.1 or later, you can run a pedantic check for your model. CmdStanR will always check that your Stan program does not contain any invalid syntax but with pedantic mode enabled the check will also warn you about other potential issues in your model, for example: - Distribution usages issues: distribution arguments do not match the distribution specification, or some specific distribution is used in an inadvisable way. - Unused parameter: a parameter is defined but does not contribute to target. - Large or small constant in a distribution: very large or very small constants are used as distribution arguments. - Control flow depends on a parameter: branching control flow (like if/else) depends on a parameter value. - Parameter has multiple twiddles: a parameter is on the left-hand side of multiple twiddles (i.e., multiple `~` symbols). - Parameter has zero or multiple priors: a parameter has zero or more than one prior distribution. - Variable is used before assignment: a variable is used before being assigned a value. - Strict or nonsensical parameter bounds: a parameter is given questionable bounds. For the latest information on the checks performed in pedantic mode see the [Pedantic mode chapter](https://mc-stan.org/docs/stan-users-guide/pedantic-mode.html) in the Stan Reference Manual. Pedantic mode is available when compiling the model or when using the separate `$check_syntax()` method of a `CmdStanModel` object. Internally this corresponds to setting the `stanc` (Stan transpiler) option `warn-pedantic`. Here we demonstrate pedantic mode with a Stan program that is syntactically correct but is missing a lower bound and a prior for a parameter. ```{r stan_file_pedantic} stan_file_pedantic <- write_stan_file(" data { int N; array[N] int y; } parameters { // should have but omitting to demonstrate pedantic mode real lambda; } model { y ~ poisson(lambda); } ") ``` To turn on pedantic mode at compile time you can set `pedantic=TRUE` in the call to `cmdstan_model()` (or when calling the `$compile()` method directly if using the delayed compilation approach described above). ```{r pedantic-compile, collapse = TRUE} mod_pedantic <- cmdstan_model(stan_file_pedantic, pedantic = TRUE) ``` To turn on pedantic mode separately from compilation use the `pedantic` argument to the `$check_syntax()` method. ```{r pedantic-check_syntax, collapse=TRUE} mod_pedantic$check_syntax(pedantic = TRUE) ``` Using `pedantic=TRUE` via the `$check_syntax()` method also has the advantage that it can be used even if the model hasn't been compiled yet. This can be helpful because the pedantic and syntax checks themselves are much faster than compilation. ```{r pedantic-check_syntax-2, collapse=TRUE} file.remove(mod_pedantic$exe_file()) # delete compiled executable rm(mod_pedantic) mod_pedantic <- cmdstan_model(stan_file_pedantic, compile = FALSE) mod_pedantic$check_syntax(pedantic = TRUE) ``` ### Stan model variables If using CmdStan 2.27 or newer, you can obtain the names, types and dimensions of the data, parameters, transformed parameters and generated quantities variables of a Stan model using the `$variables()` method of the `CmdStanModel` object. ```{r stan_file_variables} stan_file_variables <- write_stan_file(" data { int J; vector[J] sigma; vector[J] y; } parameters { real mu; real tau; vector[J] theta_raw; } transformed parameters { vector[J] theta = mu + tau * theta_raw; } model { target += normal_lpdf(tau | 0, 10); target += normal_lpdf(mu | 0, 10); target += normal_lpdf(theta_raw | 0, 1); target += normal_lpdf(y | theta, sigma); } ") mod_v <- cmdstan_model(stan_file_variables) variables <- mod_v$variables() ``` The `$variables()` method returns a list with `data`, `parameters`, `transformed_parameters` and `generated_quantities` elements, each corresponding to variables in their respective block of the program. Transformed data variables are not listed as they are not used in the model's input or output. ```{r variables-list-names} names(variables) names(variables$data) names(variables$parameters) names(variables$transformed_parameters) names(variables$generated_quantities) ``` Each variable is represented as a list containing the type information (currently limited to `real` or `int`) and the number of dimensions. ```{r variable-type-dims} variables$data$J variables$data$sigma variables$parameters$tau variables$transformed_parameters$theta ``` ### Executable location By default, the executable is created in the same directory as the file containing the Stan program. You can also specify a different location with the `dir` argument. ```{r compile-with-dir, eval = FALSE} mod <- cmdstan_model(stan_file, dir = "path/to/directory/for/executable") ``` ## Processing data There are three data formats that CmdStanR allows when fitting a model: * named list of R objects * JSON file * R dump file ### Named list of R objects Like the RStan interface, CmdStanR accepts a named list of R objects where the names correspond to variables declared in the data block of the Stan program. In the Bernoulli model the data is `N`, the number of data points, and `y` an integer array of observations. ```{r print-program-again} mod$print() ``` ```{r data-list, eval=FALSE} # data block has 'N' and 'y' data_list <- list(N = 10, y = c(0,1,0,0,0,0,0,0,0,1)) fit <- mod$sample(data = data_list) ``` Because CmdStan doesn't accept lists of R objects, CmdStanR will first write the data to a temporary JSON file using `write_stan_json()`. This happens internally, but it is also possible to call `write_stan_json()` directly. ```{r write_stan_json} data_list <- list(N = 10, y = c(0,1,0,0,0,0,0,0,0,1)) json_file <- tempfile(fileext = ".json") write_stan_json(data_list, json_file) cat(readLines(json_file), sep = "\n") ``` ### JSON file If you already have your data in a JSON file you can just pass that file directly to CmdStanR instead of using a list of R objects. For example, we could pass in the JSON file we created above using `write_stan_json()`: ```{r data-json, eval=FALSE} fit <- mod$sample(data = json_file) ``` ### R dump file Finally, it is also possible to use the R dump file format. This is *not* recommended because CmdStan can process JSON faster than R dump, but CmdStanR allows it because CmdStan will accept files created by `rstan::stan_rdump()`: ```{r data-rdump, eval=FALSE} rdump_file <- tempfile(fileext = ".data.R") rstan::stan_rdump(names(data_list), file = rdump_file, envir = list2env(data_list)) cat(readLines(rdump_file), sep = "\n") fit <- mod$sample(data = rdump_file) ``` ## Writing CmdStan output to CSV ### Default temporary files ```{r sample-tempdir, results = "hide"} data_list <- list(N = 10, y = c(0,1,0,0,0,0,0,0,0,1)) fit <- mod$sample(data = data_list) ``` When fitting a model, the default behavior is to write the output from CmdStan to CSV files in a temporary directory. ```{r output_files} fit$output_files() ``` These files will be lost if you end your R session or if you remove the `fit` object and force (or wait for) garbage collection. ```{r gc} files <- fit$output_files() file.exists(files) rm(fit) gc() file.exists(files) ``` ### Non-temporary files To save these files to a non-temporary location there are two options. You can either specify the `output_dir` argument to `mod$sample()` or use `fit$save_output_files()` after fitting the model. ```{r save_output_files, eval=FALSE} # see ?save_output_files for info on optional arguments fit$save_output_files(dir = "path/to/directory") ``` ```{r output_dir, eval = FALSE} fit <- mod$sample( data = data_list, output_dir = "path/to/directory" ) ``` ## Reading CmdStan output into R ### Lazy CSV reading With the exception of some diagnostic information, the CSV files are not read into R until their contents are requested by calling a method that requires them (e.g., `fit$draws()`, `fit$summary()`, etc.). If we examine the structure of the `fit` object, notice how the `Private` slot `draws_` is `NULL`, indicating that the CSV files haven't yet been read into R. ```{r refit, include=FALSE} fit <- mod$sample(data = data_list) ``` ```{r csv-not-read} str(fit) ``` After we call a method that requires the draws then if we reexamine the structure of the object we will see that the `draws_` slot in `Private` is no longer empty. ```{r for-csv-reading} draws <- fit$draws() # force CSVs to be read into R str(fit) ``` For models with many parameters, transformed parameters, or generated quantities, if only some are requested (e.g., by specifying the `variables` argument to `fit$draws()`) then CmdStanR will only read in the requested variables (unless they have already been read in). ### read_cmdstan_csv() Internally, the `read_cmdstan_csv()` function is used to read the CmdStan CSV files into R. This function is exposed to users, so you can also call it directly. ```{r read_cmdstan_csv} # see ?read_cmdstan_csv for info on optional arguments controlling # what information is read in csv_contents <- read_cmdstan_csv(fit$output_files()) str(csv_contents) ``` ### as_cmdstan_fit() If you need to manually create fitted model objects from CmdStan CSV files use `as_cmdstan_fit()`. ```{r as_cmdstan_fit} fit2 <- as_cmdstan_fit(fit$output_files()) ``` This is pointless in our case since we have the original `fit` object, but this function can be used to create fitted model objects (`CmdStanMCMC`, `CmdStanMLE`, etc.) from any CmdStan CSV files. ### Saving and accessing advanced algorithm info (latent dynamics) If `save_latent_dynamics` is set to `TRUE` when running the `$sample()` method then additional CSV files are created (one per chain) that provide access to quantities used under the hood by Stan's implementation of dynamic Hamiltonian Monte Carlo. CmdStanR does not yet provide a special method for processing these files but they can be read into R using R's standard CSV reading functions. ```{r save_latent_dynamics, results = "hide"} fit <- mod$sample(data = data_list, save_latent_dynamics = TRUE) ``` ```{r read-latent-dynamics} fit$latent_dynamics_files() # read one of the files in x <- utils::read.csv(fit$latent_dynamics_files()[1], comment.char = "#") head(x) ``` The column `lp__` is also provided via `fit$draws()`, and the columns `accept_stat__`, `stepsize__`, `treedepth__`, `n_leapfrog__`, `divergent__`, and `energy__` are also provided by `fit$sampler_diagnostics()`, but there are several columns unique to the latent dynamics file. ```{r explore-latent-dynamics} head(x[, c("theta", "p_theta", "g_theta")]) ``` Our model has a single parameter `theta` and the three columns above correspond to `theta` in the _unconstrained_ space (`theta` on the constrained space is accessed via `fit$draws()`), the auxiliary momentum `p_theta`, and the gradient `g_theta`. In general, each of these three columns will exist for _every_ parameter in the model. ## Developing using CmdStanR CmdStanR can of course be used for developing other packages that require compiling and running Stan models as well as using new or custom Stan features available through CmdStan. ### Pre-compiled Stan models in R packages You may compile a Stan model at runtime (e.g. just before sampling), or you may compile all the models inside the package file system in advance at installation time. The latter avoids compilations at runtime, which matters in centrally managed R installations where users should not compile their own software. To pre-compile all the models in a package, you may create top-level scripts `configure` and `configure.win` which run `cmdstan_model()` with `compile = TRUE` and save the compiled executables somewhere inside the `inst/` folder of the package source. The [`instantiate`](https://wlandau.github.io/instantiate/) package helps developers configure packages this way, and it documents other topics such as submitting to CRAN and administering CmdStan. Kevin Ushey's [`configure`](https://github.com/kevinushey/configure) package helps create and manage package configuration files in general. ### Troubleshooting and debugging When developing or testing new features it might be useful to have more information on how CmdStan is called internally and to see more information printed when compiling or running models. This can be enabled for an entire R session by setting the option `"cmdstanr_verbose"` to `TRUE`. ```{r verbose-mode} options("cmdstanr_verbose"=TRUE) mod <- cmdstan_model(stan_file, force_recompile = TRUE) fit <- mod$sample( data = data_list, chains = 1, iter_warmup = 100, iter_sampling = 100 ) ``` ```{r include=FALSE} options("cmdstanr_verbose" = FALSE) ```