CmdStanR (Command Stan R) is a lightweight interface to Stan for R users that provides an alternative to the traditional RStan interface. See the Comparison with RStan section later in this vignette for more details on how the two interfaces differ.
Using CmdStanR requires installing the cmdstanr R package and also CmdStan, the command line interface to Stan.
# we recommend running this is a fresh R session or restarting your current session
install.packages("cmdstanr", repos = c("https://mc-stan.org/r-packages/", getOption("repos")))
We can now load the package like any other R package.
## Warning: package 'bayesplot' was built under R version 4.3.2
CmdStanR requires a working installation of CmdStan, the shell interface to Stan. If you don’t have CmdStan installed then CmdStanR can install it for you, assuming you have a suitable C++ toolchain. The requirements are described in the CmdStan Guide.
To double check that your toolchain is set up properly you can call
the check_cmdstan_toolchain()
function:
## The C++ toolchain required for CmdStan is setup properly!
If your toolchain is configured correctly then CmdStan can be
installed by calling the install_cmdstan()
function:
Before CmdStanR can be used it needs to know where the CmdStan installation is located. When the package is loaded it tries to help automate this to avoid having to manually set the path every session:
If the environment variable "CMDSTAN"
exists at load
time then its value will be automatically set as the default path to
CmdStan for the R session. This is useful if your
CmdStan installation is not located in the default
directory that would have been used by install_cmdstan()
(see #2 next).
If no environment variable is found when loaded but any directory
in the form ".cmdstan/cmdstan-[version]"
, for example
".cmdstan/cmdstan-2.23.0"
, exists in the user’s home
directory (Sys.getenv("HOME")
, not the current
working directory) then the path to the CmdStan with
the largest version number will be set as the path to CmdStan for the R
session. This is the same as the default directory that
install_cmdstan()
uses to install the latest version of
CmdStan, so if that’s how you installed CmdStan you
shouldn’t need to manually set the path to CmdStan when
loading CmdStanR.
If neither of these applies (or you want to subsequently change the
path) you can use the set_cmdstan_path()
function:
To check the path to the CmdStan installation and the CmdStan version
number you can use cmdstan_path()
and
cmdstan_version()
:
## [1] "C:/Users/wgfoo/OneDrive/Documents/.cmdstan/cmdstan-2.34.1"
## [1] "2.34.1"
The cmdstan_model()
function creates a new CmdStanModel
object from a file containing a Stan program. Under the hood, CmdStan is
called to translate a Stan program to C++ and create a compiled
executable. Here we’ll use the example Stan program that comes with the
CmdStan installation:
file <- file.path(cmdstan_path(), "examples", "bernoulli", "bernoulli.stan")
mod <- cmdstan_model(file)
The object mod
is an R6 reference object of class CmdStanModel
and behaves similarly to R’s reference class objects and those in object
oriented programming languages. Methods are accessed using the
$
operator. This design choice allows for
CmdStanR and CmdStanPy to provide a
similar user experience and share many implementation details.
The Stan program can be printed using the $print()
method:
## data {
## int<lower=0> N;
## array[N] int<lower=0,upper=1> y;
## }
## parameters {
## real<lower=0,upper=1> theta;
## }
## model {
## theta ~ beta(1,1); // uniform prior on interval 0,1
## y ~ bernoulli(theta);
## }
The path to the compiled executable is returned by the
$exe_file()
method:
## [1] "C:/Users/wgfoo/OneDrive/Documents/.cmdstan/cmdstan-2.34.1/examples/bernoulli/bernoulli.exe"
The $sample()
method for CmdStanModel
objects runs Stan’s default MCMC algorithm. The data
argument accepts a named list of R objects (like for RStan) or a path to
a data file compatible with CmdStan (JSON or R dump).
# names correspond to the data block in the Stan program
data_list <- list(N = 10, y = c(0,1,0,0,0,0,0,0,0,1))
fit <- mod$sample(
data = data_list,
seed = 123,
chains = 4,
parallel_chains = 4,
refresh = 500 # print update every 500 iterations
)
## Running MCMC with 4 parallel chains...
##
## Chain 1 Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 1 Iteration: 500 / 2000 [ 25%] (Warmup)
## Chain 1 Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 1 Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 1 Iteration: 1500 / 2000 [ 75%] (Sampling)
## Chain 1 Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 2 Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 2 Iteration: 500 / 2000 [ 25%] (Warmup)
## Chain 2 Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 2 Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 2 Iteration: 1500 / 2000 [ 75%] (Sampling)
## Chain 3 Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 3 Iteration: 500 / 2000 [ 25%] (Warmup)
## Chain 3 Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 3 Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 3 Iteration: 1500 / 2000 [ 75%] (Sampling)
## Chain 4 Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 4 Iteration: 500 / 2000 [ 25%] (Warmup)
## Chain 4 Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 4 Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 1 finished in 0.1 seconds.
## Chain 2 Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 2 finished in 0.1 seconds.
## Chain 3 Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 3 finished in 0.0 seconds.
## Chain 4 Iteration: 1500 / 2000 [ 75%] (Sampling)
## Chain 4 Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 4 finished in 0.1 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 0.1 seconds.
## Total execution time: 0.6 seconds.
There are many more arguments that can be passed to the
$sample()
method. For details follow this link to its
separate documentation page:
The $sample()
method creates R6 CmdStanMCMC
objects,
which have many associated methods. Below we will demonstrate some of
the most important methods. For a full list, follow this link to the
CmdStanMCMC
documentation:
The $summary()
method calls summarise_draws()
from the
posterior package. The first argument specifies the
variables to summarize and any arguments after that are passed on to
posterior::summarise_draws()
to specify which summaries to
compute, whether to use multiple cores, etc.
fit$summary()
fit$summary(variables = c("theta", "lp__"), "mean", "sd")
# use a formula to summarize arbitrary functions, e.g. Pr(theta <= 0.5)
fit$summary("theta", pr_lt_half = ~ mean(. <= 0.5))
# summarise all variables with default and additional summary measures
fit$summary(
variables = NULL,
posterior::default_summary_measures(),
extra_quantiles = ~posterior::quantile2(., probs = c(.0275, .975))
)
## variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
## 1 lp__ -7.26 -6.98 0.73 0.31 -8.729 -6.75 1 1735 1632
## 2 theta 0.24 0.23 0.12 0.12 0.079 0.45 1 1512 1392
## variable mean sd
## 1 theta 0.24 0.12
## variable pr_lt_half
## 1 theta 0.97
print.data.frame(fit$summary(
variables = NULL,
posterior::default_summary_measures(),
extra_quantiles = ~posterior::quantile2(., probs = c(.0275, .975))
))
## variable mean median sd mad q5 q95 q2.75 q97.5
## 1 lp__ -7.26 -6.98 0.73 0.31 -8.729 -6.75 -9.217 -6.7
## 2 theta 0.24 0.23 0.12 0.12 0.079 0.45 0.064 0.5
CmdStan itself provides a stansummary
utility that can
be called using the $cmdstan_summary()
method. This method
will print summaries but won’t return anything.
The $draws()
method can be used to extract the posterior draws in formats provided by
the posterior
package. Here we demonstrate only the draws_array
and
draws_df
formats, but the posterior
package supports other useful formats as well.
# default is a 3-D draws_array object from the posterior package
# iterations x chains x variables
draws_arr <- fit$draws() # or format="array"
str(draws_arr)
## 'draws_array' num [1:1000, 1:4, 1:2] -6.76 -7.28 -6.83 -7.64 -7.85 ...
## - attr(*, "dimnames")=List of 3
## ..$ iteration: chr [1:1000] "1" "2" "3" "4" ...
## ..$ chain : chr [1:4] "1" "2" "3" "4"
## ..$ variable : chr [1:2] "lp__" "theta"
## draws_df [4,000 × 5] (S3: draws_df/draws/tbl_df/tbl/data.frame)
## $ lp__ : num [1:4000] -6.76 -7.28 -6.83 -7.64 -7.85 ...
## $ theta : num [1:4000] 0.23 0.391 0.202 0.435 0.457 ...
## $ .chain : int [1:4000] 1 1 1 1 1 1 1 1 1 1 ...
## $ .iteration: int [1:4000] 1 2 3 4 5 6 7 8 9 10 ...
## $ .draw : int [1:4000] 1 2 3 4 5 6 7 8 9 10 ...
## # A draws_df: 1000 iterations, 4 chains, and 2 variables
## lp__ theta
## 1 -6.8 0.230
## 2 -7.3 0.391
## 3 -6.8 0.202
## 4 -7.6 0.435
## 5 -7.8 0.457
## 6 -7.8 0.457
## 7 -6.8 0.226
## 8 -8.9 0.063
## 9 -8.7 0.068
## 10 -7.2 0.142
## # ... with 3990 more draws
## # ... hidden reserved variables {'.chain', '.iteration', '.draw'}
To convert an existing draws object to a different format use the
posterior::as_draws_*()
functions.
# this should be identical to draws_df created via draws(format = "df")
draws_df_2 <- as_draws_df(draws_arr)
identical(draws_df, draws_df_2)
## [1] TRUE
In general, converting to a different draws format in this way will
be slower than just setting the appropriate format initially in the call
to the $draws()
method, but in most cases the speed
difference will be minor.
Plotting posterior distributions is as easy as passing the object
returned by the $draws()
method directly to plotting
functions in our bayesplot
package.
The $sampler_diagnostics()
method extracts the values of the sampler parameters
(treedepth__
, divergent__
, etc.) in formats
supported by the posterior package. The default is as a
3-D array (iteration x chain x variable).
## 'draws_array' num [1:1000, 1:4, 1:6] 2 1 1 2 1 1 1 1 1 2 ...
## - attr(*, "dimnames")=List of 3
## ..$ iteration: chr [1:1000] "1" "2" "3" "4" ...
## ..$ chain : chr [1:4] "1" "2" "3" "4"
## ..$ variable : chr [1:6] "treedepth__" "divergent__" "energy__" "accept_stat__" ...
## draws_df [4,000 × 9] (S3: draws_df/draws/tbl_df/tbl/data.frame)
## $ treedepth__ : num [1:4000] 2 1 1 2 1 1 1 1 1 2 ...
## $ divergent__ : num [1:4000] 0 0 0 0 0 0 0 0 0 0 ...
## $ energy__ : num [1:4000] 7.1 7.62 7.84 7.64 8.01 ...
## $ accept_stat__: num [1:4000] 1 0.892 0.927 0.949 0.948 ...
## $ stepsize__ : num [1:4000] 0.837 0.837 0.837 0.837 0.837 ...
## $ n_leapfrog__ : num [1:4000] 3 3 3 3 1 1 3 3 1 3 ...
## $ .chain : int [1:4000] 1 1 1 1 1 1 1 1 1 1 ...
## $ .iteration : int [1:4000] 1 2 3 4 5 6 7 8 9 10 ...
## $ .draw : int [1:4000] 1 2 3 4 5 6 7 8 9 10 ...
The $diagnostic_summary()
method will display any
sampler diagnostic warnings and return a summary of diagnostics for each
chain.
## $num_divergent
## [1] 0 0 0 0
##
## $num_max_treedepth
## [1] 0 0 0 0
##
## $ebfmi
## [1] 0.97 1.14 1.08 1.34
We see the number of divergences for each of the four chains, the number of times the maximum treedepth was hit for each chain, and the E-BFMI for each chain.
In this case there were no warnings, so in order to demonstrate the warning messages we’ll use one of the CmdStanR example models that suffers from divergences.
## In file included from stan/lib/stan_math/stan/math/prim/prob/von_mises_lccdf.hpp:5,
## from stan/lib/stan_math/stan/math/prim/prob/von_mises_ccdf_log.hpp:4,
## from stan/lib/stan_math/stan/math/prim/prob.hpp:359,
## from stan/lib/stan_math/stan/math/prim.hpp:16,
## from stan/lib/stan_math/stan/math/rev.hpp:16,
## from stan/lib/stan_math/stan/math.hpp:19,
## from stan/src/stan/model/model_header.hpp:4,
## from C:/Users/wgfoo/AppData/Local/Temp/Rtmp0M3N0N/model-74ec11751a19.hpp:2:
## stan/lib/stan_math/stan/math/prim/prob/von_mises_cdf.hpp: In function 'stan::return_type_t<T_x, T_sigma, T_l> stan::math::von_mises_cdf(const T_x&, const T_mu&, const T_k&)':
## stan/lib/stan_math/stan/math/prim/prob/von_mises_cdf.hpp:194: note: '-Wmisleading-indentation' is disabled from this point onwards, since column-tracking was disabled due to the size of the code/headers
## 194 | if (cdf_n < 0.0)
## |
## stan/lib/stan_math/stan/math/prim/prob/von_mises_cdf.hpp:194: note: adding '-flarge-source-files' will allow for more column-tracking support, at the expense of compilation time and memory
## Warning: 101 of 4000 (3.0%) transitions ended with a divergence.
## See https://mc-stan.org/misc/warnings for details.
After fitting there is a warning about divergences. We can also
regenerate this warning message later using
fit$diagnostic_summary()
.
## Warning: 101 of 4000 (3.0%) transitions ended with a divergence.
## See https://mc-stan.org/misc/warnings for details.
## $num_divergent
## [1] 33 27 26 15
##
## $num_max_treedepth
## [1] 0 0 0 0
##
## $ebfmi
## [1] 0.37 0.33 0.45 0.39
# number of divergences reported in warning is the sum of the per chain values
sum(diagnostics$num_divergent)
## [1] 101
CmdStan itself provides a diagnose
utility that can be
called using the $cmdstan_diagnose()
method. This method
will print warnings but won’t return anything.
stanfit
objectIf you have RStan installed then it is also possible to create a
stanfit
object from the csv output files written by
CmdStan. This can be done by using rstan::read_stan_csv()
in combination with the $output_files()
method of the
CmdStanMCMC
object. This is only needed if you want to fit
a model with CmdStanR but already have a lot of post-processing code
that assumes a stanfit
object. Otherwise we recommend using
the post-processing functionality provided by CmdStanR itself.
There are additional vignettes available that discuss other aspects of using CmdStanR. These can be found online at the CmdStanR website:
To ask a question please post on the Stan forums:
To report a bug, suggest a feature (including additions to these vignettes), or to start contributing to CmdStanR development (new contributors welcome!) please open an issue on GitHub: