basic-workflow.Rmd
Running BayesianBombCarbon on a given dataset involves several steps. The key step is preparing the data with teh data_prep
function. This function takes both a reference series dataset and/or a sample dataset with unknown true formation/birth years. If the reference series is supplied only, the reference series is fit with a Bayesian penalized B-spline. If a sample dataset is also supplied, then a reference series is fit while also estimating the aging bias of the sample dataset. Users can also pre-determine the formation/birth years the model will predict the reference series over in the data_prep
function. The function est_model()
will take the prepared dataset and using cmdstanr
fit the designated model while extract_draws
will pull the relevant posterior parameter samples from the fitted model. Lastly, users can use plot_fit
to visualize the predicted reference series, posterior parameter distributions, and, if sample data is supplied, the observed and bias-adjusted samples.
First, we must load the R library.
The provided dataset is a data.frame
that was simulated from a hypothetical reference series with a damped oscillating linear decay function after the peak of ∆14C. Three columns are provided but only BY
and C14
are used by the data_prep
function.
# We can visualize the reference series data
with(sim_ref, plot(BY,C14, pch=16, xlab="Reference/Birth Year",
ylab=expression(Delta^14*C~"(\u2030)")))
plot of chunk sim_ref_viz
To prepare a dataset for estimation, we need to provide at least one data.frame
with named columns of BY
and C14
that provide the reference series data to the function data_prep
. Internally, data_prep
assesses what dataset(s) you’ve provided and generates a named list object with a two components, 1. a character vector in the flag
component that tells other functions what model was setup and 2. a named list with the data parsed and matched the stan
model format.
df <- data_prep(df_ref = sim_ref)
#> No values provided in df_unk, skipping validation and estimating reference series only
We can also provide further specifications of the knots used in the penalized B-spline with arguments: + knot.min
- sets the minimum number of knots, default of 10. + knot.adj
- divisor in number of knots = number of reference series data points/knot.adj, default is 4. + fixed.knot
- integer number of fixed number of knots that overrides knot.adj
calculation + spline.degree
- polynomial degree, default is 3. + pad.spline
- fraction of the range of formation/birth years in the reference series that is padded to the beginning and end of the knot locations, default is 1%.
We can also change the locations of the predicted reference series through the argument pred.by
. Prediction is accomplished within the model in the generated quantities
section in the stan
model. We can either provide a vector of formation/birth years desired to pred.by
or we can provided a named list to pred.by
to set up a sequence call: + min.by
- minimum year of the formation/birth year prediction sequence + max.by
- maximum year of the formation/birth year prediction sequence + inc.by
- formation/birth year increment of the prediction sequence
Now that we have covered the data preparation, let’s estimate the reference series. This function will also estimate other models by checking the flag
component of the data list setup by the data_prep
function. Additional arguments can be provided to est_model
that control the behavior of the cmdstanr::sample
call. Run ?cmdstanr::sample
to view the possible arguments. For most reference series, the default cmdstanr::sample
arguments work fine. You may wish to set parallel_chains
argument equal to the default number of chains=4
to achieve parallel processing. Though, estimating reference-series only models is generally extremely fast (< 10 secs on the author’s machine).
fit <- est_model(data = df, show_messages = FALSE, show_exceptions = FALSE)
The est_model
function returns a named list with model
containing the stan
model code that was sampled from and fitted
which contains the output from cmdstanr::sample
. Functions that work on CmdStanMCMC
objects produced from cmdstanr::sample
also work on this fit$fitted
object.
fit$fitted$summary(variables = c('C14_pred'))
#> # A tibble: 81 × 10
#> variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 C14_pred[1] -52.4 -52.6 5.18 5.17 -61.0 -44.1 1.00 4162. 4055.
#> 2 C14_pred[2] -52.0 -52.1 4.95 5.04 -60.2 -43.9 1.00 4041. 3736.
#> 3 C14_pred[3] -51.1 -51.0 4.92 4.80 -59.2 -43.0 1.00 3990. 3694.
#> 4 C14_pred[4] -50.2 -50.1 4.92 4.91 -58.2 -41.9 0.999 3778. 3718.
#> 5 C14_pred[5] -50.1 -50.1 4.94 4.92 -58.0 -41.7 1.00 3873. 3685.
#> 6 C14_pred[6] -50.8 -50.8 4.99 4.95 -59.1 -42.5 1.00 4062. 3688.
#> 7 C14_pred[7] -51.1 -51.1 5.10 4.99 -59.6 -42.9 1.00 4076. 3794.
#> 8 C14_pred[8] -49.5 -49.5 4.94 4.89 -57.7 -41.4 1.00 3745. 3586.
#> 9 C14_pred[9] -47.8 -47.8 5.03 5.13 -56.0 -39.6 1.00 3866. 3833.
#> 10 C14_pred[1… -46.9 -46.9 5.08 5.08 -55.4 -38.5 1.00 4187. 3873.
#> # ℹ 71 more rows
Posterior draws of parameters can be accomplished by calling cmdstanr::draws
on the fit$fitted
CmdStanMCMC
object produced through est_model
or a wrapper function is provided as extract_draws
that grabs the relevant parameters based on the the model type that was estimated. This wrapper function inherently skips over some of the nuisance parameters that are estimated pertaining to the knots in the penalized B-spline.
ext <- extract_draws(fit = fit)
Lastly, we can visualize our model fit using plot_fit
. This function takes both the data object from data_prep
and the extracted draws from extract_draws
to generate the plot.
plot_fit(df = df, ext = ext)
plot of chunk plot_ref_viz
BayesianBombCarbon
also has an integrated method for jointly estimating the reference series and the aging bias present in a set of samples with unvalidated formation/birth years. In the penalized B-spline, this requires adjusting the knot contribution to each unvalidated sample each time a new aging bias (formation/birth year adjustment) is trialed. This massively increases the runtime but, in the author’s biased opinion, is well worth the wait. The setup is very similar to the reference series only estimation.
We can used the provided simulated reference series data in sim_ref
and the accompanying unvalidated simulated samples in sim_unk
. The latter have a fixed aging bias of +1 year.
# We can visualize the reference series data
with(sim_ref, plot(BY,C14, pch=16, xlab="Reference/Birth Year",
ylab=expression(Delta^14*C~"(\u2030)"), las = 1))
with(sim_unk,points(BY,C14,pch=16,col='red'))
legend('topleft', legend = c('Reference','Test'), pch = 16, col = c('black','red'))
plot of chunk sim_unk_viz
We need to supply two data.frames
both with named columns of BY
and C14
. One is the reference series, provided to argument df_ref
, and another is the unvalidated samples, provide to arugment df_unk
in the data_prep
function.
df_int <- data_prep(df_ref = sim_ref, df_unk = sim_unk)
An additional argument ll_wt
is added to the $data
object from the data_prep
call that up-weights the reference series data relative to the unvalidated samples. This is to keep the reference series estimation from chasing the unvalidated samples. This is set at a default of 5 but if your reference series is very spotty where you have unvalidated samples, you may wish to increase this value. It has no associated runtime cost.
Just as with the reference series estimation, we call est_model
to estimate the integrated model. We do recommend adjusting the cmdnstanr::sample
arguments when running the integrated model to at least increase the warmup iterations. We also recommend setting up parallel processing with the parallel_chains
argument as these models have longer runtimes. You may also receive a warning that some transistions hit the maximum treedepth of 10. This can be resolved by changing the max_treedepth
argument to > 10.
fit_int <- est_model(data = df_int, iter_warmup = 3000, parallel_chains = 4, show_messages = FALSE, show_exceptions = FALSE)
Just as before, we can extract posterior draws using the extract_draws
wrapper function. Note, that you should provide the data_prep
object to the df
argument so that an additional posterior will be generated of BY_adj
, which adjusts the unvalidated formation/birth years by the posterior of the aging bias/adjustment.
ext_int <- extract_draws(fit = fit_int, df = df_int)
With the posterior parameter draws of the integrated model, you can conduct formal hypothesis testing of whether your ages are valid. This is an explicit advantage of the integrated approach that traditional sum of square approaches do not have. To do this, we recommend the bayestestR
package.
library(bayestestR)
p_map(ext_int$adj, null = 0)
#> MAP-based p-value
#>
#> Parameter | p (MAP)
#> -------------------
#> adj | < .001
This returns the Bayesian equivalent of the p-value, see ?bayestestR::p_map
for more details. The argument null
can be changed to also explicitly test other null ages and ask “is my aging bias significantly than X?”.
Lastly, we can visualize our integrated model using plot_fit
.
plot_fit(df = df_int, ext = ext_int)
plot of chunk plot_unk_viz
Note that additional features are plotted. The original unvalidated formation/birth years are shown in the red circles while the adjusted formation/birth years, along with their credible interval, are shown in the red ticks. Additional, the posterior densities of the aging bias/adjustment is provided. A vertical, dashed red segment will highlight zero if the posterior overlaps or is adjacent to zero. This provides a visual equivalent of the hypothesis test conducted earlier.
BayesianBombCarbon
is under development though the core functionality will remain mostly unchanged in the future. Additional models will be provided, such as the coupled function from Kastelle et al. (2011), that will have the reference series only and integrated method capabilities. General improvements have also been made to the sum of squares aging bias estimation approach, first presented in Kastelle et al. (2008), that will supported soon. If you have suggestions/requests/bugs, please reach out to zsiders@ufl.edu.