---
title: "A Bayesian approach to variable selection using horseshoe priors"
output: html_notebook
---

It won't come as a surprise to anyone who knows me that I have to try a Bayesian approach to variable selection. It also turns out that there's an easy way to do it, since "horseshoe priors" are built into `rstanarm`. What's a horseshoe prior, you ask? Well you can see a few of the details in the next section, or you can just skip to the section where we use it in `rstanarm` if you find the math more confusing than helpful.

## What is a horseshoe prior?

If you really want to know and understand horseshoe priors, you'll need to read a paper by Juho Piironen and Aki Vehtari in the _Electronic Journal of Statistics_,[^1] but here's a brief outline.

The likelihood in a standard linear regression model looks like this:

$$
\eqalign{
y_i &\sim& \mbox{N}(\mu_i, \sigma^2) \\
\mu_i &=& \beta_0 + \sum_j \beta_jx_{ij}  
}
$$

To complete the Bayesian model, we need to specify priors for $\sigma^2$ and the $\beta$s. The default choices in `rstanarm` are:[^2]

$$
\eqalign{
\sigma &\sim& \mbox{Exp}(1) \\
\beta_0 &\sim& \mbox{N}(0, 10) \\
\beta_i &\sim& \mbox{N}(0, 2.5) \quad \mbox{for } i > 0
}
$$

These are the choices we used [here](http://darwin.eeb.uconn.edu/pages/variable-selection/reducing-the-number-of-covariates.nb.html). The basic horseshoe prior affects only the last of these. Specifically,

$$
\eqalign{
\beta_i &\sim& \mbox{N}(0, \tau^2\lambda_i^2) \\
\lambda_i &\sim& \mbox{Cauchy}^+(0, 1) \quad ,
}
$$

where $\mbox{Cauchy}^+$ refers to a half-Cauchy distribution on the positive real line. A Cauchy distribution has very fat tails, so the tendency for any $\beta_i$ is to be either close to zero, because $\lambda_i$ is small, or well away from zero, because $\lambda_i$ is large. $\tau$ sets the total amount of influence the covariates have on the response, rather like $\lambda$ in the [Lasso](http://darwin.eeb.uconn.edu/pages/variable-selection/using-the-lasso.nb.html).
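
To see what that means in practice, here's a quick simulation that is not part of the model code in this post; it just draws $\beta_i$s from the basic horseshoe prior with $\tau$ fixed at an arbitrary value (in the real model $\tau$ gets its own prior).

```{r}
## Illustration only: tau is fixed at an arbitrary value here purely to
## show the shape of the prior; the fitted models below estimate it.
set.seed(4321)
tau_demo <- 0.1
lambda_demo <- abs(rcauchy(10000))                       # half-Cauchy(0, 1)
beta_demo <- rnorm(10000, mean = 0, sd = tau_demo * lambda_demo)
## most draws are essentially zero, but a few are very far from zero
quantile(abs(beta_demo), probs = c(0.5, 0.9, 0.99, 0.999))
```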

As Piironen and Vehtari explain, however, there hasn't been consensus on how to set or estimate $\tau$. They introduce a new version of the horseshoe prior, the _regularized horseshoe prior_, which looks like this:

$$
\eqalign{
\beta_i &\sim& \mbox{N}(0, \tau^2\tilde\lambda_i^2) \\
\tilde\lambda_i &=& \frac{c^2\lambda_i^2}{c^2 + \tau^2\lambda_i^2} \\
\lambda_i &\sim& \mbox{Cauchy}^+(0, 1) \quad .
}
$$

Notice that if $\tau^2\lambda_i^2 \ll c^2$, meaning that $\beta_i$ is close to 0, then we have something very close to the original horseshoe prior. If, on the other hand, $\tau^2\lambda_i^2 \gg c^2$, then the prior on $\beta_i$ is close to $\mbox{N}(0, c^2)$. Now we just[^3] need a prior on $c$. Read Piironen and Vehtari if you want all of the gory details. We'll just use `hs()` as a prior in our model, setting the `global_scale` parameter according to their advice.
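
Here's a small numerical illustration of those two regimes (again, not part of the model code, and the values of $c$ and $\tau$ are just assumed for the example): for small $\lambda_i$ the effective prior standard deviation on $\beta_i$ is essentially $\tau\lambda_i$, as in the basic horseshoe, while for very large $\lambda_i$ it is capped near $c$.

```{r}
## Illustration only: effective prior sd on beta_i, tau * lambda_tilde,
## under the regularized horseshoe for assumed values of c and tau.
c2  <- 1.0                         # c^2, the slab scale squared (assumed)
tau <- 0.05                        # global scale (assumed)
lambda <- c(0.5, 5, 500)           # small, moderate, and huge local scales
lambda_tilde2 <- c2 * lambda^2 / (c2 + tau^2 * lambda^2)
data.frame(lambda         = lambda,
           basic_sd       = tau * lambda,
           regularized_sd = tau * sqrt(lambda_tilde2))
```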

## Setting up the data

Once again we have to start by generating the data.
```{r setup, warning = FALSE, message = FALSE}
library(tidyverse)
library(reshape2)
library(ggplot2)
library(cowplot)
library(mvtnorm)
library(corrplot)

rm(list = ls())
```

```{r}
## intercept
##
beta0 <- 1.0
## regression coefficients
##
beta <- c(1.0, -1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
## pattern of correlation matrix, all non-zero entries are set to the same
## correlation; covariance matrix calculated from individual variances and a 
## single association parameter governing the non-zero correlation coefficients
##
## Note: Not just any pattern will work here. The correlation matrix and
## covariance matrix generated from this pattern must be positive definite.
## If you change this pattern, you may get an error when you try to generate
## data with a non-zero association parameter.
##
Rho <- matrix(nrow = 9, ncol = 9, byrow = TRUE, 
              data = c(1,0,1,0,1,0,1,0,1,
                       0,1,0,1,0,1,0,1,0,
                       1,0,1,0,1,0,1,0,1,
                       0,1,0,1,0,1,0,1,0,
                       1,0,1,0,1,0,1,0,1,
                       0,1,0,1,0,1,0,1,0,
                       1,0,1,0,1,0,1,0,1,
                       0,1,0,1,0,1,0,1,0,
                       1,0,1,0,1,0,1,0,1
                       ))
## vector of standard deviations for covariates
##
sigma <- rep(1, 9)

## construct a covariance matrix from the pattern, standard deviations, and
## one parameter in [-1,1] that governs the magnitude of non-zero correlation
## coefficients
##
## Rho - the pattern of associations
## sigma - the vector of standard deviations
## rho - the association parameter
##
construct_Sigma <- function(Rho, sigma, rho) {
  ## get the correlation matrix
  ##
  Rho <- Rho*rho
  for (i in 1:ncol(Rho)) {
    Rho[i,i] <- 1.0
  }
  ## notice the use of matrix multiplication
  ##
  Sigma <- diag(sigma) %*% Rho %*% diag(sigma)
  return(Sigma)
}
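
## Illustrative check (not in the original post): the covariance matrix
## built from this pattern has to be positive definite, which we can
## verify by confirming that all of its eigenvalues are strictly positive
stopifnot(all(eigen(construct_Sigma(Rho, sigma, 0.8),
                    only.values = TRUE)$values > 0))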

## set the random number seed manually so that every run of the code will 
## produce the same numbers
##
set.seed(1234)

n_samp <- 100
cov_str <- rmvnorm(n_samp,
                   mean = rep(0, nrow(Rho)),
                   sigma = construct_Sigma(Rho, sigma, 0.8))

resid <- rep(2.0, n_samp)

y_str <- rnorm(nrow(cov_str), mean = beta0 + cov_str %*% beta, sd = resid)
dat_1 <- data.frame(y_str, cov_str, rep("Strong", length(y_str)))

cov_str <- rmvnorm(n_samp,
                   mean = rep(0, nrow(Rho)),
                   sigma = construct_Sigma(Rho, sigma, 0.8))
y_str <- rnorm(nrow(cov_str), mean = beta0 + cov_str %*% beta, sd = resid)
dat_2 <- data.frame(y_str, cov_str, rep("Strong", length(y_str)))

column_names <- c("y", paste("x", seq(1, length(beta)), sep = ""), "Scenario")
colnames(dat_1) <- column_names
colnames(dat_2) <- column_names

## saving the results of scale() lets me reuse the centering and scaling
## values later for prediction with new data
##
scale_1 <- lapply(dat_1[, 1:10], scale)
scale_2 <- lapply(dat_2[, 1:10], scale)

## when the scaled values are assigned back into the data frames below,
## the scaling attributes are lost
##
dat_1[, 1:10] <- lapply(dat_1[, 1:10], scale)
dat_2[, 1:10] <- lapply(dat_2[, 1:10], scale)
```
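
Each element of `scale_1` and `scale_2` is the matrix returned by `scale()`, which carries the centering and scaling values as attributes; the out-of-sample prediction code further down retrieves them with `attr()`. For example:

```{r}
## the mean and standard deviation used to scale x1 in data set 1
attr(scale_1$x1, "scaled:center")
attr(scale_1$x1, "scaled:scale")
```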

## Trying the horseshoe prior

Now that we have the data, let's try the horseshoe prior.

```{r, warning = FALSE, message = FALSE}
library(rstanarm)

options(mc.cores = parallel::detectCores())

set.seed(1234)

## n is the number of observations
## D is the number of covariates
## p0 is the expected number of important covariates
##
n <- nrow(dat_1)
D <- ncol(dat_1[,2:10])
p0 <- 3
tau0 <- p0/(D - p0) * 1/sqrt(n)
prior_coeff <- hs(global_scale = tau0, slab_scale = 1)

fit_1 <- stan_glm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9,
                  data = dat_1,
                  prior = prior_coeff,
                  refresh = 0,
                  adapt_delta = 0.999)
## Note: I use the same prior scale here because both data sets have the same number of observations
## and the same expected number of important covariates
##
fit_2 <- stan_glm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9,
                  data = dat_2,
                  prior = prior_coeff,
                  refresh = 0,
                  adapt_delta = 0.999)

predict_1 <- posterior_predict(fit_1)
predict_2 <- posterior_predict(fit_2)

for.plot <- data.frame(Observed = dat_1$y, Predicted = apply(predict_1, 2, mean))
p <- ggplot(for.plot, aes(x = Observed, y = Predicted)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  geom_smooth(method = "lm") +
  ggtitle("Data set 1")
print(p)

for.plot <- data.frame(Observed = dat_2$y, Predicted = apply(predict_2, 2, mean))
p <- ggplot(for.plot, aes(x = Observed, y = Predicted)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  geom_smooth(method = "lm") +
  ggtitle("Data set 2")
print(p)

cat("Results from data set 1\n")
summary(fit_1, digits = 3)

cat("\n\nResults from data set 2\n")
summary(fit_2, digits = 3)
```

There are several things I like about using regularized horseshoe priors in `rstanarm` rather than the Lasso.  

1. I get an assessment of how reliable estimates of the regression coefficients are in addition to a point estimate of what they are. In this case you can see that the 95% credible intervals overlap 0 for every coefficient in the analysis of the first data set, while only the credible interval for `x3` doesn't overlap 0 in the analysis of the second data set. Notice, however, that some of the intervals are very close to not overlapping 0, i.e., the intervals for `x1`, `x3`, and `x9` in the analysis of data set 1 and the intervals for `x1`, `x2`, and `x6` in the analysis of data set 2.

2. I can also plot the estimates and their uncertainty very easily. Plotting the estimates and their uncertainty makes it much easier to pick out the covariates that seem to have an association with the response variable.[^4] Note: The outer intervals in these plots correspond to 90% credible intervals, not 95% credible intervals. You could change that by specifying `prob_outer = 0.95` in the call to `plot()`.

```{r}
p <- plot(fit_1) + ggtitle("Estimates from data set 1")
print(p)
p <- plot(fit_2) + ggtitle("Estimates from data set 2")
print(p)
```
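
For example, to widen the outer interval to 95% as described above:

```{r}
p <- plot(fit_1, prob_outer = 0.95) +
  ggtitle("Estimates from data set 1 (95% outer intervals)")
print(p)
```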

3. I can use the regularized horseshoe prior in a generalized mixed model with `stan_glmer()` (a hypothetical sketch follows below). There is a package to fit the Lasso to a generalized mixed model with `lmer()`, but I haven't tried it, and it isn't built into `lmer()`.
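
Here is a minimal sketch of what that could look like. The data frame `dat_grouped` and the grouping factor `site` are hypothetical (they don't exist in this post); the point is only that the same `hs()` prior object is passed to the `prior` argument, just as with `stan_glm()`.

```{r, eval = FALSE}
## Hypothetical sketch: dat_grouped and site are made-up names used only
## to illustrate the call; prior_coeff is the hs() prior defined above.
fit_mixed <- stan_glmer(y ~ x1 + x2 + x3 + (1 | site),
                        data = dat_grouped,
                        prior = prior_coeff,
                        refresh = 0,
                        adapt_delta = 0.999)
```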

### How stable are the predictions?

If you look back at earlier installments in this series, you'll see that the point estimates we got here aren't too different from what we've seen before, e.g., 0.185 for `x1` in data set 1 here _vs_. 0.239 in data set 1 from the Lasso. Using this Bayesian approach, however, we can see that even though `x3` isn't "significant" in the analysis of the first data set and is in the second,[^5] we don't have good evidence that the estimates are different, because the posterior distributions are broadly overlapping as evidenced by the broadly overlapping credible intervals. In other words, if we pay appropriate attention to the uncertainty of our estimates, they're pretty stable (at least across the two sample data sets we've been exploring).

What about out of sample predictions?

```{r}
new_data <- data.frame(x1 = 4.0, x2 = 4.0, x3 = 4.0,
                       x4 = 4.0, x5 = 4.0, x6 = 4.0,
                       x7 = 4.0, x8 = 4.0, x9 = 4.0)


## re-scale the new data by subtracting mean and dividing by standard deviation
##
new_data_1 <- (new_data - lapply(scale_1[2:10], attr, "scaled:center")) /
               lapply(scale_1[2:10], attr, "scaled:scale")
new_data_2 <- (new_data - lapply(scale_2[2:10], attr, "scaled:center")) /
               lapply(scale_2[2:10], attr, "scaled:scale")
  
predict_1 <- posterior_predict(fit_1, new_data_1)
predict_2 <- posterior_predict(fit_2, new_data_2)

summarize_posterior <- function(x, credible = 0.95, digits = 3) {
  lo_p <- (1.0 - credible)/2.0
  hi_p <- credible + lo_p
  ci <- quantile(x, c(lo_p, hi_p))
  cat(round(mean(x), digits), " (", round(ci[1], digits), ",", round(ci[2], digits), ")\n", sep = "")
}

cat("Data set 1\n")
summarize_posterior(predict_1)
cat("  True answer: ", beta0 + as.matrix(new_data_1) %*% beta, "\n", 
    sep = "")
cat("\nData set 2\n")
summarize_posterior(predict_2)
cat("  True answer: ", beta0 + as.matrix(new_data_2) %*% beta, "\n", 
    sep = "")
```

Those are pretty close to the Lasso predictions, but we have the advantage that they include an indication of how reliable the estimates are, and you can see that we don't have good evidence that the predictions are different from one another even though the point estimates look fairly different. That's the good news. The bad news is that neither of the prediction intervals includes the true value. (Again, the true values differ because the scaling differs between the data sets.) If you're getting the message that out of sample extrapolation is tricky, you're getting the right message. Just imagine how much trickier it would be if the true relationship were non-linear rather than linear.

## How do you pick the global scale?

If you're still paying attention (a) I congratulate you and (b) I owe you an explanation of how I set `prior_coeff` in the call to `stan_glm()`. The explanation is simple. I followed the advice in the help page for "Prior distributions and options" in the `rstanarm` package, i.e., `help("priors")`. I set it to "the ratio of the expected number of non-zero coefficients to the expected number of zero coefficients, divided by the square root of the number of observations." In fact, I cheated even a little more than that. I simply modified the code at [https://mc-stan.org/projpred/articles/quickstart.html](https://mc-stan.org/projpred/articles/quickstart.html).[^6] I haven't played around with other values, but I encourage you to try some. For example, instead of 3 in the code above try 1 and 8 to see how different your results are. 
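
To make the arithmetic concrete: with `p0 = 3` expected non-zero coefficients, `D = 9` covariates, and `n = 100` observations, the `tau0` line in the code above gives $\tau_0 = \frac{3}{9 - 3} \cdot \frac{1}{\sqrt{100}} = 0.05$, which is the value passed to `global_scale`.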

[^1]: Piironen, J., and A. Vehtari. 2017. Sparsity information and regularization in the horseshoe and other shrinkage priors. _Electronic Journal of Statistics_ 11(2):5018-5051. doi: [10.1214/17-EJS1337SI](https://doi.org/10.1214/17-EJS1337SI)

[^2]: Note: You can check this easily using `prior_summary()`. See above (in HTML) or here (in an R notebook) to see how, using the built-in `mtcars` data set as an example.

```{r}
mtcars$mpg10 <- mtcars$mpg/10
fit <- stan_glm(mpg10 ~ wt + cyl + am, data = mtcars, QR = FALSE, refresh = 0)
prior_summary(fit)
```

[^3]: If you've been following along this far, you probably don't regard this last step as qualifying for the word "just", but it is the last step - really.

[^4]: If you're uncomfortable with Bayesian inference, it's worth noting that the covariates identified in the two analyses are pretty similar. In fact, I singled out the same covariates as important in the analysis of data set 2 here as the Lasso identified, and all of the covariates I identified here in the analysis of data set 1 were also identified in analysis using the Lasso (the Lasso identified three more).

[^5]: In the sense that the 95% credible interval for `x3` overlaps 0 in analysis of the first data set and doesn't in the second.

[^6]: If you visit [https://mc-stan.org/projpred/articles/quickstart.html](https://mc-stan.org/projpred/articles/quickstart.html), you'll see that it describes a package called `projpred`. We'll explore that in the next installment.