## Monday, August 22, 2011

### Bayesian analysis: Comparing algorithms Part 1?

I recently had the opportunity to engage in some Bayesian analysis at work. I was able to state the problem in terms of the lognormal distribution, and took advantage of JAGS and its integration with "R" using the R2jags package. The client was very happy, and life moved on. However, as the underlying problem may be better described by a transformed beta than a lognormal undergoing transforms (say that five times quickly!), it got me to thinking about how to teach myself at least rudimentary Bayesian analysis using MCMC methods.As my main text, I have been using Dr. Scott M. Lynch's excellent text, "Introduction to Applied Bayesian Statistics and Estimation for Social Scientist". Starting small (I haven't worked on the four-parameter distribution yet), I decided to play with data for which I knew the answer, and see if I could write my own Gibbs or Metropolis-Hastings algorithms. I picked a gamma distribution as my test case.

I'm no expert on MCMC methods, so I'm not going to even attempt to explain how this works, but I did see how the different algorithms behave, and it is rather insightful. For the gamma distribution, while I always parameterize it using shape and scale, for the purposes of Gibbs sampling, I used shape/rate. The nice thging about Gibbs sampling is that as long as your can sample from a distribution which is proportional to the actual conditional distribution, you're fine, and do not have to calculate the constants of proportionality that bring the distribution back to an integral of 1.

In the case of the gamma distribution, and using a flat prior, it is the likelihood which dominates the conditional distribution. Using the shape/rate parameterization, $\frac{\beta^\alpha x^{\alpha-1}e^{-\beta x}}{{\Gamma(\alpha)}}$, the likelihood over the observations $x_i$ is:

When looking for a conditional distribution for alpha, this does not resolve into anything remotely nice (although see Fink (1997) "A Compendium of Conjugate Priors" (pdf) for more) so Gibbs sampling will have to give way to a Metropolis algorithm. I'm not sophisticated enough to use a non-symmetrical proposal distribution, so it is not a Metropolis-Hastings algorithm, maybe in a future post. However, when looking at beta, and discarding the x's and alphas, the core of the likelihood is rather gamma-like. Actually, it is proportional to a gamma distribution with the rate equal to the sum of the observed data points and the shape equal to one more than the product of the number of observations and the underlying alpha.

With that observation, and properly writing the algorithm (which took more time than I care to admit), a comparison between using Gibbs sampling or straight Metropolis on the scale parameter can be made.

Here is a "pure" Metropolis algorithm to calculate the posterior distribution, assuming the input data is gamma (this and all "pretty" R code created by Pretty R at inside-R.org :

MHGam <- function (x, n.iter=10000, n.burnin=0){
alpha <- matrix(.5, n.iter)
theta <- matrix(10, n.iter)
acca<- 0
accq <- 0
lnpost <- function(a, q, z){
NLL <- (a-1)*log(z)-(z/q)-a*log(q)-lgamma(a)
return(sum(NLL))
}
for (i in 2:n.iter)
{
#sample theta
acc <- 1
theta[i] <- theta[i-1] + rnorm(1, mean=0, sd=100)
if (theta[i]<=0) {acc=0; theta[i] <- theta[i-1]}
LL <- lnpost(alpha[i-1], theta[i], x)
if ((LL - lnpost(alpha[i-1], theta[i-1], x)) < log(runif(1))) {
theta[i] <- theta[i-1]
acc <- 0
}
accq <- accq + acc
#sample alpha
acc <- 1
alpha[i] <- alpha[i-1] + runif(1, min=-.5, max=.5)
if (alpha[i]<=0) {acc=0; alpha[i] <- alpha[i-1]}
if ((lnpost(alpha[i], theta[i], x) - LL) < log(runif(1))) {
alpha[i] <- alpha[i-1]
acc <- 0}
acca <- acca+acc
if (i%%100==0) {print(c(i, alpha[i], theta[i], acca/i, accq/i))}
}
amean <- mean(alpha[n.burnin+1:n.iter])
qmean<-mean(theta[n.burnin+1:n.iter])
Fit <- structure(list(alpha=alpha, theta=theta, amean=amean, qmean=qmean))
return(Fit)
}

The above code will just hand off the parameter sampling between alpha and theta (which is the scale and not rate) and will, hopefully, eventually settle down. I added two extra parameters for iterations and burnin, and eventually, I could add another, outer, for-next loop for multiple chains.

The next code takes advantage of Gibbs sampling for the scale parameter:

MHGamGib <- function (x, n.iter=10000, n.burnin=0){
alpha <- matrix(.5, n.iter)
theta <- matrix(10, n.iter)
acca<- 0
lnpost <- function(a, q, z){
NLL <- (a-1)*log(z)-(z/q)-a*log(q)-lgamma(a)
return(sum(NLL))
}
for (i in 2:n.iter)
{
#sample theta
theta[i] <- 1/rgamma(1, rate=sum(Test), shape=alpha[i-1]*length(x)+1)
#sample alpha
acc <- 1
alpha[i] <- alpha[i-1] + runif(1, min=-.5, max=.5)
if (alpha[i]<=0) {acc=0; alpha[i] <- alpha[i-1]}
if ((lnpost(alpha[i], theta[i], x) - lnpost(alpha[i-1], theta[i], x)) < log(runif(1))) {
alpha[i] <- alpha[i-1]
acc <- 0}
acca <- acca+acc
if (i%%100==0) {print(c(i, alpha[i], theta[i], acca/i))}
}
amean <- mean(alpha[n.burnin+1:n.iter])
qmean<-mean(theta[n.burnin+1:n.iter])
Fit <- structure(list(alpha=alpha, theta=theta, amean=amean, qmean=qmean))
return(Fit)
}

Note that I'm still parameterizing using the scale by sampling the rate and using its reciprocal.

To set up, run, and compare the models, I used the following code:

set.seed(22)

Test <- rgamma(100, shape=2.5, scale=1e5)

MHT<-system.time(MH<-MHGam(Test, n.iter=50000))

MHGT<-system.time(MHG<-MHGamGib(Test, n.iter=50000))

The Gibbs version runs about 40% faster, comparing MHT with MHGT (at least on my laptop), which stands to reason as Gibbs sampling is Metropolis sampling where every variate is accepted. What is more fascinating is how fast the parameter values converge:

Using the following code to overlay the trace plots for theta:

plot(MHG$theta, type="l", col="red") points(MH$theta, type="l", col="blue")

We see:
The Gibbs sampling (in red) gets to the right neighborhood much more quickly for theta than the Metropolis (blue), and gives a more robust (less serially correlated) answer.

What is also surprising is how the better estimates for theta affect the random-walk Metropolis algorithm for alpha. Using the following code:

plot(MH$alpha, type="l", col="blue") points(MHG$alpha, type="l", col="red")

We get the following trace:
Personally, I found that, writing the algorithms and seeing the direct results helped me understand the concepts better than just reading, and I'll certainly try to use Gibbs sampling where I can!

1. The Metropolis algorithm isn't performing well here because the joint posterior for alpha and theta is highly correlated (plot alpha against theta to see this). Your proposal distribution is proposing moves for alpha and theta that are independent of each other, but it would be better to propose moves in a correlated direction along the high probability surface.

In this case, it's probably better to just do Gibbs, but in cases where you can't write down the necessary exact conditional distributions, you can try an adaptive Metropolis approach. Compute the covariance of the poorly converged chain and use it as the proposal distribution to compute second, better converged chain. For example, once you've computed alpha and theta as above, you can compute the covariance of this "preliminary" chain:

precov = cov(cbind(alpha,theta))

Then run the Metropolis algorithm again. Except this time, in your proposal distribution, propose a new alpha and a new theta at the same time, drawn from a (correlated) bivariate normal distribution centered on the existing point:

new.point = rmvnorm(1, mean=c(alpha[i],theta[i]), Sigma=precov)
alpha[i+1] = new.point[1]; theta[i+1] = new.point[2]

(The 'rmvnorm' function may be found in the 'mvtnorm' package, or you can implement multivariate normal sampling yourself.)

It's not always optimal to have the proposal distribution be identical to the covariance of the target posterior, so you may have to scale the 'precov' matrix (e.g., multiply it by 0.5) to get a good acceptance rate (around 25%).

This won't work perfectly, since a multivariate normal proposal distribution doesn't look like the actual posterior (which is curved in alpha-theta space), but it will work much better than independent proposals. (I tried it on your problem and it looks a lot closer to the Gibbs output.)

See Roberts and Rosenthal, "Examples of adaptive MCMC" for more advanced methods.

Also, I know this was a teaching example for yourself, but you can get faster performance by using an R package for MCMC like the 'metrop' function in the 'mcmc' package.

2. Actually, the simplest thing you could do to improve performance is just to make the theta step size much bigger. Try sd=10000 instead of sd=100. You can see from the Gibbs plot that the spread of theta is of order 10^4. (Again, in an adaptive sense, you could compute the standard deviation of the thetas you got in the first poorly converged Metropolis run, and use that as the standard deviation of your proposal distribution.)

3. Thank you very much, Dr. Urban. I will definitely try your suggestions! I agree, the existing packages for R will work much better than my home-brewed code, but I do feel that I have a somewhat better understanding of the how the algorithms are supposed to work now that I have made some attempt at writing them from scratch.

In your research, do you use the existing packages, or have you written custom-tailored code for your use?

Thank you again!

4. I usually use the metrop() function I mentioned above, but for some special applications I've written my own samplers (e.g., to sample different groups of variables in different blocks, each with their own proposal distribution). I've also worked with more advanced Monte Carlo algorithms (like Hamiltonian Monte Carlo, which uses gradient information to speed up convergence ... assuming that it's easy and fast to compute partial derivatives of your posterior).

By the way, for future reference, the chains you plotted for the Metropolis algorithm are symptomatic of either a too-small step size or a correlated posterior or both.

What you see in the traces is that the algorithm is taking very small steps to gradually approach a good region of probability space. That either means the step size is too small and can be simply increased, or else the posterior distribution is so correlated that with independent steps you have to take small steps.

The two solutions I mentioned are the typical remedies: either increase the step size, or take correlated steps. If there are a lot of correlated variables some people like to break them into small blocks of related parameters (as I mentioned above) and try to move them separately, so a failure to move in one block won't affect the parameters in other blocks.

By contrast, the graphical diagnostic of a step size that's too large is a chain that looks very "stepwise": you get lots of rejections, so the chain stays in the same place for a long time before finally moving.

One rule of thumb is to adjust the step sizes until you get your acceptance rate to about 20% or so. (This isn't a hard rule, and the effective sample size is a better measure of how much information is in your chain.) If you have a small acceptance rate, take smaller steps; if you have a high acceptance rate, take bigger steps. In your original example, I think it was producing around a 70% acceptance rate, suggesting that you could take bigger steps in at least one of the variables.

5. If you're trying to learn Bayesian data analysis using MCMC methods, I can happily recommend a book. (I happen to be the author of the book :-) but apparently other people seem to like it too.) Here are links to the book's blog and web page:
http://doingbayesiandataanalysis.blogspot.com/
http://www.indiana.edu/%7Ekruschke/DoingBayesianDataAnalysis/
Happy Bayesian analyzing!