Monday, August 22, 2011

Bayesian analysis: Comparing algorithms Part 1?

I recently had the opportunity to engage in some Bayesian analysis at work. I was able to state the problem in terms of the lognormal distribution, and took advantage of JAGS and its integration with R through the R2jags package. The client was very happy, and life moved on. However, as the underlying problem may be better described by a transformed beta than by a lognormal undergoing transforms (say that five times quickly!), it got me thinking about how to teach myself at least rudimentary Bayesian analysis using MCMC methods. As my main text, I have been using Dr. Scott M. Lynch's excellent "Introduction to Applied Bayesian Statistics and Estimation for Social Scientists". Starting small (I haven't worked on the four-parameter distribution yet), I decided to play with data for which I knew the answer and see if I could write my own Gibbs or Metropolis-Hastings algorithms. I picked a gamma distribution as my test case.
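As a rough illustration of the kind of R2jags call I mean (the lognormal model, priors, data, and file name below are just stand-ins, not the actual client model):

library(R2jags)

# Illustrative stand-in model: lognormal likelihood with vague priors
model.string <- "
model {
  for (i in 1:N) {
    y[i] ~ dlnorm(mu, tau)
  }
  mu  ~ dnorm(0, 1.0E-6)
  tau ~ dgamma(0.001, 0.001)
}"
writeLines(model.string, "lognormal.bug")

y <- rlnorm(100, meanlog = 10, sdlog = 1)   # stand-in data
fit <- jags(data = list(y = y, N = length(y)),
            parameters.to.save = c("mu", "tau"),
            model.file = "lognormal.bug",
            n.chains = 3, n.iter = 5000)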


I'm no expert on MCMC methods, so I'm not going to attempt to explain how they work, but I did see how the different algorithms behave, and it is rather insightful. For the gamma distribution, while I always parameterize it using shape and scale, for the purposes of Gibbs sampling I used shape/rate. The nice thing about Gibbs sampling is that as long as you can sample from a distribution which is proportional to the actual conditional distribution, you're fine; you do not have to calculate the constants of proportionality that bring the distribution back to an integral of 1.


In the case of the gamma distribution, and using a flat prior, it is the likelihood which dominates the conditional distribution. Using the shape/rate parameterization, the likelihood over the observations is:

\[ L(\alpha, \beta \mid x) = \prod_{i=1}^{n} \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x_i^{\alpha - 1} e^{-\beta x_i} \]

When looking for a conditional distribution for alpha, this does not resolve into anything remotely nice (although see Fink (1997), "A Compendium of Conjugate Priors", for more), so Gibbs sampling will have to give way to a Metropolis algorithm. I'm not sophisticated enough to use a non-symmetrical proposal distribution, so it is not a Metropolis-Hastings algorithm; maybe in a future post. However, when looking at beta, and discarding the x's and alphas, the core of the likelihood is rather gamma-like. In fact, it is proportional to a gamma distribution with rate equal to the sum of the observed data points and shape equal to one more than the product of the number of observations and the underlying alpha.
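To spell that out: with a flat prior, the conditional for the rate beta is proportional to beta^(n*alpha) * exp(-beta * sum(x)), which is the kernel of a gamma density with shape n*alpha + 1 and rate sum(x). As a quick sanity check (using made-up illustrative data, separate from the test data further down), the unnormalized log conditional should differ from the corresponding dgamma log density only by a constant:

set.seed(1)
x <- rgamma(50, shape = 2.5, rate = 3)     # illustrative data for the check only
alpha <- 2.5
n <- length(x)
beta.grid <- seq(0.5, 6, length.out = 200)

# log of the unnormalized conditional beta^(n*alpha) * exp(-beta * sum(x))
log.cond <- n * alpha * log(beta.grid) - beta.grid * sum(x)
# log of the Gamma(shape = n*alpha + 1, rate = sum(x)) density
log.gam <- dgamma(beta.grid, shape = n * alpha + 1, rate = sum(x), log = TRUE)

# the two should agree up to a normalizing constant, so this range is (numerically) a single value
range(log.cond - log.gam)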

With that observation in hand, and having written the algorithms properly (which took more time than I care to admit), a comparison between Gibbs sampling and straight Metropolis on the scale parameter can be made.

Here is a "pure" Metropolis algorithm to calculate the posterior distribution, assuming the input data is gamma-distributed (this and all "pretty" R code created with Pretty R at inside-R.org):

MHGam <- function (x, n.iter=10000, n.burnin=0){
  # Random-walk Metropolis for both parameters of a gamma model with a flat prior
  alpha <- matrix(.5, n.iter)   # chain for the shape, started at 0.5
  theta <- matrix(10, n.iter)   # chain for the scale, started at 10
  acca <- 0                     # acceptance counter for alpha
  accq <- 0                     # acceptance counter for theta
  # log posterior (up to a constant): gamma log-likelihood with shape a and scale q
  lnpost <- function(a, q, z){
    logL <- (a-1)*log(z) - (z/q) - a*log(q) - lgamma(a)
    return(sum(logL))
  }
  for (i in 2:n.iter)
  {
  #sample theta (the scale) by Metropolis
    acc <- 1
    theta[i] <- theta[i-1] + rnorm(1, mean=0, sd=100)     # symmetric random-walk proposal
    if (theta[i]<=0) {acc <- 0; theta[i] <- theta[i-1]}   # a negative scale is impossible
    LLold <- lnpost(alpha[i-1], theta[i-1], x)
    LL <- lnpost(alpha[i-1], theta[i], x)
    if ((LL - LLold) < log(runif(1))) {
      theta[i] <- theta[i-1]   # reject: keep the old scale
      LL <- LLold              # carry forward the log posterior at the retained value
      acc <- 0
    }
    accq <- accq + acc
  #sample alpha (the shape) by Metropolis
    acc <- 1
    alpha[i] <- alpha[i-1] + runif(1, min=-.5, max=.5)    # symmetric random-walk proposal
    if (alpha[i]<=0) {acc <- 0; alpha[i] <- alpha[i-1]}   # a negative shape is impossible
    if ((lnpost(alpha[i], theta[i], x) - LL) < log(runif(1))) {
      alpha[i] <- alpha[i-1]   # reject: keep the old shape
      acc <- 0
    }
    acca <- acca + acc
    if (i%%100==0) {print(c(i, alpha[i], theta[i], acca/i, accq/i))}
  }
  # posterior means after discarding the burn-in
  amean <- mean(alpha[(n.burnin+1):n.iter])
  qmean <- mean(theta[(n.burnin+1):n.iter])
  Fit <- list(alpha=alpha, theta=theta, amean=amean, qmean=qmean)
  return(Fit)
}

The above code just hands the parameter sampling back and forth between alpha and theta (which is the scale, not the rate) and will, hopefully, eventually settle down. I added two extra parameters for the number of iterations and the burn-in, and eventually I could add another, outer, for loop for multiple chains.
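As a rough sketch of what that outer loop might look like (run.chains is just an illustrative name, not part of the real code), each chain could simply be a separate call to MHGam collected in a list; with the starting values fixed inside MHGam, the chains would differ only through their random seeds:

run.chains <- function(x, n.chains=3, n.iter=10000, n.burnin=0){
  # one MHGam run per chain, each with its own seed
  lapply(seq_len(n.chains), function(chain){
    set.seed(chain)
    MHGam(x, n.iter=n.iter, n.burnin=n.burnin)
  })
}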


The next code takes advantage of Gibbs sampling for the scale parameter:

MHGamGib <- function (x, n.iter=10000, n.burnin=0){
  # Gibbs sampling for theta (the scale) combined with random-walk Metropolis for alpha (the shape)
  alpha <- matrix(.5, n.iter)   # chain for the shape, started at 0.5
  theta <- matrix(10, n.iter)   # chain for the scale, started at 10
  acca <- 0                     # acceptance counter for alpha
  # log posterior (up to a constant): gamma log-likelihood with shape a and scale q
  lnpost <- function(a, q, z){
    logL <- (a-1)*log(z) - (z/q) - a*log(q) - lgamma(a)
    return(sum(logL))
  }
  for (i in 2:n.iter)
  {
  #sample theta: draw the rate from its gamma conditional and invert to get the scale
    theta[i] <- 1/rgamma(1, rate=sum(x), shape=alpha[i-1]*length(x)+1)
  #sample alpha (the shape) by Metropolis
    acc <- 1
    alpha[i] <- alpha[i-1] + runif(1, min=-.5, max=.5)    # symmetric random-walk proposal
    if (alpha[i]<=0) {acc <- 0; alpha[i] <- alpha[i-1]}   # a negative shape is impossible
    if ((lnpost(alpha[i], theta[i], x) - lnpost(alpha[i-1], theta[i], x)) < log(runif(1))) {
      alpha[i] <- alpha[i-1]   # reject: keep the old shape
      acc <- 0
    }
    acca <- acca + acc
    if (i%%100==0) {print(c(i, alpha[i], theta[i], acca/i))}
  }
  # posterior means after discarding the burn-in
  amean <- mean(alpha[(n.burnin+1):n.iter])
  qmean <- mean(theta[(n.burnin+1):n.iter])
  Fit <- list(alpha=alpha, theta=theta, amean=amean, qmean=qmean)
  return(Fit)
}

Note that I'm still parameterizing with the scale: I sample the rate from its gamma conditional and then take its reciprocal.

To set up, run, and compare the models, I used the following code:

set.seed(22)
 
Test <- rgamma(100, shape=2.5, scale=1e5)
 
MHT<-system.time(MH<-MHGam(Test, n.iter=50000))
 
MHGT<-system.time(MHG<-MHGamGib(Test, n.iter=50000))

The Gibbs version runs about 40% faster, comparing MHT with MHGT (at least on my laptop), which stands to reason, as Gibbs sampling is effectively Metropolis sampling where every draw is accepted. What is more fascinating is how quickly the parameter values converge:
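For reference, the elapsed times and the resulting posterior means can be pulled out like this (just a usage sketch; the exact numbers will depend on the machine and the seed):

# elapsed wall-clock time for each run
MHT["elapsed"]
MHGT["elapsed"]

# posterior means of the shape and scale from each sampler
c(MH$amean, MH$qmean)
c(MHG$amean, MHG$qmean)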

Using the following code to overlay the trace plots for theta:

plot(MHG$theta, type="l", col="red")
points(MH$theta, type="l", col="blue")


We see that the Gibbs sampling (in red) gets to the right neighborhood for theta much more quickly than the Metropolis (blue), and gives a more robust (less serially correlated) answer.
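One way to quantify that difference in serial correlation (a quick check of my own, beyond the trace plots) is to compare the autocorrelation functions of the two theta chains:

# autocorrelation of the theta chains; the Gibbs chain should decay much faster
acf(as.vector(MH$theta),  lag.max=50, main="Metropolis theta")
acf(as.vector(MHG$theta), lag.max=50, main="Gibbs theta")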

What is also surprising is how the better estimates for theta affect the random-walk Metropolis algorithm for alpha. Using the following code:

plot(MH$alpha, type="l", col="blue")
points(MHG$alpha, type="l", col="red")

We get the following trace:
Personally, I found that writing the algorithms and seeing the direct results helped me understand the concepts better than just reading, and I'll certainly try to use Gibbs sampling where I can!