Selection function in STAN

Blog

I’m currently working on a problem where I need to sample from a very specific custom pdf. It has taken me a lot of time to get this pdf written down, but now that I finally have it, I’m just missing one key ingredient… the selection function! This is when you have some underlying data (let’s say some images of stars), but the observation of that data has been truncated (for example, limitations of the telescope mean we can only see stars brighter than 25 mag) and then scattered by some noise. The selection function is the truncation: it is how your sample is selected from the underlying population. Generally the selection is unknown, so the best way to tackle it is to fit for it.
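
Written out, the generative model I use in the toy examples below is: a true value is drawn from the underlying population, it only makes it into the sample if it passes the cut, and the observed value is the true value scattered by measurement noise,

x_{\rm true} \sim P(x), \qquad x_{\rm true} > x_{\rm cut}, \qquad x_{\rm obs} \sim \mathcal{N}(x_{\rm true}, dx).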

It took me a really long time (2 days!) to sit down and figure out how to do a selection function in STAN. The problem was that I couldn’t really find any working examples on the ‘interwebs’, but it turns out that it’s actually really easy and really fast.

Here I demonstrate with some easy examples.

Case 1: Simple Gaussian.

First we need to generate some toy data.

nx = 1000 #number of data points in population
mu = 7 
sig = 1.5
x = rnorm(nx, mu, sig) #underlying population values

cut = 6
xcut = x[which(x > cut)] #sample values after selection
n = length(xcut)

dx = 0.2 #uncertainty on x
xobs = xcut + rnorm(n, 0,dx ) #observed values

#make kernel density plot of the distributions
plot(density(x), ylim=c(0,0.4), main='', col='red', xlab='x', ylab='unnormalised density')
abline(v=cut, col=rgb(0,0,0,0.3))
lines(density(xcut), lty='dotted', col='red')
lines(density(xobs), lty='dashed')
legend('topleft', legend=c('population','sample', 'observed sample'), col=c('red', 'red', 'black'), lty=c('solid','dotted','dashed'))

[Figure: kernel density estimates of the population (solid red), the sample after selection (dotted red) and the observed sample (dashed black), with the cut marked by a vertical line.]

The STAN model should look something like this:

mymodel <-"
data {
 int n;
 real xobs[n]; //observed x
 real dx;
}

transformed data {
 // real xcut = 6.0; //use for a fixed selection... generally selection is unknown!
}

parameters {
 real xcut; //selection value
 real mu;
 real<lower=0> sigma; //constrained to be positive
 real xtrue[n]; //true sample x
}

model {
 xcut ~ normal(6,1);
 for(i in 1:n){
  /* we truncate a lower bound at xcut using T[lower,upper]. This already includes normalisation. Also note with truncation we need to do sampling in a for loop! */
  xtrue[i] ~ normal(mu, sigma) T[xcut,];
 }
 xobs ~ normal(xtrue, dx);
}

"

We then run the code, with sensible initialisation. Pro-tip: make sure to initialise xtrue above the selection cut!

 nchains = 3 #number of chains

#initialisation
 init1 <- lapply(1:nchains, function(x) list(mu=rnorm(1, 7,1), sigma=rnorm(1,1.5,0.2), xtrue=rnorm(n, 8,0.1), xcut=6.0))

fit = stan(model_code = mymodel, data=list(n=n, xobs=xobs, dx=dx),
 init=init1, chains=nchains, iter=2000,
 control=list(adapt_delta=0.8))

You should get something that looks like this:

print(fit, pars=c('xcut','mu','sigma') )

Inference for Stan model: 4eebe5b93a44437492c76c544c69646a.
3 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=3000.

mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
xcut 6.01 0.00 0.04 5.93 5.98 6.01 6.04 6.08 110 1.03
mu 6.83 0.01 0.22 6.36 6.71 6.86 6.99 7.17 314 1.01
sigma 1.61 0.00 0.11 1.42 1.53 1.60 1.67 1.85 488 1.01


Samples were drawn using NUTS(diag_e) at Wed Jan 24 10:02:33 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

pairs(fit, pars=c('mu','sigma'))

[Figure: pairs plot of the posterior samples of mu and sigma.]
We can also make some posterior predictive checks. Here we show some data generated from a Gaussian with the fitted mean and standard deviation (black); as you can see, they agree really well with the population.

params=extract(fit)
xfit=colMeans(params$xtrue)
mufit = mean(params$mu)
sigfit = mean(params$sigma)

plot(density(rnorm(n, mufit, sigfit)), col=rgb(0,0,0,0.5), lty='dotted', main='', xlab='x', ylim=c(0,0.4))
lines(density(rnorm(n, mufit, sigfit)), col=rgb(0,0,0,0.5), lty='dotted')
lines(density(rnorm(n, mufit, sigfit)), col=rgb(0,0,0,0.5), lty='dotted')
lines(density(rnorm(n, mufit, sigfit)), col=rgb(0,0,0,0.5), lty='dotted')
lines(density(x), col=rgb(1,0,0,0.5))
lines(density(xobs),col=rgb(1,0,0,0.5), lty='dashed')
legend('topleft', legend=c('true', 'observed', 'predicted'), col=c('red', 'red', 'black'), lty=c('solid','dashed','dotted') )

[Figure: posterior predictive check; predicted densities (dotted black) compared with the true population (solid red) and the observed sample (dashed red).]

Case 2: Custom distribution

This is where things get a bit more tricky, but it still isn’t a challenge for STAN. Before we jump into the coding, we should discuss the boring part (the math!). When we sample from a truncated pdf, what we are really doing is sampling from

x \sim P_{[a,b]} (x) = \frac{P(x)}{\int_a^b P(u) du} .

A lower selection limit then looks like,

x \sim P_{[xcut,\infty]} (x) = \frac{P(x)}{\int_{xcut}^\infty P(u) du} .

STAN can’t compute integrals directly, but we can evaluate one by solving an ODE. Unfortunately, the denominator here is an improper integral, and STAN doesn’t deal well with improper integrals, so we re-parameterise the integral in the following way:

\int_{xcut}^\infty P(x) dx = \int_0^1 P\left(xcut + \frac{x}{(1-x)} \right)/(1-x)^2 dx.

Now the next problem is that this integral is numerically unstable: at x=1 the integrand involves a division by zero. To get around this we can just set the upper limit to a number slightly smaller than 1, e.g. 0.9999.
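
As a quick sanity check of the substitution (plain R, not part of the STAN model), we can compare the re-parameterised integral against a known answer. For a single Gaussian the tail probability is just 1 - pnorm, and the two should agree to several decimal places:

mu1 = 4; sig1 = 1.5; xc = 3
pnorm(xc, mu1, sig1, lower.tail=FALSE) #exact tail probability, ~0.75
integrate(function(u) dnorm(xc + u/(1-u), mu1, sig1)/(1-u)^2, 0, 0.9999)$value #re-parameterised integral, should agree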

Okay, so now with the theory out of the way, we can move on to the code. We again generate some toy data, this time from a custom distribution:

n = 2000 #population size
x = c(rnorm(n/2, 4, 1.5), rnorm(n/2, 7,1)) #x is drawn from a mixture model
cut = 3
xsam = x[which(x > cut)]
ns = length(xsam) #sample size after selection

dx = 0.5 #measurement uncertainty
xobs = xsam + rnorm(ns, 0, dx) #observed sample

plot(density(x), ylim=c(0,0.3), main='', xlab='x', ylab='unnormalised density', col='red')
lines(density(xsam), col='red', lty='dotted')
lines(density(xobs), lty='dashed')
abline(v=cut, col=rgb(0,0,0,0.3))
legend('topleft', legend=c('population','sample', 'observed sample'), col=c('red', 'red', 'black'), lty=c('solid','dotted','dashed'))

[Figure: density of the mixture population (solid red), the sample after selection (dotted red) and the observed sample (dashed black), with the cut marked by a vertical line.]

We define the STAN model as follows:

custmodel<-"
 functions {
 real custom(real y, real mu1, real mu2, real sigma1, real sigma2){
  //custom pdf
  return 0.5*exp(- square(mu1- y) / (2 * sigma1^2) )/(sigma1*sqrt(2*pi())) +
 0.5*exp(- square(mu2 - y) / (2 * sigma2^2) )/(sigma2*sqrt(2*pi())) ;
 }

real[] N_integrand(real y, real[] state, real[] params, real[] x_r, int[] x_i) {
 //ode to solve
 real mu1 = params[1];
 real mu2 = params[2];
 real sigma1 = params[3];
 real sigma2 = params[4];
 real xcut = params[5];
 real dxdy[1];
 real ynew = xcut + y/(1-y);

 dxdy[1] = custom(ynew, mu1, mu2, sigma1, sigma2)/square(1-y);

 return dxdy;
 }
}

data {
 int n;
 real xobs[n];
 real dx;
}

parameters{
 real xcut;
 ordered[2] means; //ordered because mixture models have identifiability issues
 real<lower=0> sigs[2]; //constrained to be positive
 real xtrue[n];
}

model{
 real norm;
 real theta[5]={means[1], means[2], sigs[1], sigs[2], xcut};

 xcut ~ normal(4.0,1.0);

/* We have to re-parameterise the improper integral int_xcut^inf p(x) dx to int_0^1 p(xcut + x/(1-x))/(1-x)^2 dx */
 norm = integrate_ode_rk45(N_integrand, rep_array(0.0,1), 0.0, rep_array(0.9999,1),
 theta, rep_array(0.0,0),rep_array(0,0))[1,1];

 for(i in 1:n){
  target += log(custom(xtrue[i], means[1], means[2], sigs[1],sigs[2]));
  if(xtrue[i] < xcut){
   target+=negative_infinity();
  }else{
   target+=-log(norm); //normalise pdf
  }
 }
 xobs ~ normal(xtrue, dx);

}

"

Note that the custom distribution is a Gaussian mixture model. Recall in a previous post (Multivariate Gaussian Mixture Model done properly) I discussed the problems with mixture models and their identifiability issues. For this demonstration I only run 1 very long chain, since the Bayesian Fraction of Missing Information was low and I needed a lot more iterations and warm-up to achieve a good effective sample size. Here are the results:

nchains = 1
init1 <- lapply(1:nchains, function(x) list(means=c(4.0,7.0), sigs=c(1,1), xtrue=rep(6.0,ns), xcut=3.0))

fit = stan(model_code = custmodel, data=list(n=ns, xobs=xobs, dx=dx), init=init1, chains=nchains, iter=5000, control=list(adapt_delta=0.8))
print(fit, pars=c('means','sigs','xcut'))

Inference for Stan model: 0ff67e36d860e76473b011bccd66c477.
1 chains, each with iter=5000; warmup=2500; thin=1;
post-warmup draws per chain=2500, total post-warmup draws=2500.

mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
means[1] 3.69 0.04 0.42 2.82 3.41 3.71 4.00 4.43 112 1
means[2] 6.88 0.01 0.11 6.65 6.81 6.89 6.96 7.08 156 1
sigs[1] 1.42 0.00 0.13 1.20 1.33 1.41 1.49 1.71 684 1
sigs[2] 1.04 0.00 0.07 0.93 1.00 1.04 1.09 1.18 246 1
xcut 3.06 0.01 0.09 2.86 3.02 3.07 3.13 3.21 61 1

Samples were drawn using NUTS(diag_e) at Thu Jan 25 11:08:37 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

pairs(fit, pars=c('means','sigs', 'xcut'))

[Figure: pairs plot of the posterior samples of means, sigs and xcut.]

# posterior predictive checks
params=extract(fit)
means = colMeans(params$means)
sigs = colMeans(params$sigs)

plot(density(xobs), main='', xlab='x', ylab='unnormalised density', ylim=c(0,0.4), lty='dashed', col=rgb(1,0,0,0.5))
lines(density(c(rnorm(n/2, means[1], sigs[1]),rnorm(n/2, means[2], sigs[2]))), col=rgb(0,0,0,0.5), lty='dotted')
lines(density(c(rnorm(n/2, means[1], sigs[1]),rnorm(n/2, means[2], sigs[2]))), col=rgb(0,0,0,0.5), lty='dotted')
lines(density(c(rnorm(n/2, means[1], sigs[1]),rnorm(n/2, means[2], sigs[2]))), col=rgb(0,0,0,0.5), lty='dotted')
lines(density(c(rnorm(n/2, means[1], sigs[1]),rnorm(n/2, means[2], sigs[2]))), col=rgb(0,0,0,0.5), lty='dotted')
lines(density(x), col=rgb(1,0,0,0.5))
abline(v=cut, lty='dashed',col='red')

legend('topleft', legend=c('true', 'observed', 'predicted'), col=c('red', 'red', 'black'), lty=c('solid','dashed','dotted') )

[Figure: posterior predictive check for the custom model; predicted densities (dotted black) compared with the population (solid red) and the observed sample (dashed red), with the cut marked by a vertical line.]

Overall, good agreement. So yeah. Selection function in STAN? Easy peasy! Also see https://github.com/farr/SelectionExample and https://github.com/farr/GWDataAnalysisSummerSchool/blob/master/lecture3/lecture3.ipynb for some more examples!

My city Coventry

Blog

Okay, so a few weeks ago at work a colleague asked me where I was from in the UK, and I replied,
“Oh, it’s a small little town, you’ve probably never heard of it, Coventry. It’s close to Birmingham.”
This is a natural response for me, given that I lived in the US for a year and the only UK city anyone had heard of was London.

Actually, the worst conversation I ever had went something along the lines of:

Her: “You’re from England? Wow, your English is so good!”
Me *confused*: “Erm, yeah, because English people tend to speak English…”
Her: “Oh really, I thought you spoke French there.”
Me: *facepalm*

Anyway. So it turned out that my colleague had indeed heard of Coventry, since there are at least 2 other ESA employees working here in Spain who are from Coventry.

Statistically speaking, this is a super amazing opportunity for me to calculate the probability of that actually happening!

So let’s say there are ~400 employees at ESA, Spain.

Since employment is swayed by the contribution of ESA member states and the UK’s slice is ~8%, we expect there to be:

0.08 \times 400 = 32 employees at ESA from the UK.

The probability of selecting someone from Coventry out of the whole of the UK is given by the ratio of their populations:

\frac{\rm Coventry's\ population}{\rm UK\ population} = \frac{3.2\times 10^5}{6.6\times 10^7} = 0.005

So we would expect

0.005 \times 32 = 0.16 employees from Coventry working at ESA Spain!

Since we have at least 3 of us, the probability of us all being here is tiny! Note also that we come from all over Coventry and went to completely different schools, so it’s unlikely that there is some sort of selection bias. Nonetheless, good job, Coventry, on producing some great scientists!
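
Out of curiosity, a quick back-of-the-envelope check in R: if we assume the number of Coventrians at ESA Spain follows a Poisson distribution with mean 0.16, the chance of ending up with 3 or more of us is

ppois(2, lambda=0.16, lower.tail=FALSE) #P(3 or more Coventrians), roughly 6e-4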

Astronaut Andreas Mogensen

Blog

Earlier this week I visited ESA’s European Astronaut Centre (EAC) in Cologne, Germany. I’ve been very fortunate, as this is the 3rd time I have been to the facility. The last time was pretty cool – I mean, it’s not every day you get a personal tour by ESA astronaut Andreas Mogensen, especially whilst he’s preparing to go to space!

Well, last week Andreas Mogensen finally made his debut space flight to the International Space Station. For those of you who don’t know much about Andreas, here are 5 facts that you really ought to know:

1. He joined the European Astronaut Corps in 2009 and has been training for 6 years for his 10-day journey aboard the ISS

2. He is the 1st astronaut from Denmark, and to celebrate, the Danish company LEGO made 20 custom LEGO figurines to keep him company whilst in space. The LEGO toys will be prizes for kids who come up with the best video of Andreas’ story

3. Originally his trip to the ISS was supposed to take 6 hours, but instead it took 2 days to avoid space junk. That’s a long time to be stuck in the Soyuz – it’s a fifth the size of the shuttle orbiter!

4. When Andreas and fellow crew members boarded the ISS, they brought the total number of inhabitants up to 9. The ISS was only built for 6 astronauts!

5. His main task is of course science. His experiments include testing a new water-cleaning system, hands-free goggles similar to Google Glass, a tight-fitting suit that mimics the effects of gravity, and controlling rovers on Earth to prepare for future missions on Mars.

At EAC, I was given the opportunity to sit in on a conference call with the ISS and space agencies from around the world. It was unbelievable that I was on a LIVE chat with all 9 astronauts in space! Space has never seemed closer than in that moment, and it is a memory that I will cherish always. Follow Andreas’ journey on Twitter @Astro_Andreas.

[Photo: quarantine in Baikonur]

3 nights as an astronomer

Blog
Hola

Welcome to Chile!

I had the rare opportunity to go observing on the 4m Blanco telescope on Cerro Tololo, Chile. The telescope itself was built in the ’70s, so it’s mind-blowing that it is still functional today. This telescope, like many other similar oldies, is constantly given new life with the installation of new instruments. Right now, that would be DECam, a CCD imager whose scientific purpose is to understand cosmic acceleration.

I took the evening shuttle up the mountain, with wild horses, goats and cacti of all shapes and sizes making appearances along the way. The infrastructure they have in place here is amazing, with roads carved into the endless ripples of dry desert hills. The journey took about 1.5 hrs from the local town of La Serena, but it would’ve taken a lot longer on donkey-back as they did back in the day. The facility on the mountain was impressive: fully equipped with internet, water and electricity, and the meals were delicious. Every night whilst observing, the cooks would prepare us a night lunch – it was definitely needed. It’s currently winter in Chile, which means the nights are long.


In between the hills you can clearly see the fog that is the inversion layer.


Evening twilight on Cerro Tololo. Both observatories shown aren’t the ones I worked in, but they were just as impressive.

Just before nightfall, I would walk the path up to the observatory. The observatory is huge! I could actually see it miles away whilst driving up the mountain. A shining orb of the knowledge to come, the dome is made of a reflective material in order to maintain the cool temperature within. At 7.30pm, I would head outside to watch the sunset. It was beautiful every time. The hills below were endless and engulfed within a hazy fog they call the inversion layer. If the inversion layer were not present, we would be able to see the South Pacific Ocean; however, it would also mean bad weather to come. As the sun set, the snow-topped mountains would light up pink and twilight would begin. Back in the control room we had to wait until the sun was 14 degrees below the horizon before we could begin. Once it was, however, it was non-stop until 7.30am the next morning! Thankfully the observatory was fully equipped with a kitchen and toilets, and the main thing was that it was warm!

Each night, I would take a break and go outside. It was always pitch black, which was quite worrying since I’m sure countless individuals have blindly stepped off the mountain to their deaths. The skies were unbelievably clear whilst I was observing, and it was easy to see the Milky Way. In deserted locations like these, where the light pollution from civilisation is non-existent, there are always countless more stars than the eye can take in. Just before morning twilight, I observed the zodiacal light for the first time – a hazy glow in the east caused by sunlight scattered off interplanetary debris.


Dr Chris Lidman and me with the 4m Victor Blanco Telescope

For my observing run, I was accompanying Dr. Chris Lidman, a staff scientist at the Australian Astronomical Observatory (AAO). This gave me the opportunity to tour the telescope itself! Up 5 levels in an elevator is where the telescope lives. It was gigantic! In fact, below the telescope there was a net to catch people in case they fell off! Mounted on the sides of the dome were 2 giant white screens that are illuminated by LED lights for flat fielding. The floor below housed the aluminising chamber, into which the mirror is lowered every 2 years. Here the old aluminium surface is removed using hydrochloric acid and a new aluminium layer is deposited. The whole process takes 6 days! The mirror is also regularly cleaned with CO2, which is done by walking onto the mirror itself! Another interesting thing I found was the cryogenic pump. This pump is used to maintain DECam’s temperature and is constantly roaring day and night. It kinda sounds like the Tardis. This amazing piece of equipment needs maintenance every 7 months, costs $35,000, and will put the observatory out of use for a week.

I was lucky enough to get 3 clear nights of observing in total and am very sad to leave. However I really need to catch up on some sleep now. Adios Chile.