Will there be a last man on the moon?


I was born in 1990. Mankind first stepped on the moon on July 20th 1969. We did it again in November of the same year, again in February and July of 1971, and in April and December of the following year.

In total, 12 men have set foot on another celestial body. I have never known a day when I was not sharing the Earth with a lunar astronaut. But today, 8 of those 12 heroes have passed away. The 4 remaining, Buzz Aldrin, David Scott, Charles Duke and Harrison Schmitt, are aged 89, 87, 83 and 84 respectively.

We haven’t been back to the moon since 1972. NASA have a target to get another human there with the Artemis mission by 2024, but is this already too late?

So my question to you is this: is it likely that we will see another human on the moon before all our lunar astronauts go extinct, or will it soon be the day of the last man on the moon?

We can use statistics to help us infer the odds.

Firstly, let’s take mortality predictions from the 2016 US Social Security actuarial life table.

data = read.table('life_exp.txt',header=TRUE)

You can download the data here.
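
The code that follows assumes the table contains an age column and a pdeath_male column giving the one-year probability of death for a male of that age, so it is worth a quick peek after loading:

head(data[, c('age', 'pdeath_male')]) #check the two columns used below are present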

Next we calculate the probability that an astronaut will have died within N years, given their current age. This follows the recurrence,

P(\text{dead within } N \text{ yrs} \mid \text{current age}) =
P(\text{dies in year } N \mid \text{age in year } N) \times \left[ 1 - P(\text{dead within } N-1 \text{ yrs} \mid \text{current age}) \right] + P(\text{dead within } N-1 \text{ yrs} \mid \text{current age})

pdeadn = function(startyr = 2019, endyr = 2020, birthyr){
  age = startyr - birthyr
  nyrs = endyr-startyr+1 #add one because the start year counts as the first year
  cat('number of years:', nyrs, '\n')
  
  pdead=c()
  pdead[1] = data$pdeath_male[which(data$age == age)]
  for(i in 2:nyrs){
    age = age+1
    pdead[i] = (1.0 - pdead[i - 1])*data$pdeath_male[which(data$age == age)] + pdead[i-1]
  }
  
  cat('final age:', age,'\n')
  return(pdead)
}

We apply this to our four surviving astronauts, born in 1930, 1932, 1936 and 1935, to get the probability that each of them has died by a given year, up to 2040.

startyr = 2019 #start at current year since we know they are alive
endyr   = 2040
birthyr = c(1930, 1932, 1936, 1935)
pdead   = sapply(birthyr, function(x) pdeadn(startyr=startyr, endyr=endyr, birthyr=x)) #probability of death

Assuming that the astronauts’ deaths are independent events, the probability that all the lunar astronauts have died is given by the multiplication rule,

P(all astronauts dead) = P(Aldrin dead) \times P(Scott dead) \times P(Duke dead) \times P(Schmitt dead)

Applying this we can obtain the following plot of the probability that all astronauts are dead as a function of time,

yrs      = seq(2019, 2040)
pdeadall = apply(pdead, 1, prod)
plot(yrs, pdeadall, ty = 'l', tck = 0.02, xlab = 'year', ylab = 'P(all lunar astronauts dead)', xlim = c(2019, 2040))
axis(side = 3, tck = 0.02, labels = FALSE)
axis(side = 4, tck = 0.02, labels = FALSE)
abline(v = 2024, lty = 'dashed', col = 'orange', lwd = 3)

[Figure: probability that all lunar astronauts are dead, as a function of year, with the 2024 Artemis target marked in orange]

The orange line shows the target launch date of NASA’s Artemis program to land the first woman and the next man on the moon. From this we can see that there is a 12% probability that none of the surviving Apollo astronauts will live to see the next moon landing. However, experience shows that space missions are rarely delivered on time. The following plot shows the predicted launch date of NASA’s Hubble Space Telescope as a function of time.

[Figure: predicted launch date of the Hubble Space Telescope as a function of time]

Hubble eventually launched in 1990, after roughly 2.3 times its originally scheduled development time. If the Artemis program faces similar delays, we won’t see a moon landing until 2031, in which case the probability that we slip backwards, and that I see a day when no living man has walked on the moon, will have increased to 78%; by 2034 (China’s stated goal for a crewed lunar landing) this will almost certainly (93%) be the case 😦
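
For reference, these numbers can be read straight off the pdeadall vector computed above (the comments give the values quoted in the text; your exact numbers will depend on the life table used):

pdeadall[yrs == 2024] #~0.12
pdeadall[yrs == 2031] #~0.78
pdeadall[yrs == 2034] #~0.93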

Selection function in STAN


I’m currently working on a problem where I need to sample from a very specific custom pdf. It has taken me a lot of time to get this pdf written down, but now that I finally have it, I’m just missing one key ingredient… the selection function! This is when you have some underlying data (let’s say some images of stars), but the observation of that data has been truncated (for example, limitations of the telescope mean we can only see stars brighter than 25 mag) and then scattered by some noise. The selection function is the truncation: it is how your sample is selected from the underlying population. Generally the selection is unknown, so the best way to tackle it is to fit for it.

It took me a really long time (2 days!) to sit down and figure out how to do a selection function in STAN. The problem was that I couldn’t really find any working examples on the ‘interwebs’, but it turns out that it’s actually really easy and really fast.

Here I demonstrate with some easy examples.

Case 1: Simple Gaussian.

First we need to generate some toy data.

nx = 1000 #number of data points in population
mu = 7 
sig = 1.5
x = rnorm(nx, mu, sig) #underlying population values

cut = 6
xcut = x[which(x > cut)] #sample values after selection
n = length(xcut)

dx = 0.2 #uncertainty on x
xobs = xcut + rnorm(n, 0,dx ) #observed values

#make kernel density plot of the distributions
plot(density(x), ylim=c(0,0.4), main='', col='red', xlab='x', ylab='unnormalised density')
abline(v=cut, col=rgb(0,0,0,0.3))
lines(density(xcut), lty='dotted', col='red')
lines(density(xobs), lty='dashed')
legend('topleft', legend=c('population','sample', 'observed sample'), col=c('red', 'red', 'black'), lty=c('solid','dotted','dashed'))

[Figure: kernel density estimates of the population, the truncated sample and the observed (noisy) sample]

The STAN model should look something like this:

mymodel <-"
data {
 int n;
 real xobs[n]; //observed x
 real dx;
}

transformed data {
 // real xcut = 6.0; //use for a fixed selection... generally selection is unknown!
}

parameters {
 real xcut; //selection value
 real mu;
 real sigma;
 real xtrue[n]; //true sample x
}

model {
 xcut ~ normal(6,1);
 for(i in 1:n){
  /* we truncate a lower bound at xcut using T[lower,upper]. This already includes normalisation. Also note with truncation we need to do sampling in a for loop! */
  xtrue[i] ~ normal(mu, sigma) T[xcut,];
 }
 xobs ~ normal(xtrue, dx);
}

"

We then run the code with sensible initialisation. Pro-tip: make sure to initialise xtrue above the selection cut!

library(rstan)

nchains = 3 #number of chains

#initialisation
init1 <- lapply(1:nchains, function(x) list(mu=rnorm(1, 7, 1), sigma=rnorm(1, 1.5, 0.2), xtrue=rnorm(n, 8, 0.1), xcut=6.0))

fit = stan(model_code = mymodel, data=list(n=n, xobs=xobs, dx=dx),
           init=init1, chains=nchains, iter=2000,
           control=list(adapt_delta=0.8))

You should get something that looks like this:

print(fit, pars=c('xcut','mu','sigma') )

Inference for Stan model: 4eebe5b93a44437492c76c544c69646a.
3 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=3000.

        mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
xcut    6.01    0.00 0.04 5.93 5.98 6.01 6.04  6.08   110 1.03
mu      6.83    0.01 0.22 6.36 6.71 6.86 6.99  7.17   314 1.01
sigma   1.61    0.00 0.11 1.42 1.53 1.60 1.67  1.85   488 1.01


Samples were drawn using NUTS(diag_e) at Wed Jan 24 10:02:33 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

pairs(fit, pars=c('mu','sigma'))

[Figure: pairs plot of the posterior for mu and sigma]
We can also make some posterior predictive checks. Here we show data generated from a Gaussian with the fitted mean and standard deviation (black); as you can see, they agree really well with the population.

params=extract(fit)
xfit=colMeans(params$xtrue)
mufit = mean(params$mu)
sigfit = mean(params$sigma)

plot(density(rnorm(n, mufit, sigfit)), col=rgb(0,0,0,0.5), lty='dotted', main='', xlab='x', ylim=c(0,0.4))
lines(density(rnorm(n, mufit, sigfit)), col=rgb(0,0,0,0.5), lty='dotted')
lines(density(rnorm(n, mufit, sigfit)), col=rgb(0,0,0,0.5), lty='dotted')
lines(density(rnorm(n, mufit, sigfit)), col=rgb(0,0,0,0.5), lty='dotted')
lines(density(x), col=rgb(1,0,0,0.5))
lines(density(xobs),col=rgb(1,0,0,0.5), lty='dashed')
legend('topleft', legend=c('true', 'observed', 'predicted'), col=c('red', 'red', 'black'), lty=c('solid','dashed','dotted') )

[Figure: posterior predictive densities (black, dotted) compared with the true and observed distributions (red)]

Case 2: Custom distribution

This is where things get a bit more tricky, but it still isn’t a challenge for STAN. Before we jump into the coding, we should discuss the boring part (the maths!). When we sample from a truncated pdf, what we are really doing is sampling from,

x \sim P_{[a,b]} (x) = \frac{P(x)}{\int_a^b P(u) du} .

A lower selection limit then looks like,

x \sim P_{[xcut,\infty]} (x) = \frac{P(x)}{\int_{xcut}^\infty P(u) du} .

Stan can’t evaluate integrals like this directly, but the integral can be obtained by solving an ODE. Unfortunately, the denominator here is an improper integral, and STAN doesn’t deal well with improper integrals, so we re-parameterise it in the following way:

\int_{xcut}^\infty P(x) dx = \int_0^1 P\left(xcut + \frac{x}{(1-x)} \right)/(1-x)^2 dx.

The next problem is that this form is numerically unstable: at x = 1 the integrand involves a division by zero. To get around this we simply set the upper limit to a number a little smaller than 1, e.g. 0.9999.
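
As a quick sanity check of this re-parameterisation (done in R rather than Stan, using a plain Gaussian with illustrative values mu0 = 7, sig0 = 1.5 and xcut0 = 6 of my choosing), the direct and transformed integrals agree:

mu0 = 7; sig0 = 1.5; xcut0 = 6 #illustrative values only
direct  = integrate(function(u) dnorm(u, mu0, sig0), lower = xcut0, upper = Inf)$value
reparam = integrate(function(t) dnorm(xcut0 + t/(1 - t), mu0, sig0)/(1 - t)^2, lower = 0, upper = 0.9999)$value
c(direct, reparam) #both give P(x > xcut0), about 0.75 here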

Okay, so now with the theory out of the way, we can move on to the code. We again generate some toy data, this time from a custom distribution:

n = 2000 #population size
x = c(rnorm(n/2, 4, 1.5), rnorm(n/2, 7,1)) #x is drawn from a mixture model
cut = 3
xsam = x[which(x > cut)]
ns = length(xsam) #sample size after selection

dx = 0.5 #measurement uncertainty
xobs = xsam + rnorm(ns, 0, dx) #observed sample

plot(density(x), ylim=c(0,0.3), main='', xlab='x', ylab='unnormalised density', col='red')
lines(density(xsam), col='red', lty='dotted')
lines(density(xobs), lty='dashed')
abline(v=cut, col=rgb(0,0,0,0.3))
legend('topleft', legend=c('population','sample', 'observed sample'), col=c('red', 'red', 'black'), lty=c('solid','dotted','dashed'))

[Figure: kernel density estimates of the mixture population, the truncated sample and the observed (noisy) sample]

We define the STAN model as follows:

custmodel<-"
 functions {
 real custom(real y, real mu1, real mu2, real sigma1, real sigma2){
  //custom pdf
  return 0.5*exp(- square(mu1- y) / (2 * sigma1^2) )/(sigma1*sqrt(2*pi())) +
 0.5*exp(- square(mu2 - y) / (2 * sigma2^2) )/(sigma2*sqrt(2*pi())) ;
 }

real[] N_integrand(real y, real[] state, real[] params, real[] x_r, int[] x_i) {
 //ode to solve
 real mu1 = params[1];
 real mu2 = params[2];
 real sigma1 = params[3];
 real sigma2 = params[4];
 real xcut = params[5];
 real dxdy[1];
 real ynew = xcut + y/(1-y);

 dxdy[1] = custom(ynew, mu1, mu2, sigma1, sigma2)/square(1-y);

 return dxdy;
 }
}

data {
 int n;
 real xobs[n];
 real dx;
}

parameters{
 real xcut;
 ordered[2] means; //ordered because mixture models have identifiability issues
 real sigs[2];
 real xtrue[n];
}

model{
 real norm;
 real theta[5]={means[1], means[2], sigs[1], sigs[2], xcut};

 xcut ~ normal(4.0,1.0);

/* We have to re-parameterise the improper integral int_xcut^inf p(x) dx to int_0^1 p(xcut + x/(1-x))/(1-x)^2 dx */
 norm = integrate_ode_rk45(N_integrand, rep_array(0.0,1), 0.0, rep_array(0.9999,1),
 theta, rep_array(0.0,0),rep_array(0,0))[1,1];

 for(i in 1:n){
  target += log(custom(xtrue[i], means[1], means[2], sigs[1],sigs[2]));
  if(xtrue[i] < xcut){
   target+=negative_infinity();
  }else{
   target+=-log(norm); //normalise pdf
  }
 }
 xobs ~ normal(xtrue, dx);

}

"

Note that the custom distribution is a Gaussian mixture model. Recall that in a previous post (Multivariate Gaussian Mixture Model done properly) I discussed the problems with mixture models and their identifiability issues. For this demonstration I run only one very long chain; because of a low Bayesian Fraction of Missing Information, I needed a lot more iterations and warm-up to achieve a good effective sample size. Here are the results:

nchains = 1
init1 <- lapply(1:nchains, function(x) list(means=c(4.0,7.0), sigs=c(1,1), xtrue=rep(6.0,ns), xcut=3.0))

fit = stan(model_code = custmodel, data=list(n=ns, xobs=xobs, dx=dx),
           init=init1, chains=nchains, iter=5000,
           control=list(adapt_delta=0.8))
print(fit, pars=c('means','sigs','xcut'))

Inference for Stan model: 0ff67e36d860e76473b011bccd66c477.
1 chains, each with iter=5000; warmup=2500; thin=1;
post-warmup draws per chain=2500, total post-warmup draws=2500.

         mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
means[1] 3.69    0.04 0.42 2.82 3.41 3.71 4.00  4.43   112    1
means[2] 6.88    0.01 0.11 6.65 6.81 6.89 6.96  7.08   156    1
sigs[1]  1.42    0.00 0.13 1.20 1.33 1.41 1.49  1.71   684    1
sigs[2]  1.04    0.00 0.07 0.93 1.00 1.04 1.09  1.18   246    1
xcut     3.06    0.01 0.09 2.86 3.02 3.07 3.13  3.21    61    1

Samples were drawn using NUTS(diag_e) at Thu Jan 25 11:08:37 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

pairs(fit, pars=c('means','sigs', 'xcut'))

[Figure: pairs plot of the posterior for means, sigs and xcut]

# posterior predictive checks
params=extract(fit)
means = colMeans(params$means)
sigs = colMeans(params$sigs)

plot(density(xobs), main='', xlab='x', ylab='unnormalised density', ylim=c(0,0.4), lty='dashed', col=rgb(1,0,0,0.5))
lines(density(c(rnorm(n/2, means[1], sigs[1]),rnorm(n/2, means[2], sigs[2]))), col=rgb(0,0,0,0.5), lty='dotted')
lines(density(c(rnorm(n/2, means[1], sigs[1]),rnorm(n/2, means[2], sigs[2]))), col=rgb(0,0,0,0.5), lty='dotted')
lines(density(c(rnorm(n/2, means[1], sigs[1]),rnorm(n/2, means[2], sigs[2]))), col=rgb(0,0,0,0.5), lty='dotted')
lines(density(c(rnorm(n/2, means[1], sigs[1]),rnorm(n/2, means[2], sigs[2]))), col=rgb(0,0,0,0.5), lty='dotted')
lines(density(x), col=rgb(1,0,0,0.5))
abline(v=cut, lty='dashed',col='red')

legend('topleft', legend=c('true', 'observed', 'predicted'), col=c('red', 'red', 'black'), lty=c('solid','dashed','dotted') )

[Figure: posterior predictive densities for the fitted mixture (black, dotted) compared with the true and observed distributions (red)]

Overall, good agreement. So yeah. Selection function in STAN? Easy peasy! Also see https://github.com/farr/SelectionExample and https://github.com/farr/GWDataAnalysisSummerSchool/blob/master/lecture3/lecture3.ipynb for some more examples!

Problems with binning data


I learned early on in my PhD that binning data is not a good idea. The loss of information from binning can easily lead to biased results. Here I demonstrate this with a simple linear regression toy model.

#generate toy data
nc = 1000 #number of data points (any reasonably large value will do)
x = runif(nc, 13,15)
y = 22+1.5*x+rnorm(nc, 0,0.5)

#bin in y variable
nbin = 10
ybins = seq(min(y), max(y), length.out = nbin)
ybinned = sapply(1:(nbin-1), function(z) mean(y[which(y> ybins[z] & y <= ybins[z+1])]))
ybinnedx = sapply(1:(nbin-1), function(z) mean(x[which(y> ybins[z] & y <= ybins[z+1])]))
 
 
#bin in x variable
xbins = seq(min(x), max(x), length.out = nbin)
xbinned = sapply(1:(nbin-1), function(z) mean(x[which(x> xbins[z] & x <= xbins[z+1])]))
xbinnedy = sapply(1:(nbin-1), function(z) mean(y[which(x> xbins[z] & x <= xbins[z+1])]))

plot(x,y, pch=20, cex=0.1)
abline(a=22, b=1.5)
 
points(ybinnedx, ybinned, pch=20, col='red')
points(xbinned, xbinnedy, pch=20, col='blue')

[Figure: scatter plot of x and y with the true line, the points binned in y (red) and the points binned in x (blue)]

As seen in the plot, if we bin based on the y variable (red) we bias our estimates of the slope and intercept, whereas binning in the x variable (blue) does not incur this problem. This is because the intrinsic scatter is applied only to the y variable.
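
To put numbers on this, here is a quick least-squares check using the objects defined above (the true slope is 1.5; exact values will vary from run to run):

#compare the recovered slopes against the true value of 1.5
coef(lm(y ~ x))[2]              #raw data
coef(lm(xbinnedy ~ xbinned))[2] #binned in x: still close to 1.5
coef(lm(ybinned ~ ybinnedx))[2] #binned in y: noticeably biased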

The problem intensifies when observational uncertainties are included in the x and y measurements. By modifying the following lines in the code:

#generate toy data
x = runif(nc, 13,15)
y = 22+1.5*x+rnorm(nc, 0,0.5)

to:

#generate toy data
xtrue = runif(nc, 13,15)
ytrue = 22+1.5*xtrue+rnorm(nc, 0,0.5)

x = xtrue + rnorm(nc, 0, 0.5)
y = ytrue + rnorm(nc, 0, 0.5)

we obtain a plot such as the following:

[Figure: the same plot as above, but with observational uncertainties added to both x and y]

Now a bias is seen whether we bin in x or in y. Whilst the uncertainties here have been exaggerated for clarity, it is precisely when data are noisy that binning is most common in astronomy! A good project for the keen and eager would be to investigate a way of binning that takes the measurement uncertainties and intrinsic scatter into account so as to avoid such biases, but unfortunately I already have too many projects to work on…

My city Coventry


Okay, so a few weeks ago at work a colleague asked me where I was from in the UK, and I replied:
“Oh, it’s a small little town, you’ve probably never heard of it, Coventry. It’s close to Birmingham.”
This is a natural response for me, given that I lived in the US for a year and the only British city anyone there had heard of was London.

Actually, the worst conversation I ever had went something along the lines of:

Her: “You’re from England? Wow, your English is so good!”
Me *confused*: “Erm, yeh, because English people tend to speak English…”
Her: “Oh really, I thought you spoke French there.”
Me: *facepalm*

Anyway. So it turned out that my colleague had indeed heard of Coventry, since there are at least 2 other ESA employees working here in Spain who are from Coventry.

Statistically speaking, this is a super amazing opportunity for me to calculate the probability of that actually happening!

So let’s say there are ~400 employees at ESA, Spain.

Since employment is swayed by the contributions of ESA member states, and the UK’s slice is ~8%, we expect there to be:

0.08 \times 400 = 32 employees at ESA from the UK.

The probability of selecting someone from Coventry out of the whole of the UK is given by the ratio of their populations:

\frac{\rm Coventry's\ population}{\rm UK\ population} = \frac{3.2\times 10^5}{6.6\times 10^7} = 0.005

So we would expect

0.005 \times 32 = 0.16 employees from Coventry working at ESA Spain!

Since we actually have at least 3 of them, the probability of us all being here is tiny! Note also that we come from all over Coventry and went to completely different schools, so it’s unlikely that there is some sort of selection bias at play. Nonetheless, good job Coventry on producing some great scientists!
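
For the curious, we can put a rough number on “tiny”, assuming (my assumption) that the count of Coventrians at ESA Spain follows a Poisson distribution with mean 0.16:

#probability of finding at least 3 Coventrians when only 0.16 are expected
ppois(2, lambda = 0.16, lower.tail = FALSE) #about 6e-4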