Galaxy cluster classification with convolutional neural networks


During my time at ESA, I mentored a Master’s student, Matej Kosiba from Masaryk University, and we recently published a paper together on using convolutional neural networks to detect clusters of galaxies (

You see, the detection of galaxy clusters is extremely out-dated. Typically a wavelet filtering algorithm is applied on X-ray images to pull out candidate clusters, and then a team of “experts” will examine each candidate, often combined with access to the corresponding optical data on the sky, to verify if the object is actually a cluster. More often than not, the candidates will be contaminated with point sources (stars, AGN), nearby galaxies and artefacts within the data. So when you have perhaps 1000 candidate objects, with 2 experts looking at each candidate it still is doable. But when X-ray observatory e-ROSITA comes online ~100,000 the traditional methods of detection will be inefficient.

The research therefore turned to machine learning, which is more adapted to dealing with huge numbers of tasks quickly and in an objective manner. In particular, convolutional neural networks which were developed specifically to apply to image data. Convolutional neural networks (which I’ll leave the details for another post) learn the hierarchical structure of features within images of the same class, and are relatively insensitive to small offsets or changes in the images.

To train the network we require image data and the corresponding class labels. The image data are X-ray images stacked onto the Optical images. For the class labels, we take 2 approaches. The first approach used the power of citizen science to crowd classify large samples of the data by large numbers of volunteers. To do this the authors created a Zooniverse project attracting thousands of volunteers. The second approach took labels classified by 2 experts. We compared them using a spectroscopically confirmed cross calibration sample.

Screenshot 2020-06-12 at 09.19.12

This is the money plot, it’s called a ROC curve (receiver operating characteristic curve) and shows the true positive rate vs false positive rate for different thresholds of classification. The idea is that is the lines lie on the diagonal, then the networks are performing as well as chance. Ideally you want the lines to draw up to the top left hand corner. In this plot we see the various networks tried: CN-E (custom network trained on expert labels), MN-E (Mobilenet trained on expert labels), CN-Z (custom network trained on Zooniverse labels).

The main result of the paper is that CNNs could achieve over 90% accuracy on classifying clusters of galaxies, on multi-wavelength data (X-ray plus optical) and at a tiny fraction of the time it takes for an expert to do the same. Additionally we found that the X-ray data were better for classification of clusters, however this may be due to the limited depth of the optical data we had at hand.



Will there be a last man on the moon?


I was born in 1990. Mankind first stepped on the moon on July 20th 1969. We did it again in November of the same year, again in February and July of 1971, and in April and December of the following year.

In total there have been 12 men to have set foot on another celestial body. I have never known a day where I have not been sharing the Earth with a lunar astronaut. But today 8 of those 12 heroes have passed away. The 4 remaining, Buzz Aldrin, David Scott, Charles Duke and Harrison Schmitt are aged 89, 87, 83 and 84 respectively.

We haven’t been back to the moon since 1972. NASA have a target to get another human there with the Artemis mission by 2024, but is this already too late?

So my question to you is Is it likely that we will see another human on the moon before all our lunar astronauts go extinct? or will it soon be the day of the last man on the moon?

We can use statistics to help us infer the odds.

Firstly let’s take mortality predictions from the US social security actuarial life table of 2016.

data = read.table('life_exp.txt',header=TRUE)

You can download the data here.

Next we calculate the probability that an astronaut will die within n years, given their age. This is given by,

P(death in N years | current age) =
P(death in year N | age in year N) \times P(alive in year N-1 | current age) + P(death in year N-1 | current age)

pdeadn = function(startyr = 2019, endyr = 2020, birthyr){
  age = startyr - birthyr
  nyrs = endyr-startyr+1 #add one because you count the first year as 2016
  cat('number of years:', nyrs, '\n')
  pdead[1] = data$pdeath_male[which(data$age == age)]
  for(i in 2:nyrs){
    age = age+1
    pdead[i] = (1.0 - pdead[i - 1])*data$pdeath_male[which(data$age == age)] + pdead[i-1]
  cat('final age:', age,'\n')

We apply this to our 4 alive astronauts, who were born in 1930, 1932, 1936 and 1935, to get the probability of their deaths up until 2040.

startyr = 2019 #start at current year since we know they are alive
endyr   = 2040
birthyr = c(1930, 1932, 1936, 1935)
pdead   = sapply(birthyr, function(x) pdeadn(startyr=startyr, endyr=endyr, birthyr=x)) #probability of death

Assuming that the deaths of any of the astronauts are independent events, the probability that all the lunar astronauts are dead is given by the multiplication rule,

P(all astronauts dead) = P(Aldrin dead) \times P(Scott dead) \times P(Duke dead) \times P(Schmitt dead)

Applying this we can obtain the following plot of the probability that all astronauts are dead as a function of time,

yrs      = seq(2019, 2040)
pdeadall = apply(pdead, 1, prod)
plot(yrs, pdeadall, ty = 'l', tck = 0.02, xlab = 'year', ylab = 'P(all lunar astronauts dead)', xlim = c(2019, 2040))
axis(side = 3, tck = 0.02, labels = FALSE)
axis(side = 4, tck = 0.02, labels = FALSE)
abline(v = 2024, lty = 'dashed', col = 'orange', lwd = 3)


The orange line shows the goal launch date of NASA’s Artemis program to get the first woman and next man on the moon. From this we can see that there is a 12% probability that all the Apollo Astronauts will not see the next landing on the moon. However, experience shows that space launches are never delivered on time. The following plot shows the predicted launch date of NASA’s Hubble Space Telescope as a function of time.


Hubble launched in 1990, 2.3 times the original scheduled launch time. If the Artemis program faces similar delays, we won’t see a moon landing until 2031, by which case the probability that we will go back on progress, and that I will see a day where no man has stepped on the moon will have increased to 78%, and by 2034 (the set goal of the Chinese to land on the moon) almost certainly (93%) this will be the case 😦

Have you got what it takes to be the next Cosmonaut?

Blog, Glossary

Russian space agency ROSCOSMOS this week announced the opening of the next selection programme of cosmonauts (Russian astronaut) see:

Don’t read Russian? No problem, I have you sorted:


You’re welcome! Applications are accepted from Russian citizens between now and 1 June 2020, and at the end of it there will be 4-6 newly selected Cosmonauts! Could that be you?

Interestingly they require a minimum footsize of 29.5cm… I wonder what that could be for?

Whilst here you know see the entry requirements for a Russian Astronaut, in my upcoming youtube video, I’ll be talking about what it takes it be an astronaut in general, so if you don’t follow my youtube channel yet, you really should!

Surrogate models


At StanCon 2018 Helsinki, a couple of us gathered together for this thing Breck organised – birds of a feather, to discuss topics of interest with each other. I think there were about 8 of us interested in surrogate models and these are some of the notes from that meeting.

So what is a surrogate model?

Let’s say you have a function

x = F(\alpha, \beta, ...)

you could approximate the function F with a surrogate model

x \approx \hat{F}(\alpha, \beta, ...)

Why would you want to do this?

Some ideas we came up with are:

  1.  When you have a complex function, which is difficult to code up but you can generate many simulations from it.
  2.  You have a very expensive, slow function, and you need in service modelling so need results fast.
  3.  You have no idea what the function is, you might have some data for it, but this is likely to be sparse

How do you go about doing this?

The most common methods involve some sort of interpolation:

  • splines, polynomial interpolations
  • neural networks
  • gaussian processes

In cases where you have some sparse data and an unknown/complex function we decided that using either a probability of detection (POD) model, principle component analysis (PCA) or singular value decomposition (SVD) would be good idea.

Some food for thought

  • If in our model we have something like:

y \sim N(F(a,b,...),\sigma)

and we substitute in a surrogate model

y \sim N(\hat{F}(a,b,...),\sigma)

It seems like \sigma is just another additional parameter and so maybe we could consider ingesting it directly into our surrogate model.

  • Where do surrogate models fit into statistical models?
  • How can we make surrogate models happen?
  • How can this generalise to different problems?
  • How can we diagnose it?

Selection function in STAN


I’m currently working on a problem where I need to sample from a very specific custom pdf. It has taken me a lot of time to get this pdf written down but now that I finally have it, I’m just missing one key ingredient… the selection function! This is when you have some underlying data (lets say some images of stars), but the observation of that data has been truncated (for example limitations of the telescope mean we can only see stars brighter than 25 mag) and then scattered by some noise. The selection function is the truncation, it is how your sample is selected from the underlying population. Generally the selection is unknown so the best way to tackle it is to fit for it.

It took me a really long time (2 days!) to sit down and figure out how to do a selection function in STAN. The problem was that I couldn’t really find any working examples on the ‘interwebs‘ but turns out that its actually really easy and really fast.

Here I demonstrate with some easy examples.

Case 1: Simple Gaussian.

First we need to generate some toy data.

nx = 1000 #number of data points in population
mu = 7 
sig = 1.5
x = rnorm(nx, mu, sig) #underlying population values

cut = 6
xcut = x[which(x > cut)] #sample values after selection
n = length(xcut)

dx = 0.2 #uncertainty on x
xobs = xcut + rnorm(n, 0,dx ) #observed values

#make kernel density plot of the distributions
plot(density(x), ylim=c(0,0.4), main='', col='red', xlab='x', ylab='unnormalised density')
abline(v=cut, col=rgb(0,0,0,0.3))
lines(density(xcut), lty='dotted', col='red')
lines(density(xobs), lty='dashed')
legend('topleft', legend=c('population','sample', 'observed sample'), col=c('red', 'red', 'black'), lty=c('solid','dotted','dashed'))


The STAN model should look something like this:

mymodel <-"
data {
 int n;
 real xobs[n]; //observed x
 real dx;

transformed data {
 // real xcut = 6.0; //use for a fixed selection... generally selection is unknown!

parameters {
 real xcut; //selection value
 real mu;
 real sigma;
 real xtrue[n]; //true sample x

model {
 xcut ~ normal(6,1);
 for(i in 1:n){
  /* we truncate a lower bound at xcut using T[lower,upper]. This already includes normalisation. Also note with truncation we need to do sampling in a for loop! */
  xtrue[i] ~ normal(mu, sigma) T[xcut,];
 xobs ~ normal(xtrue, dx);


We then run the code, with sensible initialisation. Pro-tip: make sure to initiate xtrue above the selection cut!

 nchains = 3 #number of chains

 init1 <- lapply(1:nchains, function(x) list(mu=rnorm(1, 7,1), sigma=rnorm(1,1.5,0.2), xtrue=rnorm(n, 8,0.1), xcut=6.0))

fit = stan(model_code = mymodel, data=list(n=n, xobs=xobs, dx=dx),
 init=init1, chains=nchains, iter=2000,

You should get something that looks like this:

print(fit, pars=c('xcut','mu','sigma') )

Inference for Stan model: 4eebe5b93a44437492c76c544c69646a.
3 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=3000.

mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
xcut 6.01 0.00 0.04 5.93 5.98 6.01 6.04 6.08 110 1.03
mu 6.83 0.01 0.22 6.36 6.71 6.86 6.99 7.17 314 1.01
sigma 1.61 0.00 0.11 1.42 1.53 1.60 1.67 1.85 488 1.01

Samples were drawn using NUTS(diag_e) at Wed Jan 24 10:02:33 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

pairs(fit, pars=c('mu','sigma'))

We can also make some posterior predictive checks, here we show some data generated from a gaussian of the fitted mean and standard deviation (black), as you can see they agree really well with the population.

mufit = mean(params$mu)
sigfit = mean(params$sigma)

plot(density(rnorm(n, mufit, sigfit)), col=rgb(0,0,0,0.5), lty='dotted', main='', xlab='x', ylim=c(0,0.4))
lines(density(rnorm(n, mufit, sigfit)), col=rgb(0,0,0,0.5), lty='dotted')
lines(density(rnorm(n, mufit, sigfit)), col=rgb(0,0,0,0.5), lty='dotted')
lines(density(rnorm(n, mufit, sigfit)), col=rgb(0,0,0,0.5), lty='dotted')
lines(density(x), col=rgb(1,0,0,0.5))
lines(density(xobs),col=rgb(1,0,0,0.5), lty='dashed')
legend('topleft', legend=c('true', 'observed', 'predicted'), col=c('red', 'red', 'black'), lty=c('solid','dashed','dotted') )


Case 2: Custom distribution

This is where things get a bit more tricky, but still isn’t a challenge for STAN. Before we jump into the coding, we should discuss the boring part (the math!). When we sample from a truncated pdf what we really are doing is sampling from,

x \sim P_{[a,b]} (x) = \frac{P(x)}{\int_a^b P(u) du} .

A lower selection limit then looks like,

x \sim P_{[xcut,\infty]} (x) = \frac{P(x)}{\int_{xcut}^\infty P(u) du} .

Stan can’t do integrals numerically but it can be done by solving the ODE. Unfortunately the denominator in this is an improper integral and STAN doesn’t deal well with improper integrals so we can re-parameterise the integral in the following way:

\int_{xcut}^\infty P(x) dx = \int_0^1 P\left(xcut + \frac{x}{(1-x)} \right)/(1-x)^2 dx.

Now the next problem is that this integral is numerically unstable. At x=1, the integrand is infinite. To solve this we can just do a set the upper limit to a number that is a little bit smaller than 1, e.g. 0.9999.

Okay so now with the theory out of the way, we can move onto the code. We first generate again some toy data from a custom distribution:

n = 2000 #population size
x = c(rnorm(n/2, 4, 1.5), rnorm(n/2, 7,1)) #x is drawn from a mixture model
cut = 3
xsam = x[which(x > cut)]
ns = length(xsam) #sample size after selection

dx = 0.5 #measurement uncertainty
xobs = xsam + rnorm(ns, 0, dx) #observed sample

plot(density(x), ylim=c(0,0.3), main='', xlab='x', ylab='unnormalised density', col='red')
lines(density(xsam), col='red', lty='dotted')
lines(density(xobs), lty='dashed')
abline(v=cut, col=rgb(0,0,0,0.3))
legend('topleft', legend=c('population','sample', 'observed sample'), col=c('red', 'red', 'black'), lty=c('solid','dotted','dashed'))


We define the STAN model as follows:

 functions {
 real custom(real y, real mu1, real mu2, real sigma1, real sigma2){
  //custom pdf
  return 0.5*exp(- square(mu1- y) / (2 * sigma1^2) )/(sigma1*sqrt(2*pi())) +
 0.5*exp(- square(mu2 - y) / (2 * sigma2^2) )/(sigma2*sqrt(2*pi())) ;

real[] N_integrand(real y, real[] state, real[] params, real[] x_r, int[] x_i) {
 //ode to solve
 real mu1 = params[1];
 real mu2 = params[2];
 real sigma1 = params[3];
 real sigma2 = params[4];
 real xcut = params[5];
 real dxdy[1];
 real ynew = xcut + y/(1-y);

 dxdy[1] = custom(ynew, mu1, mu2, sigma1, sigma2)/square(1-y);

 return dxdy;

data {
 int n;
 real xobs[n];
 real dx;

 real xcut;
 ordered[2] means; //ordered because mixture models have identifiability issues
 real sigs[2];
 real xtrue[n];

 real norm;
 real theta[5]={means[1], means[2], sigs[1], sigs[2], xcut};

 xcut ~ normal(4.0,1.0);

/* We have to re-parameterise the improper integral int_xcut^inf p(x) dx to int_0^1 p(xcut + x/(1-x))/(1-x)^2 dx */
 norm = integrate_ode_rk45(N_integrand, rep_array(0.0,1), 0.0, rep_array(0.9999,1),
 theta, rep_array(0.0,0),rep_array(0,0))[1,1];

 for(i in 1:n){
  target += log(custom(xtrue[i], means[1], means[2], sigs[1],sigs[2]));
  if(xtrue[i] < xcut){
   target+=-log(norm); //normalise pdf
 xobs ~ normal(xtrue, dx);



Note that the custom distribution is a gaussian mixture model. Recall in a previous post (Multivariate Gaussian Mixture Model done properly) I discussed the problems with mixture models and their identifiability issues. For this example I only run 1 very long chain for the demonstration since the Bayesian Fraction of Missing Information, I needed a lot more iterations and warm-up to achieve a good effective sample size. Here’s the results:

nchains = 1
init1 <- lapply(1:nchains, function(x) list(means=c(4.0,7.0), sigs=c(1,1), xtrue=rep(6.0,ns), xcut=3.0))

fit = stan(model_code = custmodel, data=list(n=ns, x=xobs, dx=dx), init=init1, chains=nchains, iter=5000, control=list(adapt_delta=0.8))
print(fit, pars=c('means','sigs','xcut'))

Inference for Stan model: 0ff67e36d860e76473b011bccd66c477.
1 chains, each with iter=5000; warmup=2500; thin=1;
post-warmup draws per chain=2500, total post-warmup draws=2500.

mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
means[1] 3.69 0.04 0.42 2.82 3.41 3.71 4.00 4.43 112 1
means[2] 6.88 0.01 0.11 6.65 6.81 6.89 6.96 7.08 156 1
sigs[1] 1.42 0.00 0.13 1.20 1.33 1.41 1.49 1.71 684 1
sigs[2] 1.04 0.00 0.07 0.93 1.00 1.04 1.09 1.18 246 1
xcut 3.06 0.01 0.09 2.86 3.02 3.07 3.13 3.21 61 1

Samples were drawn using NUTS(diag_e) at Thu Jan 25 11:08:37 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

pairs(fit, pars=c('means','sigs', 'xcut'))


# posterior predictive checks
means = colMeans(params$means)
sigs = colMeans(params$sigs)

plot(density(xobs), main='', xlab='x', ylab='unnormalised density', ylim=c(0,0.4), lty='dashed', col=rgb(1,0,0,0.5))
lines(density(c(rnorm(n/2, means[1], sigs[1]),rnorm(n/2, means[2], sigs[2]))), col=rgb(0,0,0,0.5), lty='dotted')
lines(density(c(rnorm(n/2, means[1], sigs[1]),rnorm(n/2, means[2], sigs[2]))), col=rgb(0,0,0,0.5), lty='dotted')
lines(density(c(rnorm(n/2, means[1], sigs[1]),rnorm(n/2, means[2], sigs[2]))), col=rgb(0,0,0,0.5), lty='dotted')
lines(density(c(rnorm(n/2, means[1], sigs[1]),rnorm(n/2, means[2], sigs[2]))), col=rgb(0,0,0,0.5), lty='dotted')
lines(density(x), col=rgb(1,0,0,0.5))
abline(v=cut, lty='dashed',col='red')

legend('topleft', legend=c('true', 'observed', 'predicted'), col=c('red', 'red', 'black'), lty=c('solid','dashed','dotted') )


Overall good agreement. So yeh. Selection function in STAN? Easy peasy!. Also see and for some more examples!

Problems with binning data


I learned early on in my PhD that binning is not a good idea. The loss of information from binning will lead to biased results. Here I demonstrate this with a simple linear regression toy model.

#generate toy data
x = runif(nc, 13,15)
y = 22+1.5*x+rnorm(nc, 0,0.5)

#bin in y variable
nbin = 10
ybins = seq(min(y), max(y), length.out = nbin)
ybinned = sapply(1:(nbin-1), function(z) mean(y[which(y> ybins[z] & y <= ybins[z+1])]))
ybinnedx = sapply(1:(nbin-1), function(z) mean(x[which(y> ybins[z] & y <= ybins[z+1])]))
#bin in x variable
xbins = seq(min(x), max(x), length.out = nbin)
xbinned = sapply(1:(nbin-1), function(z) mean(x[which(x> xbins[z] & x <= xbins[z+1])]))
xbinnedy = sapply(1:(nbin-1), function(z) mean(y[which(x> xbins[z] & x <= xbins[z+1])]))

plot(x,y, pch=20, cex=0.1)
abline(a=22, b=1.5)
points(ybinnedx, ybinned, pch=20, col='red')
points(xbinned, xbinnedy, pch=20, col='blue')


As seen in the plot, if we bin based on the y variable (red), we will bias our estimation of the slope and intercept whereas binning in the x variable (blue) does not incur this problem. This is because the intrinsic scatter is applied only to the y variable.

The problem intensifies when observational uncertainties are included in the x and y measurements. By modifying the following lines in the code:

#generate toy data
x = runif(nc, 13,15)
y = 22+1.5*x+rnorm(nc, 0,0.5)


#generate toy data
xtrue = runif(nc, 13,15)
ytrue = 22+1.5*xtrue+rnorm(nc, 0,0.5)

x = xtrue + rnorm(nc, 0, 0.5)
y = ytrue + rnorm(nc, 0, 0.5)

we obtain a plot such as the following:


Now a bias is seen binning in either x or y. Whilst this may have been exaggerated to large uncertainties for clarity, it is usually more common in Astronomy to bin when data is noisy! A good project for the keen and eager then should be to investigate a way to bin such that the measurement uncertainties and scatter are taken into account so to avoid such biases, but unfortunately I already have too many projects to work on…




The first ever dark matter day… Why the particle physicists decided to it should coincide with halloween I have no clue. Clearly the trending tags will be dominated by the later occasion right?

Anyway I made a video about dark matter. I hope you enjoy it!

My city Coventry


Okay so a few weeks ago at work a colleague asked me where I was from in the UK, and I replied
“Oh, it’s a small little town, you’ve probably never heard of it, Coventry. It’s close to Birmingham”.
This is a natural response for me given that I lived in the US for a year and the only city anyone had heard of is London.

Actually the worse conversation I ever had went something along the lines of:

Her: “You’re from England, wow you’re english is so good!”
Me *confused*: “Erm, yeh because English people tend to speak english… ”
Her: “Oh really, I thought you spoke french there”
Me: *facepalm*

Anyway. So it turned out that my colleague had indeed heard of coventry since there are at least 2 other ESA employees working here in Spain who are from Coventry.

Statistically wise this is a super amazing opportunity for me to calculate the probability of that actually happening!

So lets say there are ~400 employees at ESA, Spain.

Since employment is swayed by the contribution of ESA member states and the UK’s slice is ~8%, we expect there to be:

0.08 \times 400 = 32 employees at ESA from the UK.

The probability of selecting someone from coventry out of the whole of the UK is given by the ratio of their populations:

\frac{\rm Coventry's\ population}{\rm UK\ population} = \frac{3.2\times 10^5}{6.6\times 10^7} = 0.005

So we would expect

0.005 \times 32 = 0.16 employees from Coventry working at ESA spain!

Since we have at least 3 of them, the likelihood of us all being here is much smaller! Note also that we all come from all over Coventry and went to completely different schools, so its unlikely that there could be some sort of bias. None the less, good job Coventry on producing some great scientists!