Problems with binning data


I learned early on in my PhD that binning is not a good idea. The loss of information from binning will lead to biased results. Here I demonstrate this with a simple linear regression toy model.

#generate toy data
x = runif(nc, 13,15)
y = 22+1.5*x+rnorm(nc, 0,0.5)

#bin in y variable
nbin = 10
ybins = seq(min(y), max(y), length.out = nbin)
ybinned = sapply(1:(nbin-1), function(z) mean(y[which(y> ybins[z] & y <= ybins[z+1])]))
ybinnedx = sapply(1:(nbin-1), function(z) mean(x[which(y> ybins[z] & y <= ybins[z+1])]))
#bin in x variable
xbins = seq(min(x), max(x), length.out = nbin)
xbinned = sapply(1:(nbin-1), function(z) mean(x[which(x> xbins[z] & x <= xbins[z+1])]))
xbinnedy = sapply(1:(nbin-1), function(z) mean(y[which(x> xbins[z] & x <= xbins[z+1])]))

plot(x,y, pch=20, cex=0.1)
abline(a=22, b=1.5)
points(ybinnedx, ybinned, pch=20, col='red')
points(xbinned, xbinnedy, pch=20, col='blue')


As seen in the plot, if we bin based on the y variable (red), we will bias our estimation of the slope and intercept whereas binning in the x variable (blue) does not incur this problem. This is because the intrinsic scatter is applied only to the y variable.

The problem intensifies when observational uncertainties are included in the x and y measurements. By modifying the following lines in the code:

#generate toy data
x = runif(nc, 13,15)
y = 22+1.5*x+rnorm(nc, 0,0.5)


#generate toy data
xtrue = runif(nc, 13,15)
ytrue = 22+1.5*xtrue+rnorm(nc, 0,0.5)

x = xtrue + rnorm(nc, 0, 0.5)
y = ytrue + rnorm(nc, 0, 0.5)

we obtain a plot such as the following:


Now a bias is seen binning in either x or y. Whilst this may have been exaggerated to large uncertainties for clarity, it is usually more common in Astronomy to bin when data is noisy! A good project for the keen and eager then should be to investigate a way to bin such that the measurement uncertainties and scatter are taken into account so to avoid such biases, but unfortunately I already have too many projects to work on…