I learned early on in my PhD that binning is not a good idea. The loss of information from binning can lead to biased results. Here I demonstrate this with a simple linear regression toy model.
    # generate toy data
    nc = 1e4  # number of points; not defined in the original snippet, 1e4 is an assumed value
    x = runif(nc, 13, 15)
    y = 22 + 1.5*x + rnorm(nc, 0, 0.5)

    # bin in y variable: mean y and mean x of the points in each y bin
    nbin = 10  # number of bin edges, giving nbin-1 bins
    ybins = seq(min(y), max(y), length.out = nbin)
    ybinned = sapply(1:(nbin-1), function(z) mean(y[which(y > ybins[z] & y <= ybins[z+1])]))
    ybinnedx = sapply(1:(nbin-1), function(z) mean(x[which(y > ybins[z] & y <= ybins[z+1])]))

    # bin in x variable: mean x and mean y of the points in each x bin
    xbins = seq(min(x), max(x), length.out = nbin)
    xbinned = sapply(1:(nbin-1), function(z) mean(x[which(x > xbins[z] & x <= xbins[z+1])]))
    xbinnedy = sapply(1:(nbin-1), function(z) mean(y[which(x > xbins[z] & x <= xbins[z+1])]))

    # plot the raw data, the true relation, and both sets of binned points
    plot(x, y, pch=20, cex=0.1)
    abline(a=22, b=1.5)
    points(ybinnedx, ybinned, pch=20, col='red')
    points(xbinned, xbinnedy, pch=20, col='blue')
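As seen in the plot, binning on the y variable (red) biases the estimated slope and intercept, whereas binning on the x variable (blue) does not. This is because the intrinsic scatter is applied only to the y variable: selecting on y also selects on the scatter, so the highest y bins preferentially contain points scattered upwards (and the lowest bins points scattered downwards), and their mean x is pulled towards the middle of the x range.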
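To put rough numbers on what the plot shows, one option is to fit a straight line to each set of binned points and compare the slopes with the true value of 1.5. The following is only a sketch: the helper `bin_means`, the seed, and `nc = 1e4` are my own assumptions for illustration, and the unweighted `lm` fit to the bin means is just one simple choice.

```r
set.seed(1)   # assumed seed, only for reproducibility
nc = 1e4      # assumed sample size

# same toy model as above: scatter in y only
x = runif(nc, 13, 15)
y = 22 + 1.5*x + rnorm(nc, 0, 0.5)

# hypothetical helper: mean x and mean y in equal-width bins of `by`
bin_means = function(by, x, y, nbin = 10) {
  edges = seq(min(by), max(by), length.out = nbin)
  bin = cut(by, breaks = edges, include.lowest = TRUE)
  data.frame(x = as.numeric(tapply(x, bin, mean)),
             y = as.numeric(tapply(y, bin, mean)))
}

xb = bin_means(x, x, y)   # binned in x
yb = bin_means(y, x, y)   # binned in y

# unweighted straight-line fits to the bin means
slope_xbin = coef(lm(y ~ x, data = xb))[["x"]]
slope_ybin = coef(lm(y ~ x, data = yb))[["x"]]

print(c(true = 1.5, xbinned = slope_xbin, ybinned = slope_ybin))
```

With scatter in y only, the x-binned slope should land close to 1.5, while the y-binned slope should come out noticeably steeper.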
The problem intensifies when observational uncertainties are included in the x and y measurements. By modifying the following lines in the code:
    # generate toy data
    x = runif(nc, 13, 15)
    y = 22 + 1.5*x + rnorm(nc, 0, 0.5)
to:
    # generate toy data
    xtrue = runif(nc, 13, 15)
    ytrue = 22 + 1.5*xtrue + rnorm(nc, 0, 0.5)
    # add observational uncertainties to both variables
    x = xtrue + rnorm(nc, 0, 0.5)
    y = ytrue + rnorm(nc, 0, 0.5)
we obtain a plot such as the following:
Now a bias is seen when binning in either x or y. Whilst the uncertainties here have been exaggerated for clarity, in Astronomy it is precisely when data are noisy that binning is most common! A good project for the keen and eager would be to investigate a way to bin such that the measurement uncertainties and scatter are taken into account, so as to avoid such biases, but unfortunately I already have too many projects to work on…
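One way to see the size of the bias in the noisy case is to fit straight lines to the binned points and to the unbinned data. This is a sketch under my own assumptions: `bin_means`, the seed, and `nc = 1e4` are illustrative choices, not part of the code above. For comparison it also computes the standard attenuation expected for an ordinary least-squares fit when x carries measurement error, 1.5 * var(xtrue) / (var(xtrue) + 0.5^2), which is about 0.86 here.

```r
set.seed(1)   # assumed seed, only for reproducibility
nc = 1e4      # assumed sample size

# toy model with measurement errors on both variables
xtrue = runif(nc, 13, 15)
ytrue = 22 + 1.5*xtrue + rnorm(nc, 0, 0.5)
x = xtrue + rnorm(nc, 0, 0.5)
y = ytrue + rnorm(nc, 0, 0.5)

# hypothetical helper: mean x and mean y in equal-width bins of `by`
bin_means = function(by, x, y, nbin = 10) {
  edges = seq(min(by), max(by), length.out = nbin)
  bin = cut(by, breaks = edges, include.lowest = TRUE)
  data.frame(x = as.numeric(tapply(x, bin, mean)),
             y = as.numeric(tapply(y, bin, mean)))
}

xb = bin_means(x, x, y)   # binned in x
yb = bin_means(y, x, y)   # binned in y

slope_raw  = coef(lm(y ~ x))[["x"]]             # unbinned least squares
slope_xbin = coef(lm(y ~ x, data = xb))[["x"]]  # fit to x-binned means
slope_ybin = coef(lm(y ~ x, data = yb))[["x"]]  # fit to y-binned means

# expected least-squares attenuation; var of uniform(13, 15) is (15-13)^2/12
slope_expected = 1.5 * (4/12) / (4/12 + 0.5^2)

print(c(true = 1.5, expected_ols = slope_expected, raw = slope_raw,
        xbinned = slope_xbin, ybinned = slope_ybin))
```

The x-binned and unbinned slopes should both come out flatter than 1.5, and the y-binned slope steeper, so no choice of binning axis recovers the true relation here.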