This could be a very boring post (even more so than normal) and could be complete nonsense, but anyway. I was watching the final of the Australian Open tennis on Sunday and it struck me as quite amazing that these tournaments so often go to form. Typically those in the final and semi-finals are those who were ranked highest. This may seem obvious but naively one might imagine it to be quite different. These tournaments comprise 100 or so of the best tennis players in the world. Since they are all so much better than everyone else, you might imagine that they would be similar in ability and that a player ranked 30th could, quite often, beat a player in the top 10. You might expect the rankings to change quite regularly. It seems, however, that this isn’t really the case. When a player reaches the top 10 they can stay there for quite some time, and even though these are all some of the best tennis players in the world, a player in the top 10 will typically beat a player ranked 30th.
I wondered if this wasn’t quite a nice illustration of the properties of a normal distribution. Imagine we could plot a distribution of the abilities of all tennis players. We might expect it to have a normal distribution like that shown in the figure below. There will be a few really bad tennis players, most will be average, and there will be a few very good tennis players. I don’t know if this is really what it would be like, but if ability is the combined result of many small, independent factors, the central limit theorem suggests that the distribution is likely to be close to normal.
The form of the Normal Distribution is

$$ n(x) = \frac{N}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) $$

where x is the ability, μ is the mean ability, N is the size of the sample and σ is the standard deviation (essentially how variable the distribution is). If the distribution is very narrow (i.e., everyone’s abilities are very similar) the standard deviation is small. If the distribution is wide (i.e., the abilities are quite varied) the standard deviation is big.
What the figure above also shows is the percentage of the sample in each standard deviation interval. For example, 34.1% of the sample lies between 0 and 1σ and 13.6% lies between 1σ and 2σ. The table below shows the percentage for each interval up to 6σ.
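The percentages in each interval can be recomputed directly from the normal distribution. The following is a minimal sketch, assuming a perfectly normal distribution; the function names `upper_tail` and `interval_percentage` are just illustrative choices, not anything from a particular library.

```python
import math


def upper_tail(z: float) -> float:
    """Fraction of a normal distribution lying more than z standard
    deviations above the mean (uses erfc to stay accurate far out)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def interval_percentage(lo: float, hi: float) -> float:
    """Percentage of the sample lying between lo and hi standard
    deviations above the mean."""
    return 100.0 * (upper_tail(lo) - upper_tail(hi))


# Print the percentage for each 1-sigma interval above the mean, up to 6 sigma.
for k in range(6):
    print(f"{k} to {k + 1} sigma: {interval_percentage(k, k + 1):.5f}%")
```

The first two rows come out at roughly 34.1% and 13.6%, matching the figure, and the 5 to 6σ row is roughly 0.00003%.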
In the above I’ve only considered the intervals above the mean. If one were presenting some data analysis with errors, the error would normally be some number of standard deviations and would tell you how significant the result is. For example, 1σ errors would tell you that there was a 68.2% chance that the result lies in the reported range (i.e., 34.1% times 2 since you’re considering the region on either side of the most likely value). If you report a 5σ error (as for the Higgs boson result) this means that there is a 99.99994% chance of the actual value lying within the reported range (i.e., 100 – 0.00003×2, since almost everything beyond 5σ lies in the 5-6σ interval), although in this case it may actually be that there is a 99.99994% chance that the signal is real, rather than an extremely unlikely noise spike.
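The two-sided numbers above can be checked in the same way. This is a small sketch, again assuming an exactly normal distribution; `coverage_within` is a name made up for this illustration.

```python
import math


def coverage_within(k: float) -> float:
    """Percentage of a normal distribution lying within k standard
    deviations of the mean, counting both sides."""
    return 100.0 * math.erf(k / math.sqrt(2.0))


# Coverage for the usual 1, 2, 3 and 5 sigma ranges.
for k in (1, 2, 3, 5):
    print(f"within {k} sigma: {coverage_within(k):.5f}%")
```

For 1σ this gives about 68.27% (the 68.2% above is just 34.1% doubled) and for 5σ about 99.99994%.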
Where am I going with this? There appear to be a few tennis players in the world who will typically beat almost anyone else almost all the time. If I assume that there are 10 million active tennis players in the world (I have no idea if this is a reasonable number or not) then the table above would suggest that only 0.00003% of them would have abilities 5-6σ better than the average. This means that there would typically be only 3 players who have this level of ability (i.e., 0.00003/100×10000000). Essentially, if tennis ability is normally distributed you would actually expect there to be only a few players in the world who are this much better than the average. So, maybe this makes perfect sense. As you consider the extremes, there are fewer and fewer players, and if you have a big enough sample (and given how many people play tennis, the sample is probably quite large) you have a chance of finding a small number who are so much better than the rest that they would typically beat almost anyone else. Alternatively, this is all nonsense and I have no idea what I’m talking about.
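The head count can be sketched as a short calculation. It assumes tennis ability really is normally distributed, and it uses the 10 million figure, which is only this post's guess; `POPULATION`, `upper_tail` and `expected_players` are names invented for the example.

```python
import math

# The post's guess at the number of active tennis players (an assumption).
POPULATION = 10_000_000


def upper_tail(z: float) -> float:
    """Fraction of a normal distribution more than z sigma above the mean."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def expected_players(population: int, lo: float, hi: float) -> float:
    """Expected number of players whose ability lies between lo and hi
    standard deviations above the average."""
    return population * (upper_tail(lo) - upper_tail(hi))


# Expected head count in each 1-sigma band from 2 sigma up to 6 sigma.
for lo in range(2, 6):
    count = expected_players(POPULATION, lo, lo + 1)
    print(f"{lo} to {lo + 1} sigma above average: about {count:,.0f} players")
```

With these assumptions the 5 to 6σ band holds about 3 players, while the 4 to 5σ band holds a few hundred, which shows how quickly the numbers thin out towards the extreme.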