Online resources for statisticians

My students often look up statistical methods on Wikipedia. Sometimes they admit this with a hint of embarrassment in their voices. They are right to be cautious when using Wikipedia (not all pages are well-written) and I’m therefore pleased when they ask me if there are other good online resources for statisticians.

I usually tell them that Wikipedia actually is very useful, especially for looking up properties of various distributions, such as density functions, moments and relationships between distributions. I wouldn’t cite the Wikipedia page on, say, the beta distribution in a paper, but if I need to check what the mode of said distribution is, it is the first place that I look. While not as exhaustive as the classic Johnson & Kotz books, the Wikipedia pages on distributions tend to contain quite a bit of surprisingly accurate information. That being said, there are misprints to be found, just as with any textbook (the difference being that you can fix those misprints – I’ve done so myself on a few occasions).
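To give a flavour of the kind of fact I look up: the mode of the Beta(a, b) distribution is (a − 1)/(a + b − 2) when a, b > 1, and a quick numerical sanity check in R (a sketch, using Beta(2, 5) as an arbitrary example) confirms it:

```r
# Mode of the Beta(a, b) distribution: (a - 1)/(a + b - 2) for a, b > 1
a <- 2; b <- 5
analytic_mode <- (a - 1)/(a + b - 2)          # 1/5

# Find the maximum of the density numerically and compare
numeric_mode <- optimize(function(x) dbeta(x, a, b),
                         interval = c(0, 1), maximum = TRUE)$maximum

c(analytic = analytic_mode, numeric = numeric_mode)
```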

Another often-linked online resource is Wolfram MathWorld. While I’ve used it in the past when looking up topics in mathematics, I’m more than a little reluctant to use it after I happened to stumble upon their description of significance tests:

A test for determining the probability that a given result could not have occurred by chance (its significance).

…which is a gross misinterpretation of hypothesis testing and p-values (a topic which I’ve treated before on this blog).

The one resource that I really recommend, though, is Cross Validated, a question-and-answer site for all things statistics. There are some real gems among the best questions and answers that make for worthwhile reading for any statistician. It is also the place to go if you have a statistics question that you are unable to find the answer to, regardless of whether it’s about how to use the t-test or about the finer aspects of Le Cam theory. I strongly encourage all statisticians to add a visit to Cross Validated to their daily schedules. Putting my time where my mouth is, I’ve been actively participating there myself for the last few months.

Finally, Google and Google Scholar are the statistician’s best friends. They are extremely useful for finding articles, lecture notes and anything else that has ended up online. It’s surprising how often the answer to a question that someone asks you is “let me google that for you”.

For questions on R or more mathematical topics, Stack Overflow and the Mathematics Stack Exchange site are the equivalents of Cross Validated.

My German colleagues keep insisting that the German Wikipedia is far superior when it comes to statistics. While I can read German fairly well (in a fit of mathematical pretentiousness I once forced myself to read Kolmogorov’s Grundbegriffe), I still haven’t gathered the courage to venture beyond the English version.

Speeding up R computations Pt II: compiling

A year ago I wrote a post on speeding up R computations. Some of the tips that I mentioned then have since been made redundant by a single package: compiler. No more worrying about curly brackets or whether to write 3*3 or 3^2 – compiler solves those problems for you.

compiler provides a byte code compiler for R, which offers the possibility of compiling your functions before executing them. Among other things, this speeds up those embarrassingly slow for loops that you’ve been using:

> myFunction<-function() { for(i in 1:10000000) { 1*(1+1) } }
> library(compiler)
> myCompiledFunction <- cmpfun(myFunction) # Compiled function

> system.time( myFunction() )
   user  system elapsed 
 10.002   0.014  10.021 
> system.time( myCompiledFunction() )
   user  system elapsed 
  0.692   0.008   0.700 

That’s 14 times faster!

Functions written in C (imported using Rcpp) are still much faster than the compiled byte code, but for those of us who

  • don’t know C,
  • know C but prefer to write code in R,
  • know C but don’t want to rewrite functions that we’ve already written in R,

compiler offers a great way of speeding up computations. It has been included among the recommended R packages since R 2.13 (meaning that it comes with your basic R installation), and since R 2.14 most standard functions come pre-compiled. If you are still running an older version of R, it’s definitely time to update.
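If you’d rather not call cmpfun by hand for every function, the compiler package can also compile functions automatically via its just-in-time facility – a minimal sketch:

```r
library(compiler)

# Level 3 tells R to byte-compile all closures before their first use,
# so no explicit cmpfun() calls are needed.
enableJIT(3)

mySum <- function(x) { s <- 0; for(v in x) s <- s + v; s }
mySum(1:100)  # runs as compiled byte code; returns 5050
```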

Are Higgs findings due to chance?

As media all over the world are reporting that scientists at CERN may have glimpsed the Higgs particle for the first time, journalists struggle to explain why the physicists are reluctant to say that the results are conclusive. Unfortunately, they seem to be doing so by grossly misinterpreting one of the key concepts in modern statistics: the p-value.

First of all: what does statistics have to do with the Higgs boson findings? It seems that there would be no need for it – either you’ve seen the God particle, or you haven’t seen the Goddamn particle. But since the Higgs boson is so minuscule, it can’t be observed directly. The only chance of detecting it is to look for anomalies in the data from the Large Hadron Collider. There will always be some fluctuations in the data, so you have to look for anomalies that (a) are large enough and (b) are consistent with the properties of the hypothesised Higgs particle.

What the two teams at CERN reported yesterday is that both have found largish anomalies in the same region, independently of each other. They quantify how large the anomalies are by describing them, for instance, as roughly “two sigma events”. Events that have more sigmas, or standard deviations, associated with them are less likely to happen if there is no Higgs boson. Two sigma events are fairly rare: if there is no Higgs boson, then anomalies at least as large as these would only appear in one experiment out of 20. This probability is known as the p-value of the results.
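For the record, the roughly 1-in-20 figure for a two sigma event can be verified in R, assuming the usual two-sided normal approximation:

```r
# Probability of landing at least two standard deviations away
# from the mean, in either direction, under a normal distribution
p_two_sigma <- 2 * pnorm(-2)
p_two_sigma  # roughly 0.0455, i.e. about 1 in 20
```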

This is where the misinterpretation comes into play. Virtually every news report on the Higgs findings seems to present a two sigma event as meaning that there is a 1/20 probability of the results being due to chance – or, in other words, that there is a 19/20 probability that the Higgs boson has been found.

In what I think is an excellent explanation of the difference, David Spiegelhalter attributed the misunderstanding to a missing comma:

The number of sigmas does not say ‘how unlikely the result is due to chance’: it measures ‘how unlikely the result is, due to chance’.

The difference may seem subtle… but it’s not. The number of sigmas, or the p-value, only tells you how often you would see this kind of result if there were no Higgs boson. That is not the same as the probability of there being such a particle – there either is or isn’t a Higgs particle. Its existence is independent of the experiments, and therefore the experiments don’t tell us anything about the probability of its existence.

What we can say, then, is only that the teams at CERN have found anomalies that are suspiciously large, but not so large that we feel that they can’t simply be due to chance. Even if there were no Higgs boson, anomalies of this size would appear in roughly one experiment out of twenty, which means that they are slightly more common than getting five heads in a row when flipping a coin.

If you flip a coin five times and get only heads, you wouldn’t say that since the p-value is 0.03, there is a 97 % chance that the coin only gives heads. The coin either is or isn’t unbalanced. You can’t make statements about the probability of it being unbalanced without having some prior information about how often coins are balanced and unbalanced.
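The 0.03 above is easy to verify in R – either as 0.5^5 directly, or via binom.test, which computes the same one-sided tail probability:

```r
# Probability of five heads in five flips of a fair coin
p_heads <- 0.5^5  # 0.03125

# The same number as the p-value of a one-sided binomial test
p_test <- binom.test(5, 5, p = 0.5, alternative = "greater")$p.value

c(p_heads, p_test)
```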

The experiments at CERN are no different from coin flips, but there is no prior information about in how many universes the Higgs boson exists. That’s why the scientists are reluctant to say that they’ve found it. That’s why they need to make more measurements. That’s why two teams are working independently of each other. That’s why there isn’t a 95 % chance that the Higgs boson has been found.

David Spiegelhalter tweeted about his Higgs post with the words “BBC wrong in describing evidence for the Higgs Boson, nerdy pedantic statistician claims”. Is it pedantic to point out that people are fooling themselves?

I wholeheartedly recommend the Wikipedia list of seven common misunderstandings about the p-value.

Speeding up R computations

The past few days I’ve been going through some R code that I wrote last year, when I was preparing a massive simulation-based power study for some tests for multivariate normality that I’ve been working on. My goal was to reduce the time needed to run the simulation. I wasn’t expecting great improvement, since I’ve always believed that the most common R functions are properly vectorized and optimized for speed. Turns out I was wrong. Very wrong.

The first thing I did was replace all parentheses ( ) with curly brackets { }. I was inspired to do so by this post (and this, via Xi’an’s Og) over at Radford Neal’s blog. As he pointed out, code that uses parentheses is actually slower than the same code with curly brackets:

> system.time( for(i in 1:1000000) { 1*(1+1) } )
   user  system elapsed 
  1.337   0.005   1.349 
> system.time( for(i in 1:1000000) { 1*{1+1} } )
   user  system elapsed 
  1.072   0.003   1.076 

Similarly, you can compare a*a and a^2:

> system.time( for(i in 1:10000000) 3^2 )
   user  system elapsed 
  5.048   0.028   5.088 
> system.time( for(i in 1:10000000) 3*3 )
   user  system elapsed 
  4.721   0.024   4.748 

So, a^2 is slower than a*a. This made me wonder: are there other built-in R functions that are slower than they ought to be?

One thing that I found very surprising, and frankly rather disturbing, is that mean(x) takes ten times as long as the “manual” alternative sum(x)/50 to compute the mean of the 50 real numbers in the vector x:

> x<-rnorm(50)
> system.time(for(i in 1:100000){mean(x)})
   user  system elapsed 
  1.522   0.000   1.523 
> system.time(for(i in 1:100000){sum(x)/length(x)})
   user  system elapsed 
  0.200   0.000   0.200 
> system.time(for(i in 1:100000){sum(x)/50})
   user  system elapsed 
  0.167   0.000   0.167 
> system.time(for(i in 1:100000){ overn<-rep(1/50,50); x%*%overn })
   user  system elapsed 
  0.678   0.000   0.677 
> overn<-rep(1/50,50); system.time(for(i in 1:100000){ x%*%overn })
   user  system elapsed 
  0.164   0.000   0.164 

I guess that the R core development team has been focusing on making R an easy-to-use high-level programming language rather than on optimizing all functions, but the poor performance of mean is just embarrassing.

Similarly, the var function can be greatly improved upon. Here are some of the many possibilities:

> x <- rnorm(50)
> system.time( for(i in 1:100000) { var(x) } )
   user  system elapsed 
  4.921   0.000   4.925 
> system.time( for(i in 1:100000) { sum((x-mean(x))^2)/{length(x)-1} } )
   user  system elapsed 
  2.322   0.000   2.325 
> system.time( for(i in 1:100000) { {sum(x*x)-sum(x)*sum(x)/length(x)}/{length(x)-1} } )
   user  system elapsed 
  0.736   0.000   0.737 
> system.time( for(i in 1:100000) { {sum(x*x)-sum(x)*sum(x)/50}/49 } )
   user  system elapsed 
  0.618   0.000   0.618 

> system.time( for(i in 1:100000) { sx<-sum(x); {sum(x*x)-sx*sx/50}/49 } )
   user  system elapsed 
  0.567   0.000   0.568

I changed all the uses of mean in my code to “sum/n” instead (and avoided using var entirely) and found that this sped things up quite a bit.

Another trick to speed up your computations is to create the vectors that you wish to fill within a loop with the right number of elements from the start. While

a<-NA; for(j in 1:100) a[j]<-j

works just fine, it is actually quite a bit slower than

a<-rep(NA,100); for(j in 1:100) a[j]<-j

You could create a in other ways as well, of course, for instance by a<-vector(length=100). Here are the numbers:

> system.time( for(i in 1:100000) { a<-NA; for(j in 1:100) a[j]<-j })
   user  system elapsed 
 37.383   0.092  37.482 
> system.time( for(i in 1:100000) { a<-rep(NA,100); for(j in 1:100) a[j]<-j })
   user  system elapsed 
 25.866   0.065  25.936 
> system.time( for(i in 1:100000) { a<-vector(length=100); for(j in 1:100) a[j]<-j })
   user  system elapsed 
 25.517   0.022  25.548

In my case, I’d been a bit sloppy with creating the vectors in my loops in the proper way, so I changed this in my code as well.

In my simulation study, I simulate multivariate random variables, compute some test statistics and use these to estimate the powers of the normality tests against various alternatives. After doing the changes mentioned above, I compared the performance of my old code to that of the new code, for 1000 iterations of the procedure:

> system.time( source("oldCode.R") )
   user  system elapsed 
548.045   0.273 548.622 
> system.time( source("newCode.R") )
   user  system elapsed 
 93.138   0.002  93.194

The improved code is almost 6 times faster than the old code. When you do ten million or so iterations, that matters. A lot.

In conclusion, it’s definitely possible to speed up your code significantly if you know about the pitfalls of R. I suspect that I’ll be obsessed with finding more pitfalls over the next few weeks, so I’d be thankful for any hints about other weaknesses that R has.

It should probably be mentioned that R is really fast when things are properly vectorized. Last year, a coworker who uses Matlab challenged me to perform a number of matrix computations faster in R than in Matlab. To his great surprise, R won.
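To illustrate the point, here is a sketch comparing a hypothetical loopSum, standing in for naive unvectorized code, with the built-in vectorized sum:

```r
set.seed(1)
x <- rnorm(1e6)

# Naive element-by-element summation in an R loop
loopSum <- function(x) { s <- 0; for(v in x) s <- s + v; s }

system.time( loopSum(x) )  # noticeably slower...
system.time( sum(x) )      # ...than the vectorized built-in
```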

As a final remark, I’m now facing a bit of a dilemma. Should I write readable code, a^6, or fast code, a*a*a*a*a*a?

Update: looking to speed up your R computations even more? See my posts on compiling your code and parallelization.