A reply to Testing via credible sets

Last week I posted a manuscript on arXiv entitled On decision-theoretic justifications for Bayesian hypothesis testing through credible sets. A few days later, a discussion of it appeared on Xi’an’s Og. I’ve read papers and books by Christian Robert with great interest and have been a follower of his “Og” for quite some time, and so I was honoured and excited when he chose to blog about my work. I posted a comment to his blog post, but for some reason or other it has not yet appeared on the site. I figured that I’d share my thoughts on his comments here on my own blog for the time being.

The main goal of the paper was to discuss decision-theoretic justifications for testing the point-null hypothesis Θ0={θ0} against the alternative Θ1={θ: θ≠θ0} using credible sets. In this test procedure, Θ0 is rejected if θ0 is not in the credible set. This is not the standard solution to the problem, but it is certainly not uncommon (I list several examples in the introduction to the paper). Tests of composite hypotheses are also discussed.

Judging from his blog post, Xi’an is not exactly in love with the manuscript. (Hmph! What does he know about Bayesian decision theory anyway? It’s not like he wrote the book on… oh, wait.) To some extent however, I think that his criticism is due to a misunderstanding.

Before we get to the misunderstanding though: Xi’an starts out by saying that he doesn’t like point-null hypothesis testing, so the prior probability that he would like it was perhaps not that great. I’m not crazy about point-null hypotheses either, but the fact remains that they are used a lot in practice and that there are situations where they are very natural. Xi’an himself gives a few such examples in Section 5.2.4 of The Bayesian Choice, as do Berger and Delampady (1987).

What is not all that natural, however, is the standard Bayesian solution to point-null hypothesis testing. It requires a prior with a mass on θ0, which seems like a very artificial construct to me. Apart from leading to such complications as Lindley’s paradox, it leads to very partial priors. Casella and Berger (1987, Section 4) give an example where the seemingly impartial prior probabilities P(θ0)=1/2 and P(Θ1)=1/2 actually yield a test with strong bias towards the null hypothesis. One therefore has to be extremely careful when applying the standard tests of point-null hypotheses, and carefully think about what the point-mass really means and how it affects the conclusions.

Tests based on credible sets, on the other hand, allow us to use a nice continuous prior for θ. Such a prior can, unlike the prior used in the standard solution, be non-informative. As for informative priors, it is often easier to construct a continuous prior based on expert opinion than it is to construct a mixed prior.

Theorem 2 of my paper presents a weighted 0-1-type loss function that leads to the acceptance region being the central (symmetric) credible interval. The prior distribution is assumed to be continuous, with no point-mass at θ0. The loss is constructed using directional conclusions, meaning that when Θ0 is rejected, it is rejected in favour of either {θ: θ<θ0} or {θ: θ>θ0}, instead of simply being rejected in favour of {θ: θ≠θ0}. Indeed, this is how credible and confidence intervals are used in practice: if θ0 is smaller than all values in the interval, then Θ0 is rejected and we conclude that θ>θ0. The theorem shows that tests based on central intervals can be viewed as a solution to the directional three-decision problem – a solution that does not require a point-mass for the null hypothesis. I therefore do not agree with Xi’an’s comment that “[tests using credible sets] cannot bypass the introduction of a prior mass on Θ0”. While a test traditionally only has one way to reject the null hypothesis, allowing two different directions in which Θ0 can be rejected seems perfectly reasonable for the point-null problem.
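To make the directional decision rule concrete, here is a minimal sketch for the special case of a normal posterior. The function name, the 0.95 level and the example numbers are my own illustrative choices, not taken from the paper:

```python
from statistics import NormalDist

def central_interval_test(posterior: NormalDist, theta0: float,
                          level: float = 0.95) -> str:
    """Directional test of theta = theta0 based on the central credible interval.

    Rejection is always in favour of one of the one-sided alternatives,
    as in the three-decision formulation discussed above.
    """
    alpha = 1 - level
    lower = posterior.inv_cdf(alpha / 2)       # central interval endpoints
    upper = posterior.inv_cdf(1 - alpha / 2)
    if theta0 < lower:
        return "reject: theta > theta0"        # theta0 below the interval
    if theta0 > upper:
        return "reject: theta < theta0"        # theta0 above the interval
    return "accept"

# Posterior N(1.2, 0.5^2): the 95% central interval is roughly (0.22, 2.18),
# so theta0 = 0 is rejected in favour of theta > 0.
print(central_interval_test(NormalDist(1.2, 0.5), 0.0))
```

The same rule works for any posterior with computable quantiles; the normal case just keeps the sketch short.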

Regarding this test, Xi’an writes that it “essentially [is] a composition of two one-sided tests, […], so even at this face-value level, I do not find the result that convincing”. But any (?) two-sided test can be said to be a composition of two one-sided tests (and therefore implicitly includes a directional conclusion), so I’m not sure why he regards this as a reason to remain unconvinced about the validity of the result.

As for the misunderstanding, Theorem 3 of the paper deals with one-sided hypothesis tests. It was not meant as an attempt to solve the problem of testing point-null hypotheses, but rather to show how credible sets can be used to test composite hypotheses – as was Theorem 4. Xi’an’s main criticism of the paper seems to be that the tests in Theorems 3 and 4 fail for point-null hypotheses, but they were never meant to be used for such hypotheses in the first place. After reading his comments, I realized that this might not have been perfectly clear in the first draft of the paper. In particular, the abstract seemed to imply that the paper only dealt with point-null hypotheses, which is not the case. In the submitted version (not yet uploaded to arXiv), I’ve tried to make it clearer that both point-null and composite hypotheses are studied.

There are certainly reasons to question the use of credible sets for testing, chief among them being that the evidence against Θ0 is evaluated in a roundabout way. On the other hand, credible sets are reasonably easy to compute and tend to have favourable frequentist properties. It seems to me that a statistician who wants a method that is reasonable in both Bayesian and frequentist inference would want to consider tests based on credible sets.

Online resources for statisticians

My students often look up statistical methods on Wikipedia. Sometimes they admit this with a hint of embarrassment in their voices. They are right to be cautious when using Wikipedia (not all pages are well-written) and I’m therefore pleased when they ask me if there are other good online resources for statisticians.

I usually tell them that Wikipedia actually is very useful, especially for looking up properties of various distributions, such as density functions, moments and relationships between distributions. I wouldn’t cite the Wikipedia page on, say, the beta distribution in a paper, but if I need to check what the mode of said distribution is, it is the first place that I look. While not as exhaustive as the classic Johnson & Kotz books, the Wikipedia pages on distributions tend to contain quite a bit of surprisingly accurate information. That being said, there are misprints to be found, just as with any textbook (the difference being that you can fix those misprints – I’ve done so myself on a few occasions).

Another often-linked online resource is Wolfram MathWorld. While I’ve used it in the past when looking up topics in mathematics, I’m more than a little reluctant to use it after I happened to stumble upon their description of significance tests:

A test for determining the probability that a given result could not have occurred by chance (its significance).

…which is a gross misinterpretation of hypothesis testing and p-values (a topic which I’ve treated before on this blog).

The one resource that I really recommend, though, is Cross Validated, a questions-and-answers site for all things statistics. There are some real gems among the best questions and answers that make for worthwhile reading for any statistician. It is also the place to go if you have a statistics question that you are unable to find the answer to, regardless of whether it’s about how to use the t-test or about the finer aspects of Le Cam theory. I strongly encourage all statisticians to add a visit to Cross Validated to their daily schedules. Putting my time where my mouth is, I’ve been actively participating there myself for the last few months.
Finally, Google and Google Scholar are the statistician’s best friends. They are extremely useful for finding articles, lecture notes and anything else that has ended up online. It’s surprising how often the answer to a question that someone asks you is “let me google that for you”.

For questions on R or more mathematical topics, Stack Overflow and the Mathematics Stack Exchange site are the equivalents of Cross Validated.
My German colleagues keep insisting that German Wikipedia is far superior when it comes to statistics. While I can read German fairly well (in a fit of mathematical pretentiousness I once forced myself to read Kolmogorov’s Grundbegriffe), I still haven’t gathered my guts to venture beyond the English version.

Are Higgs findings due to chance?

As media all over the world are reporting that scientists at CERN may have glimpsed the Higgs particle for the first time, journalists struggle to explain why the physicists are reluctant to say that the results are conclusive. Unfortunately, they seem to be doing so by grossly misinterpreting one of the key concepts in modern statistics: the p-value.

First of all: what does statistics have to do with the Higgs boson findings? It seems that there would be no need for it – either you’ve seen the God particle, or you haven’t seen the Goddamn particle. But since the Higgs boson is so minuscule, it can’t be observed directly. The only chance of detecting it is to look for anomalies in the data from the Large Hadron Collider. There will always be some fluctuations in the data, so you have to look for anomalies that (a) are large enough and (b) are consistent with the properties of the hypothesised Higgs particle.

What the two teams at CERN reported yesterday is that both have found largish anomalies in the same region, independently of each other. They quantify how large the anomalies are by describing them as, for instance, being roughly “two sigma events”. Events that have more sigmas, or standard deviations, associated with them are less likely to happen if there is no Higgs boson. Two sigma events are fairly rare: if there is no Higgs boson, then anomalies at least as large as these would only appear in one experiment out of 20. This probability is known as the p-value of the results.
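The conversion from sigmas to probabilities is just a normal tail calculation. A quick sketch, using the two-sided tail (whether the one- or two-sided version is the relevant one depends on the analysis):

```python
from statistics import NormalDist

z = NormalDist()  # standard normal distribution
for sigmas in (2, 3, 5):
    p = 2 * (1 - z.cdf(sigmas))  # two-sided tail probability
    print(f"{sigmas} sigma: p = {p:.1e}, about 1 in {round(1 / p)}")
```

At two sigma this gives p ≈ 0.0455, roughly the one-in-twenty figure quoted above; running the loop also shows how rapidly the probability shrinks as the number of sigmas grows.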

This is where the misinterpretation comes into play. Virtually every single news report on the Higgs findings seems to present a two sigma event as meaning that there is a 1/20 probability of the results being due to chance. Or, in other words, that there is a 19/20 probability that the Higgs boson has been found.

In what I think is an excellent explanation of the difference, David Spiegelhalter attributed the misunderstanding to a missing comma:

The number of sigmas does not say ‘how unlikely the result is due to chance’: it measures ‘how unlikely the result is, due to chance’.

The difference may seem subtle… but it’s not. The number of sigmas, or the p-value, only tells you how often you would see this kind of result if there was no Higgs boson. That is not the same as the probability of there being such a particle – there either is or isn’t a Higgs particle. Its existence is independent of the experiments, and therefore the experiments don’t tell us anything about the probability of its existence.

What we can say, then, is only that the teams at CERN have found anomalies that are suspiciously large, but not so large that we feel that they can’t be due simply to chance. Even if there was no Higgs boson, anomalies of this size would appear in roughly one experiment out of twenty, which means that they are slightly more common than getting five heads in a row when flipping a coin.

If you flipped a coin five times and got only heads, you wouldn’t say that since the p-value is 0.03, there is a 97 % chance that the coin only gives heads. The coin either is or isn’t unbalanced. You can’t make statements about the probability of it being unbalanced without having some prior information about how often coins are balanced and unbalanced.
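The 0.03 here is simply the probability of five heads in five flips of a fair coin:

```python
# Probability of five heads in five independent flips of a fair coin
p_value = 0.5 ** 5
print(p_value)  # 0.03125
```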

The experiments at CERN are no different from coin flips, but there is no prior information about in how many universes the Higgs boson exists. That’s why the scientists are reluctant to say that they’ve found it. That’s why they need to make more measurements. That’s why two teams are working independently of each other. That’s why there isn’t a 95 % chance that the Higgs boson has been found.

David Spiegelhalter tweeted about his Higgs post with the words “BBC wrong in describing evidence for the Higgs Boson, nerdy pedantic statistician claims”. Is it pedantic to point out that people are fooling themselves?

I wholeheartedly recommend the Wikipedia list of seven common misunderstandings about the p-value.

Narcolepsy and the swine flu vaccine

Earlier this year the Finnish National Institute for Health and Welfare published a report about the increased risk of narcolepsy observed among children and adolescents vaccinated with Pandemrix®. In short, the conclusion was that the swine flu vaccine seemed to have had an unexpected side effect: the risk of narcolepsy, a sleep disorder, was larger for vaccinated children in the 4-19 year age group than for unvaccinated children in the same age group.

Steven Novella at Science-Based Medicine wrote a great piece about this last week, discussing how this should be interpreted. I’m not going to go into a discussion about the findings themselves, but I would like to discuss the following part of the press release:

In Finland during years 2009–10, 60 children and adolescents aged 4-19 years fell ill with narcolepsy. These figures base on data from hospitals and primary care, and the review of individual patient records by a panel of neurologists and sleep researchers. Of those fallen ill, 52 (almost 90 percent) had received Pandemrix® vaccine, while the vaccine coverage in the entire age group was 70 percent. Based on the preliminary analyses, the risk of falling ill with narcolepsy among those vaccinated in the 4-19 years age group was 9-fold in comparison to those unvaccinated in the same age group.

Sceptical commenters on blogs and forums have questioned whether a 9-fold increase in risk really was observed. Here’s the reasoning:

The estimated risk within a group is (the number of observed cases of the disease)/(size of the group). That is,

Risk for vaccinated child: 52/(n*0.7) = 1/n * 52/0.7
Risk for unvaccinated child: 8/(n*0.3) = 1/n * 8/0.3

where n is the number of children in the 4-19 age group.

So the relative risk, i.e. the risk increase for the vaccinated children, is

(52/0.7)/(8/0.3) ≈ 2.79.
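The sceptics’ back-of-the-envelope calculation is easy to check in Python (note that n cancels, so it never has to be specified):

```python
# Naive relative risk, ignoring how long each group was followed
risk_vaccinated = 52 / 0.7    # proportional to 52 / (n * 0.7)
risk_unvaccinated = 8 / 0.3   # proportional to 8 / (n * 0.3)
print(round(risk_vaccinated / risk_unvaccinated, 2))  # 2.79
```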

Hang on a minute. 2.79? If a 9-fold increase in risk was observed the relative risk should be 9! It seems that the Finnish epidemiologists made a mistake.

…or did they?

Not necessarily. When analyzing these data, we need to take time into account. The report itself is only available in Finnish, but using Google Translate I gathered that the unvaccinated group was studied from January 2009 to August 2010, whereas the vaccinated individuals were studied for eight months from the date of vaccination. In other words, the unvaccinated group had a longer time span in which they could fall ill.

That means that in order to calculate the relative risk, we need to divide the number of cases by the number of months that each group was studied, to get the risk per month. That eliminates the time factor. After doing this, the relative risk becomes

(52/(0.7·8))/(8/(0.3·20)) ≈ 6.96.

That’s higher, but still not 9. Well, to complicate things a bit, it seems that an individual was considered to be a part of the unvaccinated group until the date of vaccination, making the calculations a bit more difficult. When that is taken into account, along with other difficulties that no doubt occur when you have the actual data at hand, the relative risk probably becomes 9.

The full report is not yet available, so I can’t say how close the above approach is to the one that was actually used in the analysis. Nevertheless, I hope that this post can help shed some light on the statistics behind the statement about a 9-fold increase.

A problem with this approach is that the number of months for which the unvaccinated group was studied might affect the results, just as in the shark attack example that I wrote about last week. Changing the time span for the unvaccinated group to, say, January 2008 to August 2010 does not, however, change the conclusion in this case. The analysis seems to be pretty robust to the length of time for which the control group was studied.
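As a rough illustration of that robustness, we can redo the per-month calculation while varying the length of the follow-up window for the unvaccinated group. The window lengths below are my own illustrative choices: 20 months corresponds to January 2009 to August 2010, and 32 months to January 2008 to August 2010.

```python
# Per-month relative risk for different follow-up windows (in months)
# for the unvaccinated group; the vaccinated group is followed 8 months.
for months in (20, 26, 32):
    rr = (52 / (0.7 * 8)) / (8 / (0.3 * months))
    print(f"{months} months: relative risk = {rr:.1f}")
```

The estimated relative risk stays well above 1 in every case, so the qualitative conclusion of an elevated risk for the vaccinated group is unaffected.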

WHO issued some comments regarding the Finnish study that are well worth reading.