An AI’s take on happiness, or, the bandsaw problem in machine learning

Computers don’t have feelings. Perhaps they never will – could feelings and emotions really be taught to an artificial intelligence (AI)? Well, maybe. Although many would say that emotions are best described in the abstract language of art and poetry, quantitative research into emotions like happiness is being carried out even as I’m writing this. Earlier this year, a group of researchers from the University of Tokyo, MIT and the Recruit Institute of Technology posted a paper on arXiv describing a database of “100,000 happy moments” – sentences describing moments when people were happy – and how they collected those moments using crowdsourcing.

100,000 happy moments? That’s quite a lot of happiness. Maybe enough to teach an AI what it means to be happy? This is the story of how I tried to do just that, and what a happy AI can tell us about black-box machine learning methods, the importance of knowing when and why things go wrong, and bandsaws. Lots and lots of bandsaws.


Teaching a neural network to be happy

So, I had decided to train an artificial intelligence (or more precisely, a neural network) to write messages about happy moments, using the “happy moments” database to learn what such messages can look like (the so-called HappyDB can be downloaded from GitHub). I did this using a recurrent neural network with LSTM units (using this TensorFlow code with some minor modifications). Some years ago, this would have been a major feat, but nowadays it’s a routine task in machine learning (see Botnik or AI Weirdness for lots of great – and funny – examples).

Clearly, an AI with feelings needs to have a name, and the one I came up with was hAIppy (yeah, sorry). Training finished, and it was time for hAIppy to tell me about one of its happy moments. At first, the sentences it wrote were far too long, like this:
I was able to go to the park with my family and I was able to get a new job that I was able to get to see my family and I was able to get a new job that I was able to get to see my family and I was able to get a new job that I was able to get a lot of fun.

After I’d fiddled with the settings to fix that, hAIppy started repeating itself:
I was happy when I was able to spend time with my family and I was able to get a new car for the first time in a long time.
I was happy when I was able to spend time with my family and I was able to get a new car for the first time in a long time.
I was happy when I was able to spend time with my family and I was able to get a new car for the first time in a long time.
I was happy when I was able to spend time with my family and I was able to get a new car for the first time in a long time.

So, I added some rules to keep it from saying the same thing over and over again.

In the end, the results were pretty good. The neural network was able to create novel descriptions of moments of happiness, and most of the time they made perfect sense. Most messages involved family and/or food, so it clearly picked up on the fact that those things are very important for our happiness.

But sometimes the messages made less sense. Sometimes hAIppy decided to become a little philosophical. And sometimes, it went to dark places, as you can see in this example of what a conversation between hAIppy and myself (in which all typewriter font text was generated by hAIppy) could look like:

-When were you happy, hAIppy?
-I was happy that I was able to get out of work I went to the park with my husband and we had a great time.
-That’s nice. And lovely to hear that you’re married! I had no idea. Let’s try this again. When were you happy?
-I watched a movie with my girlfriend and we had a great time.
-Right, right, wonderful… So, tell me about another time you were happy.
-I was happy when I was able to get a new bandsaw, to see the movie and I was very happy when I got to see my grandson.
-A bandsaw…? Well, I’m sure you’ll find some use for that. When else were you happy?
-I was happy when I was able to get a new bandsaw, to create the tape recorder was a very happy moment for me as it is something I was looking forward to achieving and the news made my parents quite happy and proud.
-You built a tape recorder using a bandsaw?! I’m impressed, to say the least..! And good to hear that it made your parents proud. One more time: when were you happy?
-I was happy when I found out that my favorite musician, Juliana Hatfield, is putting me a hug and said: "Daddy, you're back!"
-You’re the father of American musician Juliana Hatfield? Were you making the tape recorder in order to tape her music then? Interesting! I feel like I’m really getting to know you. When else were you happy?
-I was happy when I was able to get a new bandsaw, to see the movie and I was very happy to see the attention directed elsewhere.
-Well, I for one would appreciate it if you directed your attention away from the bandsaw… When else were you happy?
-striven Wars fluff dryer
-What. I don’t even. When else?
-Finished what? Your tape recorder? And why are you screaming at me? Jeez. Please calm down and tell me about when you were happy instead.
-You leave life in the present? hAIppy, is this a call for help? And why are you still screaming? Can you just tell me about a happy moment, please?
-I do a difficult job at work.
-Join the club, my friend. I’m trying to get a bot to tell me about when it was happy, but it just keeps spewing nonsense at me. Let’s try again.
-The pursuit of happiness is such little task.
-Yeah, trivial, isn’t it…? I’m sure you must’ve been happy loads of times then, so tell me about a time when you were happy.
-We bought a mini zoo.
-Wow. Sounds lovely! And expensive. You must really like animals. Tell me about some other times you were happy.
-I got a free coffee at the store. I was happy when I was a kid. I was happy when I splurged for dessert. I went paddleboarding in an old apartment in NYC. I had sex with my friend whom I haven't seen. My wife came to my fraternity to make breakfast. I played the basketball game because I was at a prestigious festival. I finished a few books at the beginning of the closing of the year. Conducted another theme morning with my mom. We fed this made me happy. At night I got some of my 17 cats at the grocery store I found. I bought a new cat from an online cafe.
-Sounds like you’ve had a busy week! One last time now: tell me about the last time you were happy.
-After I cooked a really good cat.
-You what?! I’m shutting you down, hAIppy.
-I was able to get a new bandsaw, to create my favorite sports team.
-Shutting. You. Down.

There we go. hAIppy likes family, food and cats (and apparently cats=food at times), and has a strange fascination with bandsaws.


Why bandsaws matter

All of this was just a bit of fun, of course. Teaching AI about emotions and what’s important to us is in reality going to be a much, much greater challenge. But maybe there are a couple of things to learn from hAIppy. While its obsession with bandsaws is amusing, it also serves as a reminder that machine learning models sometimes pick up on the wrong things in the data. In fact, I ran three different incarnations of hAIppy using different settings, and all three ended up with a fondness for bandsaws. Not dogs (mentioned in 2,649 moments), not rain (1,496 moments), not TV (814 moments), but bandsaws, mentioned in just one moment out of 100,000.

In this example, it is pretty clear to us that something has gone wrong with the network. But in other cases – say, when black-box machine learning methods are used to diagnose diseases or to decide whether a loan application is approved – problems like this may be difficult to spot. Just as hAIppy greatly overestimates the probability that someone uses the word bandsaw when explaining why they are happy, a black-box model can overestimate the probability that, say, people with a rare ZIP code default on their loans. The output from the box may look normal on the surface (someone could of course mention bandsaws when explaining why they are happy) but in fact be completely wrong (most people don’t obsess about how great things are now that they have a new bandsaw, so neither should our model) – and as with the bandsaw, we can only find out that something is wrong once a pattern appears. Mentioning a bandsaw once is perfectly fine; mentioning it over and over again means that we have a problem.

So when next someone shows you the output from a machine learning model, ask them: where are your bandsaws?

Answering this question is not always easy. Far from it – in fact, in more serious applications, such as medical or financial problems, spotting the bandsaws in the model output is often very difficult. Many machine learning methods, including neural networks, are inherently opaque, and it is hard for us to tell why they made a certain prediction or decision. When those decisions are binary (disease/no disease, loan application approved/rejected), it is virtually impossible to detect when something’s gone wrong just from the output.

There are good reasons to look for bandsaws, too. Sometimes erroneous predictions are obviously wrong – like when a model predicts that the temperature in Uppsala tomorrow will be 46°C. Other times the errors are much less obvious, and those are the errors we should be most concerned about. Far too often, we are content with reporting mean squared errors, sensitivity or other metrics of how well a model predicts. Those measures are useful, but they don’t tell the whole story. They don’t tell us when we have repeated non-obvious errors (like hAIppy’s bandsaws) in our model’s output, or what those errors are. And therefore, they don’t tell us much about what we should do to improve our model.

There are ways to look for bandsaws, of course. In the above conversation with hAIppy, there are moments where the model suddenly ends up in parts of the sample space (in this case, the set of all words that hAIppy has learned) that it hasn’t explored well enough to make educated guesses about how to continue the sentence. This is what leads to statements like striven Wars fluff dryer. And too often it ends up in a part of the sample space where the best guess for the next word in the sentence is bandsaw. In other applications, this type of problem can be diagnosed by assessing how similar the input to the model is to the data in the training set: if a lot of the new data is different from the data that the model has been trained on, there is an increased risk of poor predictions. You can do this using measures like the Mahalanobis distance, or by plotting a bubble chart in which the first two principal components of your data make up the two axes, bubble colours show whether an observation is from the test or training set, and bubble sizes represent prediction errors. Another interesting option is to run an unsupervised cluster analysis on the test dataset and compare the performance of the model on the different clusters: if you find a cluster for which the performance is significantly worse, you may well be on the bandsaws’ trail. And there are many other options for exploring what’s inside your black box (here, here, here and here).
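As a minimal sketch of the distance-based idea (the data here is simulated, and the 99% chi-squared cut-off is just one common choice of threshold):

```r
# Flag test observations that lie far from the training data, using the
# Mahalanobis distance (stats::mahalanobis ships with base R).
set.seed(1)
train <- matrix(rnorm(200 * 5), 200, 5)                 # training inputs
test  <- rbind(matrix(rnorm(50 * 5), 50, 5),            # similar to training
               matrix(rnorm(10 * 5, mean = 4), 10, 5))  # far from training

d2 <- mahalanobis(test, center = colMeans(train), cov = cov(train))

# Large distances suggest that the model is being asked to predict in parts
# of the sample space it hasn't explored - prime bandsaw territory.
suspicious <- which(d2 > qchisq(0.99, df = ncol(train)))
```

Observations flagged this way deserve a closer look before you trust the model’s predictions for them.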

The point I’m trying to make here is this: you can’t tell from loss functions and common performance measures whether your black-box model has developed an obsession with woodworking machines. Simply looking at those numbers is not enough. To find its bandsaws, you need to run additional experiments on your trained model. Look for repeated mistakes, and for parts of the sample space where it performs poorly. That will tell you where it needs to improve.


Transparency is important – for two reasons

Recently, there have been calls (and rightly so!) for accountability and transparency when machine learning is used – if your loan application is rejected by an algorithm, you should have the right to know why.

I agree with the need for this: when machines and algorithms rule our lives, we need insight into how they make their decisions. But for those of us who build the machine learning models there is another reason for trying to open up the black box: it will help us find the bandsaws – the erroneous decisions that look perfectly normal – and thereby help us improve our models. And when you get rid of the bandsaws, you can kick back, relax, and be happy – with family, food… or your very own mini zoo.

Analyse Bioscreen growth curves using BAT 2.0

Some years ago, I developed an R script for analysing growth curves from Bioscreen C MBR machines. The script, called BAT (an acronym for Bioscreen Analysis Tool), has been used to analyse Bioscreen data in several labs across the globe – for instance in the work leading up to our paper on reversion of antibiotic resistance, published last year. It allows the user to quickly compute the growth rates and doubling times in hundreds of growth curves.

BAT 1.0 worked just fine most of the time, but required the user to learn a little bit of R. That threshold was too high for some users, and so I’ve thought about creating a more user-friendly version of the script for a while now. I’ve finally gotten around to doing just that, creating a version that can be run directly in any modern browser, by using a wonderful tool called Shiny. Users can now use BAT to analyse Bioscreen growth curves in their web browsers, without installing or knowing how to use R. It’s free, and always will be. And so, without further ado, I present to you:

BAT 2.0

Still reading? Well, in that case, I’m happy to tell you a little bit more about what BAT 2.0 does and how to use it.

Some background
Bioscreen machines, manufactured by Oy Growth Curves Ab Ltd, can be used to measure the growth of all kinds of microorganisms – bacteria, amoebae, fungi, yeast, algae, bacteriophages and so on – in different media (BAT was originally developed to analyse bacterial growth, which may be evident in some of the terminology used below). It measures growth in terms of changes in optical density (OD). Growth curves are what we get when we plot the growth against time, as in this figure, showing the growth of E. coli in a rich medium (Müller-Hinton broth):

Notice that I’ve plotted the logarithm of the OD along the y-axis, which allows us to visually identify the log phase of the bacterial growth as a straight line (shown in red). By fitting a straight line to the log phase, we can find the doubling time of the growth. So how can we use BAT to plot curves like this and fit lines, and to find growth rates and doubling times?
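The fitting step itself is ordinary linear regression on the log scale. Here is a small sketch with made-up data (the OD levels used for masking are assumptions, not BAT’s defaults):

```r
# Simulated growth curve: OD doubling every 0.5 hours from a low baseline
time <- seq(0, 5, by = 0.25)                        # hours
od   <- 0.01 * 2^(time / 0.5)

# Look for the log phase inside a masking interval of OD values
logPhase <- od > 0.02 & od < 0.2

fit <- lm(log(od[logPhase]) ~ time[logPhase])
doublingTime <- log(2) / coef(fit)[2]               # here: 0.5 hours
rValue <- cor(time[logPhase], log(od[logPhase]))    # the R value BAT checks
```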

Tab: Data
When you start BAT 2.0 you’ll see the following:

Notice that there are several tabs in the menu. When using BAT, we always start with the leftmost tab (Data) and gradually work our way further right (all the way to Results).

The first thing to do is to upload the csv file containing the results from your Bioscreen run.

You can think of csv files as a mixture between text files and Excel spreadsheets. To view your data, you can open them in Excel, but LibreOffice Calc is much better at handling them.

Unfortunately, csv files from different machines don’t always look the same. First of all, the character encoding will differ depending on your computer (language settings, operating system, etc.). Second, the different columns of the csv file can be separated either by a comma, a semicolon or a tab. In order for BAT to read the file properly, you need to tell it the encoding of your file, and how it is separated. Once you’ve given BAT the right settings, the first few lines of your file will be displayed in the right panel.

Here’s an example csv file that you can try: a comma-separated csv file encoded with UTF-8.
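In R terms, what happens when BAT reads your file is roughly the following (the filename and settings here are assumptions, to be matched to your own file):

```r
bioscreen <- read.csv("bioscreen_run.csv",
                      sep = ",",               # or ";" or "\t"
                      fileEncoding = "UTF-8",  # must match your file's encoding
                      check.names = FALSE)

head(bioscreen)  # inspect the first few lines, like BAT's right panel
```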

So, how do you find out what encoding your file uses? If you’re a Linux user, you can open a terminal and type

cd ~/path/to/your/file
file -i yourfile.csv

to find the encoding, and if you’re using macOS, open a terminal and type

cd ~/path/to/your/file
file -I yourfile.csv

If you’re using Windows, have a look here instead.

Is your file in an encoding not included in the list in BAT? You can either try changing the encoding or send me an email, so that I can add it to the list.

Once your data is shown correctly to the right, as in the figure below, you’re ready to proceed to the next tab.

Tab: Blank wells
‘Blank’ wells, i.e. wells only containing the medium, can (and should!) be used to adjust for changes in OD not caused by microbial growth. Check the boxes corresponding to the blank wells on your plate. Here’s an example of what your blank wells can look like:

If you don’t have any blank wells, uncheck all boxes and move to the next tab, where you then will need to select No adjustment (expect to see error messages in red prior to checking the No adjustment box – that’s normal and nothing to worry about).

(If you are using the example csv file, wells 141-144 should be blank.)

Tab: Reference wells
The first thing to do on this tab is to select which method to use when adjusting the OD measurements for the remaining wells using the blank wells:

  • Method 1: the average OD of the blank wells during the entire run is subtracted from the OD measurements.
  • Method 2:  the average OD of the blank wells at each time point is subtracted from the OD measurements at that same time point. This is the recommended method, as it captures the changes in OD not due to microbial growth at each time point.
  • No adjustment: no adjustments are made. Use at your own risk.
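In R terms, the two adjustments amount to something like this (toy numbers, and the well names are assumptions):

```r
# Toy data: a time column, two sample wells and two blank wells
bioscreen <- data.frame(Time    = 1:3,
                        Well101 = c(0.11, 0.20, 0.41),
                        Well102 = c(0.12, 0.22, 0.45),
                        Well141 = c(0.10, 0.11, 0.12),  # blank
                        Well142 = c(0.10, 0.10, 0.12))  # blank

blanks <- bioscreen[, c("Well141", "Well142")]
odCols <- c("Well101", "Well102")

# Method 1: subtract the average blank OD over the entire run
adj1 <- bioscreen[, odCols] - mean(as.matrix(blanks))

# Method 2 (recommended): subtract the average blank OD at each time point
adj2 <- bioscreen[, odCols] - rowMeans(blanks)
```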

After that, you should select the wells containing your reference strain. Relative growth rates will be computed with respect to this strain. If you don’t have a reference strain and are only interested in absolute growth rates (i.e. the doubling times), simply pick one of your strains to use as a reference.

Tab: Other wells
BAT is now almost ready to compute the growth rates in your wells. It will attempt to find the log phase, but you’ll need to provide a little more information first: where the log phase can be found, how many replicates of each strain you have and what you consider to be an acceptable fit:

  • Masking interval: this sets the OD levels between which BAT will look for the log phase. In the plots to the right, this is the interval between the blue lines. Use the default settings or adjust them until the interval in the plots to the right seems appropriate for most wells.
  • Number of replicates: if you have replicates of your strains, you can set the number of replicates here. Replicates are assumed to be located in consecutive wells (e.g. wells 105, 106 and 107). BAT will then compute the mean and standard deviation of the growth rate of the strain.
  • Minimum R value: the R value is the linear correlation between time and log(OD) for the fitted line. If the R value for the line fitted in the masking interval is less than the value you set here, BAT will conclude that it’s failed to find the log phase, and will ask you to manually adjust the settings for that particular well in the next tab.
    In theory, the R value should be 1 during the log phase, but in practice it is always lower because of small measurement errors. A lot of the time, you can expect R to be greater than 0.999 during the log phase, but for some runs there will be more noise, in which case you may need to set a lower minimum R value.

Here’s an example of what you should be seeing:

In addition to this, you can also choose which wells to analyse. The default is to analyse all wells – simply uncheck the boxes for the wells that you do not wish to analyse.

Once satisfied with your settings, click Save results and proceed to the next tab.

Tab: Manual fitting
If there were some wells for which BAT failed to fit a line with an acceptable R value, you will be asked to change the interval in which BAT looks for the log phase. You can do this either by choosing a vertical masking interval (where you pick which log(OD) values to look for the log phase among) or a horizontal masking interval (where you choose between which time points to look for the log phase). Your chosen interval is shown in the plot to the right. Once you’ve found an interval with a sufficiently high R value, click Save results.

Here’s an example showing a horizontal masking interval:

BAT will prompt you to repeat this procedure for all wells with too low an R value.

If you get stuck on a well and simply can’t find an interval where you get a sufficiently high R value, choose horizontal masking and put the two sliders very close, so that only two points are used for the estimate. Then R will always be 1 and you can continue (but don’t trust the growth rate estimate for that well!). Alternatively, you can go back to Other wells and uncheck the box for that well.

Tab: Results
To the right, you will see a table showing the results for your data. This includes the slope of the fitted lines (fittedValue), doubling times (doubTime) and relative growth rates (growthRate) for each well, as well as averages and standard deviations for each strain. The mean doubling time for the strain is called groupMeanDoubTime, the mean relative growth rate is called groupMeanGrowthRate and the standard deviation for the relative growth rate is called groupGrowthRateSD. Here’s an example of what the results may look like:

If you are satisfied with the results, you can also download this table as a csv file (comma-separated with UTF-8 encoding) and download the plots showing the growth curves and the fitted lines for all wells in a pdf file. If not, you can go back to previous tabs and change the settings.

If you use BAT for your research, please use the following citation:

Thulin, M. (2018). BAT: an online tool for analysing growth curves. Retrieved from

For frequent users
If you are planning on using BAT extensively, I highly recommend that you install it on your computer rather than running the online version. It will still look the same and you’ll still use it in exactly the same way. There are two reasons for installing it:

  • BAT will run much faster on your computer.
  • BAT is currently hosted on a free server, meaning that its use is limited to 25 hours per month. If you use the online version a lot, you’ll block others from using it.

To install and run BAT on your computer, follow these five easy steps:

  1. Download RStudio (choose the free desktop version) and install it.
  2. Download the source code for BAT from GitHub: click Clone or download and then Download ZIP. Unzip the downloaded file in a folder of your choice.
  3. Open the file app.R in RStudio.
  4. Click Run App.
  5. Click Show in browser.

You’ll then be running BAT in your browser just as before, only now it’s your computer rather than the server that’s doing the computations.

Frequently asked questions
Alright, I’ll admit it: since BAT 2.0 is a brand new thing, nobody has actually asked me these questions yet. But you ought to, so I’ll put them here just in case.

Q: BAT’s not working for me – what should I do?
A: First of all, carefully follow the steps outlined above. If it’s still not working, please send me an email, attaching your csv file and a description of what the problem seems to be. I’ll see if I can help.

Q: I have growth curves measured by something else than a Bioscreen. Can I still use BAT to analyse them?
A: Absolutely. If you format your data in a csv file to look like the output from a Bioscreen, BAT can do the same type of analysis for your data.

Q: Can you add feature X or add option Y?
A: Maybe. Send me an email and we can talk about it!

Q: Can you make a custom BAT with features X and Y for my lab?
A: Possibly! Send me an email and we can talk about it.

Q: I know R and want to contribute feature X to BAT.
A: That’s a statement and not a question, but OK…! You can find the source code for BAT on GitHub. Feel free to play around with it, and if your feature X seems useful I’ll be happy to add it to the online version of BAT!

Q: I’ve run lots of Bioscreens and analysed them using BAT. I have so much data now! But I’m not sure how to proceed with further statistical analyses of it. What should I do?
A: Great to hear that you have so much data! As it happens, I’m a statistical consultant with a lot of experience with growth curve data. Why don’t you send me an email, to see if I can help?

Q: So, I hear that you’re a statistical consultant… I have all these other data that I also need help analysing!
A: Also a statement rather than a question, but never mind… Email me a description of the data you have, how you have collected it and – most importantly – why you have collected it (what is the question that you are trying to answer?). If I’m able to help, I’ll get back to you with a quote.

Speeding up R computations Pt III: parallelization

In two previous posts, I have written about how you can speed up your R computations either by using strange notation and non-standard functions or by compiling your code. Last year my department bought a 64-core computational server, which allowed me to get some seriously improved performance by using parallelization. This post is about four insights that I’ve gained from my experiments with parallel programming in R. These are, I presume, pretty basic and well-known, but they are things that I didn’t know before and at least the last two of them came as something of a surprise to me.

I’ll start this post by a very brief introduction to parallelization using the doMC package in R, in which I introduce a toy example with matrix inversions. I will however not go into detail about how to use the doMC package.  If you already are familiar with doMC, you’ll probably still want to glance at the matrix inversion functions before you skip ahead to the four points about what I’ve learned from my experiments, as these functions will occur again later in the text.

Parallel programming in R
Modern processors have several cores, which essentially are independent CPUs that can work simultaneously. When your code is parallelized, several computations are carried out in parallel on the different cores. This allows you to, for instance, run several steps of a for-loop in parallel. In R it is simple to do this using e.g. the doMC package.
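If you’re unsure how many cores you have to play with, the parallel package that ships with R can tell you (a small aside, not part of the original example):

```r
library(parallel)   # included with base R

detectCores()       # total number of logical cores on this machine
```

registerDoMC(cores = n) can then be used to tell doMC explicitly how many of them to use.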

As an example, imagine that you wish to generate and invert several random 200 by 200 matrices and save the inverted matrices in a list. Using a standard for loop, you can do this using the following function:

# A standard function:
myFunc <- function(B, size) {
   a <- vector("list", B)
   for(i in 1:B)
      a[[i]] <- solve(matrix(rnorm(size^2), size, size))
   return(a)
}

Using the doMC package, we can write a parallelized version of this function as follows:

library(doMC)  # Use doMC package
registerDoMC() # Determine number of cores to use, default = #cores/2

# A parallelized function:
myParFunc <- function(B, size) {
   foreach(i = 1:B) %dopar%
      solve(matrix(rnorm(size^2), size, size))
}

The parallelized foreach %dopar% loop always returns a list. If you are used to for-loops in which the output is manually stored in vectors or matrices, this may take some getting used to. The list structure is actually very useful, and by using unlist, data.frame and Reduce in a clever way you probably won’t have to make a lot of changes to your old code.
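As a toy illustration of that list structure and how to collapse it (the numbers here are made up for the example):

```r
library(doMC)
registerDoMC(2)

res <- foreach(i = 1:4) %dopar% i^2   # a list containing 1, 4, 9, 16

unlist(res)        # flatten into a plain numeric vector
Reduce("+", res)   # combine the list elements, here by summing: 30
```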

Comparing the two functions, we find that the parallelized function is substantially faster:

>  system.time(myFunc(1000,200))
   user  system elapsed 
 19.026   0.148  19.186 
>  system.time(myParFunc(1000,200))
   user  system elapsed 
 24.982   8.448   5.576

The improvement is even greater for larger matrices. Parallelization can save you a lot of time! Unfortunately, there are situations where it actually makes your code slower, as we will see next.

1. Parallelization is only useful when you have computation-heavy loop steps
Each parallelization has an overhead cost in that it takes some time to assign the different loop steps to the different cores. If each step of your loops is quickly executed, the overhead cost might actually be larger than the gain from the parallelization! As an example, if you wish to generate and invert several small matrices, say of size 25 by 25, the overhead cost is too large for parallelization to be of any use:

>  system.time(myFunc(1000,25))
   user  system elapsed 
  0.264   0.004   0.266 
>  system.time(myParFunc(1000,25))
   user  system elapsed 
  0.972   1.480   1.137

In this case, the parallelized function is much slower than the standard unparallelized version. Parallelization won’t magically make all your loops faster!

2. Only parallelize the outer loops
If you are using nested loops, you should probably resist the urge to replace all your for-loops by parallelized loops. If you have parallelized loops within parallelized loops, you tend to lose much of the gain from the outer parallelization, as overhead costs increase. This is a special case of point 1 above: it is probably the entire inner loop that makes each step of the outer loop computation-heavy, rather than a single step in the inner loop.

Returning to the matrix inversion example, let’s now assume that we wish to use an inner loop to generate a new type of random matrix, by repeatedly multiplying with matrices with uniformly distributed elements. The unparallelized and parallelized versions of this are as follows:

# No parallelization:
myFunc <- function(B, size) {
   a <- vector("list", B)
   for(i in 1:B) {
      myMatrix <- matrix(rnorm(size^2), size, size)
      for(j in 1:B)
         myMatrix <- myMatrix %*% matrix(runif(size^2), size, size)
      a[[i]] <- solve(myMatrix)
   }
   return(a)
}

# Outer loop parallelized:
myParFunc <- function(B, size) {
   foreach(i = 1:B) %dopar% {
      myMatrix <- matrix(rnorm(size^2), size, size)
      for(j in 1:B)
         myMatrix <- myMatrix %*% matrix(runif(size^2), size, size)
      solve(myMatrix)
   }
}

# Both loops parallelized:
myParFunc2 <- function(B, size) {
   foreach(i = 1:B) %dopar% {
      myMatrix <- matrix(rnorm(size^2), size, size) %*%
         Reduce("%*%", foreach(j = 1:B) %dopar% matrix(runif(size^2), size, size))
      solve(myMatrix)
   }
}

When the outer loop is parallelized, the performance is much better than when it isn’t:

>  system.time(myFunc(1000,25))
   user  system elapsed 
 89.305   0.012  89.381 
>  system.time(myParFunc(1000,25))
   user  system elapsed 
 46.767   0.780   5.239 

However, parallelizing the inner loop as well slows down the execution:

>   system.time(myParFunc2(1000,25))
   user  system elapsed 
970.809 428.371  51.302 

If you have several levels of nested loops, it may be beneficial to parallelize the two or three outermost loops. How far you should parallelize depends on what is going on inside each loop. The best way to find out how far you should go is to simply try different levels of parallelization.

3. More is not always merrier when it comes to cores
Using a larger number of cores means spending more time on assigning tasks to the different cores. When I tested the code for a large-scale simulation study using different numbers of cores on our 64-core number cruncher, it turned out that the code executed much faster when I used only 16 cores instead of 32 or 64. Using 16 cores was, I might add, definitely faster than using 8 cores. Then again, in another simulation, using 32 cores turned out to be faster than using 16. Less is not always merrier either.

An example where a larger number of cores slows down the execution is the following, which is a modified version of myParFunc2 above, in which the multiplication has been replaced by addition to avoid issues with singular matrices:

myParFunc3 <- function(B, size) {
   foreach(i = 1:B) %dopar% {
      myMatrix <- matrix(rnorm(size^2), size, size) %*%
         Reduce("+", foreach(j = 1:B) %dopar% matrix(runif(size^2), size, size))
      solve(myMatrix)
   }
}


> registerDoMC(cores=16)
> system.time(myParFunc3(1000,75))
    user   system  elapsed 
2095.571 1340.748  137.397


> registerDoMC(cores=32)
> system.time(myParFunc3(1000,75))
    user   system  elapsed 
2767.289 2336.450  104.773


> registerDoMC(cores=64)
> system.time(myParFunc3(1000,75))
    user   system  elapsed 
2873.555 3522.984  112.809

In this example, using 32 cores is faster than using either 16 or 64 cores. It pays off to make a small trial in order to investigate how many cores you should use for a particular parallelized function!

4. Make fair comparisons to unparallelized code
The doMC package includes the option of running foreach-loops without parallelization by simply typing %do% instead of %dopar%. This is nice as it lets you compare the performance of parallelized and unparallelized loops without having to rewrite the code too much. It is however important to know that the %do% loop behaves differently than the standard for loop that you probably would use if you weren’t going to parallelize your code. An example is given by the matrix-inversion functions from point 1 above. The %do% function for this is

    # The matrix-inversion function from point 1, with %do% in place of %dopar%:
    myParFunc2 <- function(B, size) {
        foreach(i = 1:B) %do% {
            solve(matrix(runif(size^2), size, size))
        }
    }

But this function is noticeably slower than the standard function described above:

>  system.time(myFunc(1000,25))
   user  system elapsed 
  0.264   0.004   0.266
>  system.time(myParFunc2(1000,25))
   user  system elapsed 
  0.952   0.044   0.994

When comparing parallelized and unparallelized loops, avoid %do% loops, as these make the parallelized loops seem better than they actually are.

I’m honestly not sure why there is such a large discrepancy between the two unparallelized for-loops. I posted a question on this matter on Stack Overflow last year, but to date it has not received any satisfactory answers. If you happen to know wherein the difference lies, I’d much appreciate if you would let me know.

That’s all for now. These are my four insights from playing around with parallelization in R. It is entirely possible that they contradict some well-known principles of parallel programming – I confess that I am almost completely ignorant of the literature on parallelization. Please let me know in the comments if you disagree with any of the points above or if you have counterexamples.

For more on parallelization in R, check out R bloggers!

doMC is only one of many, many packages for parallelization in R.

A reply to Testing via credible sets

Last week I posted a manuscript on arXiv entitled On decision-theoretic justifications for Bayesian hypothesis testing through credible sets. A few days later, a discussion of it appeared on Xi’an’s Og. I’ve read papers and books by Christian Robert with great interest and have been a follower of his “Og” for quite some time, and so I was honoured and excited when he chose to blog about my work. I posted a comment to his blog post, but for some reason or other it has not yet appeared on the site. I figured that I’d share my thoughts on his comments here on my own blog for the time being.

The main goal of the paper was to discuss decision-theoretic justifications for testing the point-null hypothesis Θ0={θ0} against the alternative Θ1={θ: θ≠θ0} using credible sets. In this test procedure, Θ0 is rejected if θ0 is not in the credible set. This is not the standard solution to the problem, but it is certainly not uncommon (I list several examples in the introduction to the paper). Tests of composite hypotheses are also discussed.
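To make the procedure concrete, here is a small sketch of such a test based on posterior draws (my own illustration, not code from the paper; credibleSetTest is a hypothetical name):

```r
# Reject the point-null H0: theta = theta0 if theta0 falls outside
# the central (equal-tailed) 100*(1 - alpha)% credible interval,
# estimated here from a sample of posterior draws.
credibleSetTest <- function(postDraws, theta0, alpha = 0.05) {
    ci <- quantile(postDraws, c(alpha / 2, 1 - alpha / 2))
    list(interval = ci, reject = theta0 < ci[1] || theta0 > ci[2])
}
```

Given draws from any continuous posterior for θ, the test requires no point-mass on θ0 at all.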

Judging from his blog post, Xi’an is not exactly in love with the manuscript. (Hmph! What does he know about Bayesian decision theory anyway? It’s not like he wrote the book on… oh, wait.) To some extent however, I think that his criticism is due to a misunderstanding.

Before we get to the misunderstanding though: Xi’an starts out by saying that he doesn’t like point-null hypothesis testing, so the prior probability that he would like it was perhaps not that great. I’m not crazy about point-null hypotheses either, but the fact remains that they are used a lot in practice and that there are situations where they are very natural. Xi’an himself gives a few such examples in Section 5.2.4 of The Bayesian Choice, as do Berger and Delampady (1987).

What is not all that natural, however, is the standard Bayesian solution to point-null hypothesis testing. It requires a prior with a mass on θ0, which seems like a very artificial construct to me. Apart from leading to such complications as Lindley’s paradox, it leads to very partial priors. Casella and Berger (1987, Section 4) give an example where the seemingly impartial prior probabilities P(θ0)=1/2 and P(Θ1)=1/2 actually yield a test with strong bias towards the null hypothesis. One therefore has to be extremely careful when applying the standard tests of point-null hypotheses, and think carefully about what the point-mass really means and how it affects the conclusions.

Tests based on credible sets, on the other hand, allow us to use a nice continuous prior for θ. This prior can, unlike the prior used in the standard solution, be non-informative. As for informative priors, it is often easier to construct a continuous prior based on expert opinion than it is to construct a mixed prior.

Theorem 2 of my paper presents a weighted 0-1-type loss function that leads to the acceptance region being the central (symmetric) credible interval. The prior distribution is assumed to be continuous, with no point-mass in θ0. The loss is constructed using directional conclusions, meaning that when Θ0 is rejected, it is rejected in favour of either {θ: θ<θ0} or {θ: θ>θ0}, instead of simply being rejected in favour of {θ: θ≠θ0}. Indeed, this is how credible and confidence intervals are used in practice: if θ0 is smaller than all values in the interval, then Θ0 is rejected and we conclude that θ>θ0. The theorem shows that tests based on central intervals can be viewed as a solution to the directional three-decision problem – a solution that does not require a point-mass for the null hypothesis. I therefore do not agree with Xi’an’s comment that “[tests using credible sets] cannot bypass the introduction of a prior mass on Θ0”. While a test traditionally only has one way to reject the null hypothesis, allowing two different directions in which Θ0 can be rejected seems perfectly reasonable for the point-null problem.
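In code, the directional use of a central interval amounts to something like this sketch (again my own illustration; directionalTest is a hypothetical name):

```r
# Three-decision version: reject H0 in favour of theta > theta0 or
# theta < theta0 depending on which side of the central credible
# interval theta0 falls, and do not reject if it lies inside.
directionalTest <- function(postDraws, theta0, alpha = 0.05) {
    ci <- quantile(postDraws, c(alpha / 2, 1 - alpha / 2))
    if (theta0 < ci[1]) {
        "conclude theta > theta0"
    } else if (theta0 > ci[2]) {
        "conclude theta < theta0"
    } else {
        "do not reject"
    }
}
```

Note that the function never needs a prior mass on θ0; the continuous posterior alone determines which of the three conclusions is drawn.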

Regarding this test, Xi’an writes that it “essentially [is] a composition of two one-sided tests, […], so even at this face-value level, I do not find the result that convincing”. But any (?) two-sided test can be said to be a composition of two one-sided tests (and therefore implicitly includes a directional conclusion), so I’m not sure why he regards this as a reason to remain unconvinced about the validity of the result.

As for the misunderstanding, Theorem 3 of the paper deals with one-sided hypothesis tests. It was not meant as an attempt to solve the problem of testing point-null hypotheses, but rather to show how credible sets can be used to test composite hypotheses – as was Theorem 4. Xi’an’s main criticism of the paper seems to be that the tests in Theorems 3 and 4 fail for point-null hypotheses, but they were never meant to be used for such hypotheses in the first place. After reading his comments, I realized that this might not have been perfectly clear in the first draft of the paper. In particular, the abstract seemed to imply that the paper only dealt with point-null hypotheses, which is not the case. In the submitted version (not yet uploaded to arXiv), I’ve tried to make the fact that both point-null and composite hypotheses are studied clearer.

There are certainly reasons to question the use of credible sets for testing, chief among them being that the evidence against Θ0 is evaluated in a roundabout way. On the other hand, credible sets are reasonably easy to compute and tend to have favourable properties in frequentist analysis. It seems to me that a statistician who would like a method that is reasonable in both Bayesian and frequentist inference would want to consider tests based on credible sets.