Tuesday 3 May 2011

Have we passed the prediction peak? The Fate of Theory in the In-Silvo Era



As a student, I had only one dream: to solve all the grand challenges of modern biology with a computer. I confess this ambition does not really set me apart from most bioinformaticists. I also confess my success does not set me apart either. I was utterly convinced that protein folding was merely a few clever equations away, and that upgrading my 200 MHz Pentium processor would do the trick. In my young and innocent mind there was not a shred of doubt that gene prediction would be cracked as soon as we had the right weights for the right neural network. Like everybody, I gasped when Hidden Markov Models were introduced, fearing that these devilishly clever mathematical creatures would not leave a single question intact, gulping down everything that was keeping me entertained, from gene prediction to protein folding. The truth is that in those days computers were slow enough that you could daydream while they were running (I am talking about a time when an e-mail could take two days to cross the Atlantic). Many of these pipe dreams materialized into pipelines of limited use… We had a lot of fun though…

The first turn-off, for many of us, was protein-protein interaction. Gosh! Suddenly our clever predictive contributions were made redundant, and by what? An ancient (5-year-old!) wet-lab technique coupled with robots. A perfect humiliation. Oh, we did not like it! How many coffee breaks we spent complaining about the low accuracy, the low specificity, the taste, the color of massive two-hybrid screens! Few of us realized at that time (I did not) that we were merely resisting the end of an era. The end of scarce data. The end of in-silico biology. I am not suggesting in-silico biology has been a failure, but we were clearly off the mark on many of the things we were so excited about. In the vision most of us shared at that time, experiments would gradually become redundant, replaced with sophisticated sets of equations: readouts at the tip of the return key. In reality, exactly the opposite happened. The experimental output of biology has increased by several orders of magnitude in less than 5 years. Add 3 zeros to your paycheck and think how life would change…

Indeed, while many of us were writing longer equations and shoving extra states into Hidden Markov Models, biologists were eagerly jumping on the bandwagon of the most formidable technological shift since cloning: high-throughput sequencing. They had simply become tired of waiting for us and had stumbled on this surprising finding: with the right equipment, generating data is easier than generating models. Before we could catch our breath, the biology we had lived in was gone. Today, 90% of the labs I know buy 90% of their data. They buy it from their core facility and will soon get everything from Amazoogle. In many areas of biology, data has stopped being a limiting factor. In all other areas, scientists are struggling to reformulate their questions and fit them into the NGS framework. Ironically, many of us doing computational biology have not yet realized what has happened. Indeed, on an everyday basis the job is still the same: same computers, same algorithms, same questions. The first hint I received that things had changed was when our sys admin suggested building a mid-size nuclear plant to power the new file system. The second hint was my wet-lab neighbors leaving 2 TB USB disks on my desk with a post-it note saying “Can you have a look?”. I did, and I now do this every day. But to be honest, not so much has changed for me. It is just that the magic code I used to run, chewing predictions and turning them into yet more useless predictions, is now over-fed with real data and is churning out experiment-based models. You realize the difference it makes when suddenly you are allowed to send your computer output to respectable journals, those that once made it a policy to reject any result “merely based on computational analysis”.

I call this in-silvo biology, that is to say in-silico biology domesticated to deal with data collected in-vivo. Some will argue that once we have enough data we can turn biology into theoretical physics. That may be so, but we will need to find a way to curb the impatience of biologists (a pill?) and this obsession they have that fresh data can cure every problem, from cancer to tenure track. In my humble opinion there is no way back to theory (at least for a while), and if you want to convince yourself, do what I just did: search Google Scholar for “predictions+biology”, and you will find that the prediction peak was back in 2006. If you believe, as I do, that Popper cannot be falsified, then you can only conclude things won’t look too good for theory, at least for a while. Why guess when you can look? said the blind man to the short-sighted…

Cédric Notredame