Wednesday, 26 July 2017

Save Our Silos - A case against forced interdisciplinarity

More collaborations! We have to destroy data silos! Walls must be torn down! Barriers removed!!! Now that we have just entered the Trump era, my innate sense of opposition makes it hard to resist the open-field ideology. After all, the orange clown wants the contrary, does he not? If Batman’s nemesis wants to build walls, shall we not invest in Semtex? Still, my experience is that any rhetoric that flows too easily is usually flawed. Especially when it is hard to disagree with. Think of Miss Universe sobbing her usual “peace in the world” wish - who would even think of disagreeing? And who would give it a second thought… Unfortunately, Trump does not hold the monopoly on stupid ideas. That would be too easy...


So, is full-blown collaboration good for research? Well, given that you are reading this blog and that some of you have, hopefully, read some of my papers, and the other way round, I guess the answer is yes. Collaboration and exchanges are not only good, they are essential for good research. That does not need to be proven because it is a condition for survival. Period. Now the remaining question, of course, is how much of it? Does every project have to be a collaboration? 50%? 10%? Is there a magic number? Does it depend on people, places, time of the day, moon phases, Chinese horoscope year?


In a very successful southern institute that I was recently surveying, and whose three-letter name I will keep to myself, five lines of perl chomping over a scopus dump revealed a staggering 80% level of collaboration between programs. Yet some of the researchers were complaining about the management’s recurrent mantra on “insufficient collaborations, and drastic steps that should be implemented to increase interdisciplinarity and collaboration between programs”. When all the buzzwords pop up in the same sentence, I usually sense red-light warnings sweeping through my neurones like measles patterns. I apologize for being so biased against the obvious. It’s part of my job.
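For the curious, the count itself is nothing fancy. Here is a minimal sketch of the same idea - in Python rather than perl - assuming a hypothetical CSV export from scopus with one row per paper and a column listing the programs of its authors (the file name and the column name are invented for the example):

import csv
from collections import Counter

# Hypothetical scopus export: one row per paper, with a "programs" column
# listing the institute programs of the co-authors, separated by semicolons.
counts = Counter()
with open("scopus_dump.csv", newline="") as handle:
    for row in csv.DictReader(handle):
        programs = {p.strip() for p in row["programs"].split(";") if p.strip()}
        counts["inter-program" if len(programs) > 1 else "single-program"] += 1

total = sum(counts.values())
for label, n in counts.items():
    print(f"{label}: {n} papers ({100 * n / total:.0f}%)")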


The thing with collaborations is that they are easy to explain to politicians, and, let us be clear about it, collaborations lead to harmonization, which leads to improved productivity, which leads to increased wealth. Politicians and statesmen alike understand this well, and it is hard to curb their enthusiasm at the prospect of being re-elected for successful economic reforms. If you are in production mode you want everything to be harmonized, and walls to be destroyed. This is why big companies become even bigger - there is a big corpus of theory behind this and everybody agrees it makes a lot of sense for all the stakeholders, even - sometimes - the consumers.


Only kidding… I have to admit that making better drugs, better phones and better everything is, indeed, a prospect I find hard to frown upon. I like engineers to do good stuff. Over the years, managers have used all their linguistic skills to have us call this innovation. A new phone is innovation, a new drug is innovation, a new car color is innovation. You name it... No! Not even! You don’t have time to name it and it’s already innovation. Through an ingenious semantic shift these smart innovation leaders (i.e. those who name innovation faster than the rest) have even led us to believe in corporate research, the (in)famous R&D. Does it work? Does R&D really create novelty, and if it does, is it the result of increased collaboration and communication?


Well, let's take the simple analogy of cinema. On this page (http://www.filmsite.org/boxoffice2.html) are listed the top grossing movies, decade by decade. I have picked two decades, the 60s and ours:
[Table from filmsite.org: the top-grossing films of the 1960s and of the 2010s]
We could argue a long time about whether the old flicks were more creative, etc, etc, and anyone could handwave his or her own way, but there is something that simply cannot be denied. The 2010s are entirely dominated by sequels: Star Wars 8 (!), Iron Man 3, Toy Story 3, Furious 7, etc, etc. Overall, out of the 10 most successful movies of the current decade, all are sequels, 9 of which with an index higher than 2... Just have a look at the 60s for a comparison. Not one frame of repackaged stuff! All these movies were brand new, bold ideas. Mary Poppins, My Fair Lady, Doctor Zhivago. Not a single one that most cinephiles would not consider a small jewel in its own category. Very good movies in which most characters have the elegance of dying in the end, as both a token of respect for real life and a gesture of defiance toward sequels.


Yes, not only have things become much more expensive, but the level of creativity has dramatically fallen. This sort of thing happens when you break all the walls, get everybody in the same big swimming pool and start heating things up: you make soup, and not a good one. It seems that breaking walls works much better when it comes to manufacturing goods than to fostering diversity. Diversity and uniformisation do not really go together. If you want to convince yourself, just go on a field trip to your favorite tropical spot, put on a snorkel and look! Where are the rainbow fish, Nemo and his friends, or this weird seven-and-a-half-legged octopus munching on an anemone? In the sandy open space, waiting to be chomped? Or in the intricate collection of caves, tunnels and chambers carved by the coral? Of course sharks have long argued that these divisions should disappear and give way to a more rational organization of their food supply, but... Where does diversity thrive? In the open or in the fragmented? Then again, where would you set up to cultivate your oysters? Here you go. Productivity versus variety.


Of course, politicians could have asked geneticists. They would have told them that, in a population, the probability for a new mutation to get fixed is inversely proportional to Ne, the effective population size - roughly 1/(2Ne) for a neutral mutation (https://en.wikipedia.org/wiki/Fixation_(population_genetics)).


What this means is that the larger your population, the harder it becomes for any single new variant to take over and create stable diversity. If you want diversity to arise and make its way, you need many small independent populations. Surprising, but obvious if you think about it for a minute. Just take your five favorite small countries and compare their combined diversity - of any kind - with that of any single larger country of the same population size… Think Europe versus the US…
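If you prefer simulations to formulas, a minimal Wright-Fisher sketch makes the point - not real population-genetics software, just a toy in Python: drop one new neutral mutant into a haploid population and count how often it takes over (the population sizes and replicate numbers below are arbitrary).

import numpy as np

rng = np.random.default_rng(42)

def fixation_rate(pop_size, n_replicates=2000):
    # Fraction of replicates in which a single new neutral mutant ends up
    # taking over a Wright-Fisher population of the given (haploid) size.
    fixed = 0
    for _ in range(n_replicates):
        count = 1  # one brand new mutant
        while 0 < count < pop_size:
            count = rng.binomial(pop_size, count / pop_size)
        fixed += count == pop_size
    return fixed / n_replicates

for size in (10, 100, 1000):
    # hovers around 1/size: the bigger the pool, the rarer the fixation
    print(size, fixation_rate(size))

With many small populations, different mutants get fixed in different places; with one big pool, almost nothing new ever fixes.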




Of course, said this way, it is hard not to long for US-style uniformity, especially when you have a computer to plug in... but now let us switch to your google life. You wake up in the morning. You have had this weird idea in the shower about the CRISPR re-engineering of sabertooth kangaroos and your spine is shivering at the translational prospects of this innovative project. You are already putting together your address to the nobel committee, with a tongue-in-cheek bundling of a few subliminal messages to Trump and Brexit. But wait!!! You google a bit and find that some second-rate scientist, from some obscure university, has stolen your brainwave before you even had it! Adding insult to injury, the dude has made a mess of it and published it in the Annals of Improbable Marsupials. The thunder is gone; it’s only mist and steam from the shower you left running... Back to reality and well-structured work; take off that bow tie.


Had you not known about it, you might have polished your idea, possibly re-inventing the wheel, possibly finding an alternative solution that would have opened up new possibilities. At best you would have been eaten alive by your creature; most likely you would have wasted your time; or, with a very, very low probability, you would have made an amazing breakthrough and changed the face of kangaroos forever.


Alas… Now that the law states that the tires of re-invented wheels must be punctured every second day of the week, working this way has become impossible. In our efficient world there is no room for redundant ideas. Yes, if brains were species, increased communication would increase their effective population size. For many things, like Wikipedia, this is just great. For others, like the emergence of novelty, it is the neuronal equivalent of the Cretaceous extinction, a perfect ecological wipeout.


So what does this have to do with inter-disciplinary collaborations? Let’s put it this way: when you collaborate, you exchange and harmonize ideas, and your community’s effective size becomes larger. By many measures this is great, as it allows bolder projects - like the human genome - and brings in new ideas, like speech-recognition HMMs chewing human genomes.

Such cross-fertilizations constitute one of the engines of scientific progress. Yet, at the same time, larger communities make it harder for new ideas to emerge. Journals, reviewers, the community, twitter. They all think they understand everything and make sure any attempt at novelty gets squashed in its early days. On average they are quite right… or are they? The thing with novelty is that it is not an average phenomenon; it is a spark that eludes any prediction. And who would care… Unfortunately, novelty turns out to be the other engine of scientific progress. And yes, with two engines going in opposite directions, well, you need a strong cable, and you need to make sure each engine keeps pulling. This is why we need a good healthy tension between globalization and fragmentation. One cannot go around claiming that either one of these is the solution.


So what shall we do? I think I have the right answer because my answer is not even an answer. In the Trump era, at a time when both the pro and the anti know what is good for everybody, one should be wary of simple solutions. The only thing that I remember from my history classes is that whenever some character with a mustache, a beard or a feather on his or her cap claims to know what will make everybody happier, anything between one and a hundred million people die. With very convincing and charismatic leaders, one could probably go a bit over this figure, and the future looks really bright if you support massive primate extinction… Only kidding, this is not a primate thing and there is no reason other species should miss out on the fun...


So what is good for basic research? Difficult to tell. Basic research is a very fragile eco-system. It produces little, in a highly unpredictable way, but when it does, it changes everything. In biology, the two groundbreaking shifts about to re-shape our lives, and probably the genetic makeup of our species, can be traced to very specific, not-so-collaborative, hard-to-fund projects. One is the restriction enzymes that opened up the era of biotechnologies, and the other is the CRISPR mechanism. No matter the amount of storytelling later built upon these things, the first had to do with a scientist so obsessed with the restricted growth of phages in bacteria that he ended up studying it with money allocated for a different purpose (read Arber’s most likely not sugar-coated own account of this)... The other was the brainchild of a Spanish microbiologist inhabited enough by his trade to escape from Almeria’s beach every once in a while and check his computer’s output. None of these things were really planned, none of these things were easy to fund, none of them were branded as interdisciplinary, none of them would be funded today. No milestones, no interdisciplinarity, no future. Only ultra-self-driven scientists.


When these stories are told at conferences, the big cheeses usually chuckle, implying with a laugh that these undeniable exceptions can safely be ignored and that originality should give way to the grand plans of these great men. We should resist, quietly and stubbornly, as we should resist any oversold idea. But what kind of resistance? I do not want inter-disciplinarity to go away. In fact, as a scientist, I cannot even imagine my life without the excitement of new adventures in fields unknown to me. I love these escapades because they are complicated to organize. My first interdisciplinary affair was with the Lausanne Social Science department, studying life trajectories as if they had been genomic sequences; later on, my group collaborated with Mara Dierssen, teaching mice how to swim, and our recent Nextflow production results from an immersion in the IT world. In fact, looking back, just about everything I do is made of the unnatural fitting together of intellectual objects. None of these things were ever properly funded - you know, grants with milestones and deliverables.


So what’s wrong? Well, it’s very simple: I do it because of the unstoppable urge to follow the few obsessions I have had for a long time as a scientist. I don’t care about the system, I only care about my internal drive. I make no major claim that anything useful will come out of my work, but I know that it is people like me who create novelty. Old-style obsessed scientists with absolutely no interest in buzzwords. I am not saying that this breed of mammal is the only thing we need - remember, it is an eco-system. I am simply claiming that without them there will be no novelty, only engineering. Of course, in a majority of cases these novelties will turn out to be useless; sometimes their thunder will be stolen by more agile members of the community - or not. But who really cares? As Francisco Mojica puts it, finding the pattern of conserved sequences on his computer screen was the happiest day of his scientific life. A tiny, highly intimate emotion for him, a change to come for mankind.






Saturday, 6 August 2016

T-Coffee Reloaded

The last time I looked it up - an hour ago - our original T-Coffee paper had 3602 citations on scopus. I used to think this was a lot, until Nature ran this story on the Kilimanjaro heights of scientific publications. The most cited paper is at 300,000... It gives all these numbers some kind of perspective, I guess. I am not a huge fan of modern metrics and I usually find it difficult to stay awake in front of Hollywood blockbusters ranked by their box-office gross, like Batman 25 or Superman 2^6, yet I tend to think popularity and quality are simply not correlated, as opposed to being mutually exclusive. This rigidly self-enforced open-mindedness allows me to consider my highly and poorly cited papers as equally good - or bad, when I am not in the right mood…

How you get to that kind of citation level for a method paper is not a straight road; in fact it is not a road at all, more of an accident and a fall, or anything related to bumping your head in the dark and waking up in hospital. I thought it might be interesting for younger scientists to get an idea of how these things happen and why no time should be wasted planning them. The truth is that even a story I should know perfectly, like T-Coffee, turns out to be riddled with speculative patches when tracing back how things really happened.

For those who have no clue what T-Coffee does, it is a multiple sequence aligner. It means that it takes a bunch of biological sequences - typically proteins - that have evolved from a common ancestor by accumulating mutations, insertions and deletions. Aligning them involves putting in the same column - aligning - the amino acids (represented as letters) that were already present in the common ancestor, as in the toy example below. The rest of the positions - those not homologous across all the sequences - get padded with null symbols (-) we call gaps - just like the ‘mind the gap’ in London’s tube. Said this way it looks pretty simple, but it turns out to be one of these computational problems that cannot be solved both exactly and efficiently - period. Computer people call them NP-Complete. These problems are good fun because, as far as solutions are concerned, anything goes, just like Niagara Falls stunt contraptions. And trust me, over these last 20 years, anything has gone... It is hard to think of any optimisation algorithm - no matter how crazy it may sound - that has not been thrown in the face of the multiple alignment problem. From Simulated Annealing to Genetic Algorithms, Tabu Search and probably many more I have never heard of. T-Coffee is one of them. Why do we care so much about these multiple sequence alignments? Because they can be a useful starting point to infer most things that matter in Biology, from evolutionary trees down to enzyme active-site analysis. This explains why the methods describing them are among the most cited - not only in Biology but in Science in general.
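A made-up toy example gives the idea - three invented mini-proteins, with homologous residues stacked in the same column and gaps marked with ‘-’:

seq1   MK-TAYIAKQR
seq2   MKWTAYI--QR
seq3   MK-TAYLA-QR

Every column is a claim of homology; every ‘-’ marks a position where one sequence has lost a residue, or the others have gained one.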
T-Coffee started with another multiple aligner named Dialign, or to be more precise with an earlier paper by Burkhard Morgenstern, in PNAS, about gap-penalty-free alignments. It came out just when I was finishing my PhD at the European Bioinformatics Institute. I really liked Burkhard’s paper. I was especially impressed with the concept he named overlapping weights. I don’t want to go into anything technical here, but these weights were smart because they allowed all the sequences to talk together while being aligned, for a tiny extra computation cost. I liked that and spent a few unsuccessful nights re-implementing the concept in a quick-and-dirty way. I failed and moved on with my main project of the time, which was to get alignments computed through in-silico sexual activity (aka a genetic algorithm). But the idea - I mean the Dialign idea - lingered on, and four years later it was still in my mind when I eventually implemented T-Coffee and combined Burkhard’s weights with the ClustalW progressive algorithm. Said this way it looks pretty straightforward, but things are a bit more complicated, and my take on this has been a major source of - friendly - disagreement with Burkhard, who has insisted many times that the two approaches are very different.

If CRISPRism was to become a trend, this would be the exact opposite. Two scientists arguing to establish their non-paternity of a method - “we stole your ideas!” “No you did not, and we will resist any attempt of yours to say you did!”. Half kidding... the aligner world is very civilized. Of course, Burkhard has a few good points, especially when going down to the fine-grain details, but it does not change the fact that I had the overlapping weights in mind when designing T-Coffee. I find this a great showcase of how alternative realities can coexist - even (especially?) in science. And no, I am not attacking Led Zep. I want to believe they were acting in good faith.

Another thing that makes T-Coffee a very average research project is that it did not start as a shiny clinking idea that I would have had in my bathtub, or, worse, while writing a grant. Quite the opposite: T-Coffee was originally a bug. At the time I was evaluating alignments by comparing them with other alignments, and I somehow messed up the file names and ended up running unintended comparisons. The readouts were very good, the kind of very good I find very suspicious as a PI. With such results it was either instant fame or else. Taking care of the else factor resulted in the usual degraded performance and shattered dreams of fame and Science Magazine covers. I remember coming home that night on my red moped - registered with diplomatic plates thanks to the EMBL international status - and sadly chewing on my midnight kebab. Scientific failure is never healthy - neither is success, by the way, too many sulfites. On the following day I did the right thing. I insist on this because I do not recall doing the right thing very often in my life, but that day I did. I carefully traced back and figured out why it had looked so good for a while. It turned out that the suspiciously informative comparisons had been made against collections of pairwise alignments. It’s like taking all the sequences, aligning them two by two, and checking how well a full multiple sequence alignment agrees with those pairs. This is the precise moment T-Coffee was born, and it has not changed much since then.
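T-Coffee itself is written in C, so the following is only a minimal Python sketch of the idea, on invented toy sequences: collect the residue pairs matched by a library of pairwise alignments, then score a multiple alignment by the fraction of its own aligned pairs that the library agrees with.

def aligned_pairs(seq_ids, rows):
    # Return the set of residue pairs matched by an alignment.
    # A residue is identified as (sequence id, index of the residue in its sequence).
    pairs = set()
    positions = [0] * len(rows)  # residues seen so far in each sequence
    for column in zip(*rows):
        filled = []
        for i, symbol in enumerate(column):
            if symbol != "-":
                filled.append((seq_ids[i], positions[i]))
                positions[i] += 1
        for a in range(len(filled)):
            for b in range(a + 1, len(filled)):
                pairs.add(frozenset((filled[a], filled[b])))
    return pairs

# A toy library of pairwise alignments (invented sequences and alignments)
library = set()
library |= aligned_pairs(["A", "B"], ["MK-TAYI", "MKWTAYI"])
library |= aligned_pairs(["A", "C"], ["MKTAYI", "MKSAYI"])
library |= aligned_pairs(["B", "C"], ["MKWTAYI", "MK-SAYI"])

# The multiple alignment being evaluated
msa = aligned_pairs(["A", "B", "C"], ["MK-TAYI", "MKWTAYI", "MK-SAYI"])

# COFFEE-like score: fraction of the MSA's aligned pairs supported by the library
print(f"consistency score: {len(msa & library) / len(msa):.2f}")

Roughly speaking, that fraction is the objective function that later became COFFEE; the real thing also weighs each pair by how reliable its pairwise alignment looks, but the principle is the same.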

At the time it was not T-Coffee. It was called something obscure. When searching for a name, I could not resist a libertarian quick fix and came up with the silliest acronym I could fit. It became “Consistency Objective Function For alignmEnt Evaluation” - the E was a bit of a cheat, I know, and I don’t apologize. I then met Des Higgins, my PhD supervisor, at the ISMB in Greece. I would usually come to Des with crazy ideas of neural networks coupled with genetic algorithms. Des took advantage of these discussions to teach me everything I needed to learn as a future PhD supervisor: “Do it!”, “Could be...”, “I don’t know”, “Have you looked it up in the literature?”. But on that special occasion his face lit up and he immediately liked it. Even though he may have forgotten about this, his encouragement on that specific day remained the driving force of what was to become a lonnnng and mostly unsuccessful project. This was the summer of 1997; T-Coffee was already a year old.

I had to finish my PhD in a rush because EMBL likes its PhD program to look efficient - as does the EU - and I wrote a big fuzzy paper that read more like the leftovers of a not-so-smart but much luckier Galois. As one should have expected - I did not - it was smoothly rejected by NAR and eventually published in Bioinformatics - the journal formerly known as CABIOS. I like to think that the paper’s acceptance had nothing to do with me harassing Barbara Cox - the secretary - or Chris Sander - the editor - who were my office neighbours. Then again, there are many other good things I like to think about myself. In any case, the COFFEE paper, which now has about 130 citations, is the ancestor of T-Coffee, as acknowledged by 130 gourmet bioinformaticians. It contains most of the original ideas about the new ways of evaluating alignments we had come up with. If we all have a paper we are secretly more fond of - one that we find more personal - then COFFEE would be mine.

The time had come to turn this idea into a usable aligner, which COFFEE was not. COFFEE was a way of evaluating alignments - we call this an objective function - but it did not tell you how to build the alignment. For this, I had been using a genetic algorithm that was marginally faster than a return journey between Oxford and Cambridge after British Rail privatisation - in case all you know about British trains comes from Sherlock Holmes, well, times have changed... I really had to get something faster, and scavenging the ClustalW algorithm turned out to be the best option - sorry Julie. I spent my last months at the EBI coding that stuff. This was intense enough that most people I knew in Cambridge thought I was already gone, or dead, or something. A comment by a friend, on the day I was leaving England to defend my PhD in France, perfectly captured the whole process of wrapping up a PhD: “I wonder if this patch of hair on your forehead will ever grow back”. It did not.

The new T-Coffee was a highly sophisticated piece of code - a euphemism for terrible - and I used most of my time in Switzerland, in the group of Philipp Bucher, to recode it. Philipp - and his funding agency - had been under the impression that I would come to work on the Prosite database. Over the few months I spent there, not only did I not do anything on Prosite - except messing up a few hyperlinks - but I also cannibalised Philipp’s attention to get suggestions on T-Coffee. For instance, he is the one who came up with the idea of combining local and global alignments. Why is he not on the paper? The only reason I can think of is that the project took three more years, three more labs, and three more countries to finalize. By that time, I had entirely lost track of who had done what to whom and vice versa (what? lab book? Are we supposed to have one of these?). I regret it, but find some consolation in the idea that no one will be dropped from a Nobel because of this.

I left Switzerland for England with a first version of T-Coffee that would happily allocate all the RAM available in the UK at that time. This original version of T-Coffee is kept in a secret vault and considered a national security hazard. I will say no more. Fortunately, I then joined the group of Jaap Heringa in Willy Taylor’s program at the NIMR. There I got myself into the best possible environment to clean up T-Coffee. It has been a long time since I last visited Mill Hill and I am not sure what is left of it - I know that Paul Nurse’s Cricky ambitions have shuffled things there quite a bit - but at the time the MRC was a place to drink a lot of coffee in the afternoon and beer on site in the local pub, talk a lot and be a scientist the way you had dreamt of becoming a scientist - in an absurdly under-assuming way, entirely captured by Michael Green and his monkey-based protein models. It took a good year more, and there we were, Des, Jaap and myself, fiddling with the first manuscript draft. We had it very clear: give me Science Magazine or give me death.

As we all know, the cool thing with these big journals is that you get a fast rejection and move on; but if you don’t get a quick rejection, things get exciting. So we got excited, but for the wrong reason. Indeed, I had left the MRC the day after submission and had taken a new appointment in France, in the lab of Jean-Michel Claverie. Two months after submission we still had not received any rejection letter and I was beginning to browse Champagne vintages. Unfortunately, the rejection letter had simply been enjoying a wet British summer, resting on my former desk... Yes, you know who you are if you read this, but thanks anyway for giving us so much hope and expectation over an entire summer...

This letter has long been lost and is not part of the file that I have posted here; neither is the Nature rejection. I seem to remember we tried Nature, but I honestly could not find any trace of a failed attempt… Then again, I do not have enough storage space for all my rejection letters... Our next best step was to be PNAS. And there things got hairy. It has been a long time since I last submitted to PNAS, but in those days it was horrendous. You had to print things on American-sized paper - a rare commodity in Europe. Then you had to use an esoteric formula to estimate your word count while measuring figure sizes and margins, using some ratio of transcendental numbers for the final correction. Nothing was electronic, and a typical submission would take you a couple of days while wasting paper worth about an acre of rainforest in the dry season.

But finally it was off. It stayed there a couple of months until early December, when I received this cryptic fax from PNAS.
[The cryptic fax from the PNAS editorial office]
That was just before Xmas and not the best time of the year to start running complex analyses, but this was my chance, my break, my day, my year! I jumped into it with all the energy you have when you are below 30. I think we did a pretty good job of answering the reviews, but Xmas is a bad time and the editor almost immediately rejected our paper while inviting us to re-submit. Unfortunately, I have lost that one as well. We did so in the first days of February and the paper was smoothly and permanently rejected by PNAS in early March. Looking back with my current experience, I think we should have fought a bit harder… Still, I got mad about it, decided to dump everything in some archive and forget about it. It is Jaap who managed to convince me to keep it alive and go for JMB. Janet Thornton handled the manuscript and that was the smoothest ride any of my papers ever had. And that was it. T-Coffee was published. It came out on the 8th of September 2000.

For most projects, that’s where things stop, and then you move on. The 192 hours of teaching that French assistant professors owe to the state quickly got all this research nonsense out of my mind. The next big milestone came two weeks later. It was an email with “T-Coffee” as the subject and “Lipman, David (NIH/NLM/NCBI) [E]” in the “from” line. Yes, the MAN himself. He was sending me a very polite e-mail asking things about T-Coffee. David Lipman was asking me questions about T-Coffee! I have to stress this one more time: Mr BLAST was interested in half-roasted T-Coffee. I have had other epiphanies, but this one was intense enough to fry the hairpiece I had received as a PhD viva gift. We exchanged a bit more and David eventually invited me to visit the NCBI for a couple of weeks. I would need another long blog entry to describe the visit to the holy temple of bioinformatics.

Among the many things that happened while visiting the NCBI, one took place that entirely changed T-Coffee’s citation fate. Eugene Koonin. Most young scientists looking up Eugene’s papers probably think that this must be a very common name among Russian biologists. They probably assume an army of Eugene Koonins. Well, I have some news for you. There is only one, and yes, he has done everything, and the rest, and a bit more. And it gets even more confusing if you consider that he is also a pretty normal human being - that is, as far as bioinformaticians go, of course... Visiting Eugene was too good to be true, but it got even better when I realized he had the same Silicon Graphics Indy workstation on which I was developing T-Coffee (nice blue boxes). This matters because T-Coffee was - poorly - written in C and was only stable on this precise machine. I installed it on his machine and ran T-Coffee on a couple of datasets. Eugene liked the alignments. I then bumped into David on the stairs, and he asked me how things were. I mentioned Eugene liking T-Coffee alignments. “He liked them!?” There was a mixture of suspicion, excitement and admiration in his tone.

Then I went back to my teaching in France and gradually forgot about all this. It took a good year for the first citation to come. It was Nick Grishin. Then Eugene and Aravind - the two most famous domain hunters - took to using T-Coffee on a regular basis, and everything started. If Koonin and Aravind are using an aligner, who would be crazy enough to use another one? Biology, especially wet lab, is pretty much a cooking exercise. You take a recipe and follow it line by line. If at line 5 the Chef says “Take T-Coffee”, then you take T-Coffee, not because you would blindly follow anyone, but because you know the Chef is good - you have eaten their food before.

One thing to know about methods papers is that they always start slowly; it often takes more than three years to get the first 30 citations. This is the reason why the big journals don’t care about us - we don’t contribute much to the Holy Impact Factor. But once a method gets going, it can really rock, so be patient.




Well, I guess the time has come to wrap up this little piece with some elements for the edification of young scientists. Well, let me think... First, don’t eat too much kebab, or at least stay away from the variety loaded with french fries; secondly, insist on getting fireproof hairpieces; thirdly, make sure influential people know about your work - even if they eat kebab with you. We may like it or not, but things in biology are very relative - social networks matter as much as gels do. Yes, there are people you trust and those you don’t. This is probably even more true now that automated metrics-compilation systems can be fooled by anyone. With fake papers, fake reviewers and fake results, the only thing that’s left to us is reputation, as vouched for by those with whom we get drunk. The rest is just damp octopus, squid and squib.




Tuesday, 3 May 2011

Have we passed the prediction peak? The Fate of Theory in the In-Silvo Era



As a student, my only dream was to solve all the grand challenges of modern biology with a computer. I confess this ambition does not really set me apart from most bioinformaticists. I also confess my success does not set me apart either. I was utterly convinced that protein folding was merely a few clever equations away, and that upgrading my 200 MHz Pentium processor would do the trick. In my young and innocent mind there was not a shred of doubt that gene prediction would be cracked as soon as we had the right weights for the right neural network. Like everybody, I gasped when Hidden Markov Models were introduced, fearing that these devilishly clever mathematical creatures would not leave a single question intact, gulping down everything that was keeping me entertained, from gene prediction to protein folding. The truth is that in those days computers were slow enough that you could daydream while they were running (I am talking about a time when an e-mail could take two days to cross the Atlantic). Many of these pipe-dreams materialized into pipelines of limited use… We had a lot of fun though…

The first turn-off, for many of us, was protein-protein interaction. Gosh! Suddenly our clever predictive contributions were made redundant - and by what? An ancient (5 years old!) wet-lab technique coupled with robots. A perfect humiliation. Oh, we did not like it! How many coffee breaks were spent complaining about the low accuracy, the low specificity, the taste, the color of massive two-hybrid screens. Few of us realized at the time (I did not) that we were merely resisting the end of an era. The end of scarce data. The end of in-silico biology. I am not suggesting in-silico biology has been a failure, but we were clearly off the mark on many of the things we were so excited about. In the vision most of us shared at that time, experiments would gradually become redundant, replaced with sophisticated sets of equations - readouts at the tip of the return key. In reality, exactly the opposite happened. The experimental output of biology has increased by several orders of magnitude in less than 5 years. Add 3 zeros to your paycheck and think how life would change…

Indeed, while many of us were writing longer equations and shoving extra states into Hidden Markov Models, biologists were eagerly jumping on the bandwagon of the most formidable technological shift since cloning: high-throughput sequencing. They had simply become tired of waiting for us and had stumbled on this surprising finding: with the right equipment, generating data is easier than generating models. Before we could catch our breath, the biology we had lived in was gone. Today, 90% of the labs I know buy 90% of their data. They get it from their core facility and will soon get everything from Amazoogle. In many areas of biology, data has stopped being a limiting factor. In all other areas, scientists are struggling to reformulate their questions and fit them into the NGS framework. Ironically, many of us doing computational biology have not yet realized what has happened. Indeed, on an everyday basis the job is still the same: same computers, same algorithms, same questions. The first hint I received that things had changed was when our sysadmin suggested building a mid-size nuclear plant to power the new file system. The second hint was my wet-lab neighbors leaving 2 TB USB disks on my desk with a post-it note saying “Can you have a look?”. I did, and I am doing this every day now. But to be honest, not so much has changed for me. It is just that the magic code I used to run, chewing predictions and turning them into yet more useless predictions, is now over-fed with real data and is churning out experiment-based models. You realize the difference it makes when you are suddenly allowed to send your computer output to respectable journals, those that once made it a policy to reject any result “merely based on computational analysis”.

I call this in-silvo biology, that is to say in-silico biology domesticated to deal with data collected in-vivo. Some will argue that once we have enough data we can turn biology into theoretical physics. That may be so, but we will need to find a way to curb the impatience of biologists (a pill?) and this obsession they have that fresh data can cure every problem, from cancer to tenure track. In my humble opinion there is no way back to theory (at least for a while), and if you want to convince yourself, do what I just did: run Google Scholar for “predictions+biology”, and you will find that the prediction peak was back in 2006. If you believe - as I do - that Popper cannot be falsified, then you can only conclude that things won’t look too good for theory, at least for a while. Why guess when you can look? - said the blind man to the short-sighted…

Cédric Notredame