The Slow Bioinformatics Blog

Sunday 12 September 2021

Nextflow: Flowing in love with data, again and again

It is always a bit tricky to tell the story behind the development of a piece of software like Nextflow. If you have no idea what Nextflow is, do not feel alarmed. Most people in the real world have no clue either. To make it short, Nextflow is a pipeline language developed to deal with genomics data. Nextflow helps process very large datasets in a reproducible way. What kind of datasets? Well, let us say you have a cancer patient whose genome has been sequenced in order to look for cancer-driving mutations. You will need a pipeline to analyze this data. This is where Nextflow comes in handy. More and more people use it to implement the pipelines they need. They also rely on Nextflow to run the computation on a big central machine, on the cloud or even on a laptop. Today, I am not really going to explain how Nextflow works as this is all explained in the original paper. I would rather tell you a funnier story: how it all started...

For a start, I must make it clear that I did not write a single line of code in Nextflow. Such a non-contribution was an entirely new experience to me. As far as PIs in my generation go, I am still pretty hands on and I take a lot of pleasure in re-implementing my lab’s algorithm, slowly growing T-Coffee, my favorite piece of code, into some kind of convoluted monster.

But this did not happen with Nextflow, which taught me one of the most important lessons in my career: trust those who know better than you. Doing so is not simple. Many years ago, when I started as a PI, I did what I occasionally see junior PIs doing: I started drawing arrows between boxes, hand-waving on the obvious simplicity of the connectors. Not that simple, I am afraid. Reality shows that in a research lab, there is blood dripping from almost every arrow drawn between each and every box of a scientific diagram. There will always be blood, but the only way I know of to stop the hemorrhage is not to pretend to know what you do not know… Not the most valued skill these days I am afraid...

But back to the beginning… I am not really an IT guy and my first contact with virtual machines happened with some odd back-up solutions that were made possible on PC, about 20 years ago. I got fascinated with the fact that once dumped the right way, your machine could be Frankensteined back into its exact original state - you did not even need thunder. I had spent my PhD working on genetic algorithms. These things were running forever and always crashing at the worst possible time, and no matter how hard you tried to dump the status every once in a while, the resuming was always tedious, with random seeding pushing repeatability out of bounds. So the notion that your entire machine could be snapshotted and resumed anywhere felt like a miracle. Then I moved to the CRG. This was an interesting period when computation was suddenly becoming, once more, an issue in biology.

When writing “once more”, I should probably clarify what I have in mind... During most of the 90s, universities kept buying gigantic equipment, possibly because this looked like a respectable way to burn money and make a statement that you were investing hard cash in research. Hardware companies like Intel and HP thought this was indeed a great idea. Also, having the biggest computer in the universe would put you in the spotlight for a couple of weeks. To put it simply, CPU was scaling well with money and ticked all the boxes needed for a report. This is probably genuinely true in some CPU-intensive simulation domains, like meteorology or aerodynamics, but at the time this was not so clear in genomics. So when you had the university knocking on your door and asking if you could do great science with their big machines, you had to say “yes we can!”, and then you were left to scratch your head. The thing is that in those days, the computational complexity of most genomics heuristic algorithms was so bad that having machines 10 times bigger did not change things much. At the human scale, waiting 10 billion years or 100 billion years is pretty much the same thing. Memory requirements were even worse. What kind of problem can you have that eternal life would not solve!!? Well, memory allocation is one...

Of course, I am talking mostly about the usual suspects like building trees or computing big multiple sequence alignments. Database searches were a bit more fun, but the problem is so embarrassingly parallel, that most labs ended up buying cheap PCs to build their own Linux farms; and with data production following Moore’s law, it wasn’t clear at all if this would ever become an interesting problem. To put it crudely, the only sexy way for biologists to burn CPU was molecular dynamics. “We will run CHARMM” would easily unlock the gates of heaven. I am actually curious to know how many billions worth of hardware investment were justified that way.

But everything changed by the turn of the century. Large Scale Genomics happened, fuelled by high throughput sequencing. Suddenly, subtle changes in the sequencing methods meant that biologists were producing sequences hand over fist. The growth of next generation sequencing output seemed impossible to curb - and it is still looking that way. It is not only that the density of sequencing machines is outpacing Moore's law, it is also that they are getting so cheap that their number is growing exponentially as well. And when you stack exponentials, things happen...

For instance, all of a sudden, your new computer was proportionally handling less data than your previous one. This was a shock because I had become used to being able to run everything on my laptop. This little black rectangular box was significantly more powerful than Gin and Tonic, the two supercomputers rumored to have been gifted by MI5 to the EBI when I was a student there. With the new data at hand, like genomic reads or RNA-Seq datasets, simple options were falling off the table one after the other. Fortunately, the CRG was rather fine storage-wise, but it was seriously lagging behind in terms of computation, and, as it happens, Amazon was beginning to offer online virtual machines. This was hard to resist. I once discussed the idea of exploring these possibilities with Anna Tramontano who told me, “Cedric, I have just the person you need”, and Paolo Di Tommaso happened... It started very simple, a master's project I had pompously named Cloud-Coffee and whose goal was to benchmark T-Coffee on Amazon.

This is probably the only benchmark I ever did where rather than measuring sums-of-pairs or seconds, or more esoteric things, we simply measured something that would get everyone’s attention: cost in dollars. The game was to know how much it would cost to do all of our benchmarks over there. This was not an easy thing to do. The reporting tool on Amazon was a bit rough, and you had to keep reloading and quickly catch your dollar figures. The CRG was suspicious but supportive. Opening the account had been complicated because it had to be billed from a corporate card, and all these things, but it went well and we published the paper.

Then we got a bit bolder… Jia-Ming Chang was finishing his thesis in my lab. He was working on a very time-intensive benchmarking system, where you had to compute zillions of alignments and measure tiny effects. Jia-Ming had it all worked out, running like clockwork on our cluster, but Paolo and I managed to convince him that it would be cool to run all that on Amazon. Saying no was never Jia-Ming’s strongest skill and we knew that this geek at heart would be unable to resist the prospect. Then we started the computation and watched the dollars being transferred from my budget to Jeff Bezos’s account with very little happening. I mean very very little. Five thousand dollars later and with a few million three-leaf trees and one-sequence MSAs, we pulled the plug… Dependencies had killed us. The big fat instances we needed for the trees had spent their time and our money waiting for one another. I am not a real computer scientist but I know enough about parallelization to be aware that these mutual locks are the worst thing that can happen. Yet, without a proper centralized storage system, and with the inertia of popping up the instances, proper parallelization was beyond our reach on Amazon - at that time...

Paolo was hurt in his pride, I was hurt in my wallet, and Jia-Ming was knocked off his PhD deadline. We called it a day… I had another computer scientist in the lab, Maria Chatzou, who had just joined us. Maria was about to take over Jia-Ming’s work and she was understandably worried about managing massive amounts of computation. It is not that the kind of things we compute are very complicated, it is just that you have layers upon layers that you need to keep connected. Like sequences that you turn into guide trees, that you may want to replicate, that you then turn into alignments and that you turn again into trees, and you do this hundreds of thousands of times. If you are an old guy like me, you script everything in Perl, pack it with “if file already there skip computation”, and you hope that tomorrow you will still be able to remember what you had in mind with the hash table named, uh... %H5. Python helps a bit, but still, when you start doing arrays of readouts, and arrays of arrays of arrays, everything becomes so nested that any tiny change turns everything into a new project. It is a bit like in those days when you would Xerox your PhD double sided, and the machine would jam, and, you know, it’s easy because you take these half printed sheets and put them back there, upside-down and face up, and… ahem, oops no, wrong side, etc…
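For readers who never had the pleasure, the “if file already there skip computation” trick mentioned above looks roughly like the sketch below. It is written in Python rather than Perl, and the names (cached_step, the output path, the compute callback) are made up for illustration; this is not code from my lab.

```python
from pathlib import Path
from typing import Callable


def cached_step(output: Path, compute: Callable[[], str]) -> str:
    """Run `compute` only if `output` does not exist yet; otherwise reuse the file.

    Hypothetical helper: the name and signature are illustrative assumptions.
    """
    if output.exists():
        # Result already on disk: skip the computation entirely.
        return output.read_text()
    result = compute()
    # Write under a temporary name first, so that a crash never leaves a
    # half-written file that a later run would mistake for a finished result.
    tmp = output.with_name(output.name + ".tmp")
    tmp.write_text(result)
    tmp.replace(output)
    return result
```

It works nicely for one step. The trouble starts when every layer (guide trees, replicates, alignments, trees again) needs its own file naming scheme and its own skip logic, all of it kept consistent by hand.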

Makefiles would have been fine, but the computation of the dependency graph was a killer when using GNU Make, and at the time Snakemake did not exist, or we did not know about it. The main issue with makefiles is that they work out all the tasks, from start to end, before running anything. It means that Make will not start any computation if it does not know for sure where it will end. Clearly not a Mediterranean concept… The alternative to makefiles is called reactive programming. It is something that Paolo had started talking us into. The best example of reactive programming is piping in UNIX. Some data gets sent into some program, and things keep happening as long as things keep flowing in. Nothing happens before and nothing happens after. The process down the pipe has no knowledge of how much it will process, and for how long. It simply knows that if a unit of data - arbitrarily defined as a data stream with specific properties - comes in, it has to munch it and spit out, as a result, another stream of data.
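To give a feel for the idea, here is a minimal sketch of the pipe behaviour described above, using Python generators. It is only an analogy, not how Nextflow itself is implemented, and the stage names (read_sequences, reverse_complement, first_n) are invented for the example. Each stage handles one unit of data as it arrives, without knowing how many will follow.

```python
import itertools
from typing import Iterable, Iterator, List


def read_sequences(records: Iterable[str]) -> Iterator[str]:
    """Source stage: emit one cleaned-up sequence at a time, as it comes in."""
    for record in records:
        yield record.strip().upper()


def reverse_complement(sequences: Iterable[str]) -> Iterator[str]:
    """Downstream stage: react to each incoming sequence and emit a new one.

    The stage never asks how long the stream is; it just munches what arrives.
    """
    pairs = str.maketrans("ACGT", "TGCA")
    for seq in sequences:
        yield seq.translate(pairs)[::-1]


def first_n(stream: Iterable[str], n: int) -> List[str]:
    """Collect at most the first n results from a possibly endless stream."""
    return list(itertools.islice(stream, n))
```

Because nothing is computed until data flows through, the same two stages work on a short list or on an endless source such as itertools.cycle(["acgt"]), in which case first_n simply stops pulling after n items.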

There is in front of the CRG, a giant sculpture representing a headless fish. It is known as the Golden Fish by Frank Gehry and was put in place for the 1992 Olympics. I cannot remember who once told this to me, but some story goes that the fish is headless as a tribute to Barcelona, that welcomes any one coming in without any second thought or prejudice. I never found anything written or else that would corroborate this nice story, but I quite like it. And in a way the headless fish is the precise symbol of a pipe or a Nextflow process. It actually defines the scaling capacity of an unprejudiced world.

Anyway, back to the lab. A few days after Paolo gave us a live demonstration of reactive programming deployed on Amazon, I had him and Maria banging on my door and screaming that we had to develop our own pipeline language using the reactive programming paradigm. I listened carefully to their justifications, thought deeply about it and emitted a profound, succinct and precise instruction: “NO”. As a consequence, Paolo, who had always been a master at reading between the lines, immediately started working on the first Nextflow prototype. This was about the end of 2012 and a few months later, on the 9th of April 2013, Paolo put the first public release on GitHub: Nextflow 0.2.2. To be fair, I had said no because I thought developing a language would be beyond our capacity and I was afraid we would not be able to support it, but it was clear we needed something similar, and I quickly agreed that no matter how you turned it around this thing had to be a language. But being a language was not enough. It needed a special component, and that component is the concept of reactive programming it is built upon.

Within my group, the success was immediate. The reason is simple. We had developed Nextflow for ourselves, because we needed it. My lab has always been smallish, about 10 people on average, and I have never had the resources to develop grand tools. I mean tools where you write the specs, turn them into documentation and do the implementation. Paolo understood very well that Nextflow had to be simple, and usable by our kind of people, that is to say, non-professional computer scientists with a good grasp of scripting. He also knew that it had to be evolvable. This is exactly how Nextflow was designed and this is the reason why so many people started using it beyond the CRG. It took 5 minutes to install, and even less to run your first pipeline. And what it was doing, the way it was doing it, was kind of obvious. Another important trick was that it could be used as a wrapper on existing pipelines, meaning that you would simply need to wrap some Nextflow tags around your old pipeline, whatever language it was written in, and voilà! Recycling old pipelines was a non-negotiable thing for me, and I soon realised many of my colleagues were in the same situation. They had all these old albeit critical pipelines left behind by talented students whose many gifts did not include writing documentation.

But some bits were still missing, and most importantly the cloud thing was not yet solved. Our next step would be creating a cluster on the cloud. This was not trivial because all these machines were not meant to share disk or memory. They were all somehow independent and Paolo kept digging out new innovative solutions, all equally ineffective: it was either working smoothly for a few minutes after hours of set-up, or elegantly crashing after a few minutes of set-up… It was around that time that Paolo brought Docker into our lives. For the record, he did this over a year before Google announced that Docker was a thing. This was all very confidential. In 2014, we published our first Docker paper. I remember being asked by the CRG faculty what had been my most influential paper of that year and mentioning this one for the highlights. My suggestion was politely ignored.

Now, I have to be fair to the CRG. Even though it is not entirely clear the institution understood the real aim of the project, they supported it without flinching over many years. Between 2012 and 2017, I wrote every year in my annual report that we were working on a solution for the deployment of high throughput computational pipelines on the cloud, and all these years we had little to show for it, but the support kept coming in and allowed me to pay Paolo so that he could work nearly full time on this. Given the track record and expertise of my lab, I can hardly think of any funding body that would have taken a chance on us. In fact we tried to raise some money a couple of times but never managed. This is just to say that core funding is probably the most important and instrumental source of support to generate true innovation. Without the CRG core funding I would have little to show. Grants are nice, but the project has to look exciting, and you have to be lucky to be evaluated by people who understand why you are excited. And then you have to do stuff you may not believe in any more by the time you get the money. A lot has been written on why this is a poor system, and why it is a mathematical certainty that the best project will be the hardest to fund. Core funding is different, it is about hiring the right people and giving them freedom to do stuff for which they will be accountable. It is the closest we have to random funding, which, given the random nature of research, is probably the only way to go.

Now back to us and our results. Which results? Exactly... We had little to show, at least academically speaking. This is where Evan Floden steps in. Every year I teach a course in Bologna, in Rita Casadio and Pier Luigi Martelli’s master's program. And then my wife and I come back with two more kilograms (each) and a master's student. In 2015, it was Evan’s turn. Even by the unusually high standards of this master's program you could tell that Evan was going to be special. He already had papers with Sean Eddy and the Rfam gang and was curious about absolutely everything. Every year the CRG runs an international competitive PhD call, and in his year Evan came out as the top ranked candidate, which allowed him to be funded by the La Caixa Foundation, like Maria had also managed just before him. His PhD had nothing to do with Nextflow, but he rapidly got hooked on it, especially the games with virtual machines and moving computation around places. If there is one thing I am proud of having achieved in my lab, it is fluidity across projects, with anyone feeling free to chip in on anyone else's project. Of course you have to be strict on who owns what and so on, but having a collaborative rather than competitive environment makes life so much more pleasant on a daily basis - and so much easier...

Before Evan, Nextflow was a cool technical project that we planned to publish in some good solid no-thrills journal. Then one evening the whole lab went out for pizza on Paral·lel, a large avenue of Barcelona, and the obvious place to discuss parallel processing... Nearly two years into the COVID era, I realize how strange such an evocation may feel to many of us, but yes, having a pizza with your group used to be possible, and I hope it is again when these lines are being read. Anyway, we were having a pizza and Evan said, with his distinctive Kiwi intonation, “BTW, I have been playing with Nextflow and Kallisto, did you know the results are different across machines?” The PhD supervisor in me immediately replied “Zis are RRandom seed geneRRators fluctuation Heavan”. “No”, he replied, “the results are reproducible on a given machine, but then they change depending on the operating system”. I was quite impressed, but he had saved the best bit for the end: “but if you dockerize your computation, everything becomes reproducible again, even on AWS, Amazon Web Services”. I was stunned, and suggested we look at this first thing the next morning. And he was entirely right. All things being rigorously equal: your data, your software, its version, etc, tiny variations were occurring across distributions. This was nothing huge, typically 1-2% variations, but if you do not expect them to occur, such variations are project killers. Imagine yourself trying to debug the pipeline of the paper you submitted 6 months ago, and being unable, as a start, to reproduce exactly your original result, simply because the cluster has been updated… Also, imagine that the same analysis of your exome points at two different target genes in two different hospitals…, and on and on… It was immediately clear to us that this was potentially a huge story, and that it was time to write it up.

The Nextflow paper is not just any paper, it is the first high impact paper that I managed to drive without being a simple hitch-hiker. The writing of this paper has probably been one of the most intense and exciting experiences of my academic life. We hear a lot about the DORA declaration on the assessment of science and all these things these days, and I am sure the people behind this mean well. In fact I agree with many of their positions and their implications, but not all of them. Especially when it comes to high impact factor journals. Writing papers for these journals requires developing a different set of writing skills. I understand many of the reasons why people are annoyed with these vanity papers, but I see it a different way. It takes about two to three times longer to write a paper for a 30+ IF journal, especially if you are not a native speaker. You write these things with a razor. They have to be short, sharp, impactful and correct. It is as close as scientific writing can get to poetry - oops, here we go again on my linguistic obsessions... It was not the first time I was playing this game, but it was the first time I knew I had a real strong hand, and all of us were determined to pull it through, no matter how much pulling would be required.

What I am trying to say here is that you will only go for it if you think it is worth it, and this self evaluation is essential to provide Science with a sense of perspective. Not everything we do is equally relevant to the big picture. Some things carry more weight, and it is very important for anyone new to the field, especially the students, to rapidly grasp the big picture. If all papers were published on a fully flat hierarchy, then it would be hard to really figure out what matters. Of course, social networks, and “likes”, and stuff could help, but this would be just re-creating high impact with an increased social component. Some will argue that the social component is more transparent than the editor's world. That may be true, but at the same time, COVID-19 has just exposed how sensitive these things are to manipulation. If news were alleles, fake news would be the fittest - by far.

Now back to the Nextflow paper. It lists six authors but in reality, the main one is missing and not even acknowledged: Google Docs. I have no Google shares, but I humbly admit that without Google Docs the writing would never have been possible. This paper was 100% collective writing. We started with a rough draft and rewrote each and every sentence together. The beauty of a shared doc is that it happens in real time, everyone in front of their OWN computers (i.e. not sitting next to your supervisor's keyboard!) in a room, or over the phone. You have editing fights. I had to write quite a few times “DON’T TOUCH THIS SENTENCE”. But the most rewarding thing is the possibility to tune it so that everyone reads the same thing. And everybody can contribute, even the most junior members of the team, who may find it harder to edit sentences or improve them. They can still argue that something does not make sense to them. It is not something to discuss, it is something to fix. I have now written 4 papers this way. Each of them was an amazingly intense process leading to what I consider masterpieces in my career along with some of the most pleasant group bonding experiences you can generate.

Getting the paper accepted was still a long process. If I recall well, it took over 6 months, going back and forth. Pleading against rejection. I think we rewrote it almost from scratch a couple of times and I saved a total of 27 versions with countless edits in between. But eventually we made it through and the resulting sense of collective victory was overwhelming. When it happened, Nextflow already had its own life supported by a vibrant community of enthusiasts for whom Paolo was acting as a grand master.

It had been clear from the start that Nextflow was an engineering project and that its future was in between academia and industry. I was therefore not totally surprised when Maria and Pablo Prieto, two of my PhD students, announced that they wanted to incorporate a company, named Lifebit, making use of Nextflow, but when Paolo and Evan did the same with Seqera Labs I realized this was becoming a trend. Both companies are doing amazingly well these days, and Seqera Labs has just successfully launched Nextflow Tower, a powerful computation controller. They have also just raised $5.5M while securing three grants from the Chan Zuckerberg Initiative. Lifebit is just as successful and its sharp focus on AI led to it being recently awarded a major genomics contract in Hong Kong.

Those who know me are well aware that my core interests have always been much more academic than business oriented, but I am immensely proud of our work. Thanks to our collective work, a very technical breakthrough whose normal fate could have been to become an obscure addendum at the bottom of a mat and methods section has received the light it deserved. And this exposure has contributed towards reproducibility.

Standardizing pipelines is not only about efficiency and applications. It is also about readability. Soon, all of our medical data will be processed by such pipelines. Based on your data, they will spit out a number. Sometimes this number will mean you are eligible for the miracle drug everyone is talking about, but at other times this same number will say that the drug is unlikely to work on you and that social security should rather spend its limited resources giving it to someone else. You will then be provided with an older, less effective treatment. Too often, this will be the difference between life and death.

If this happens to you, the sense of unfairness is probably unbearable. Like being punished twice while innocent. And if you do not trust the source that made this decision, the sense of unfairness will turn into boiling anger. It is only with sufficient transparency that we will engage our fellow citizens in accepting that the analyses deciding on their fate are being fair to them. Open code when it comes to genomic analysis is not only a scientific issue, it is a democratic issue. And to be of any use, open code has to be readable by people we trust, that is to say, as many people as possible. And to achieve this, the code must be readable.

This is where Phil Ewels steps in. Phil was one of the first adopters of Nextflow, at SciLifeLab, and his philosophy was that of providing the whole world with good trustworthy tools. With a group of enthusiasts he created one of the most vibrant Nextflow communities. The nf-core collection of curated pipelines they put together is one of the reasons so much has been achieved with Nextflow. If I had to pick one major success, it would be the use of Nextflow in CLIMB-Covid. These people have created in no time the most sophisticated system ever for COVID data processing and their software has analyzed about a quarter of all COVID genomics data produced these last two years. COVID is still there, but in the West, money and vaccination are turning it into a manageable curse. Vaccines were part of the solution, but they were not enough. I would argue that data processing is just as important as vaccines, and in fact, this is just what the new variants are telling us: keep processing data. Well, here we go, and the fact that these people found Nextflow a useful tool justifies all of our efforts.

Another thing that I am really happy about, as a PI and a scientist, is that we could achieve this without having to change the lab direction. Nextflow was an incidental support of the work on sequence analysis that I consider central to my academic interests. Being able to contribute to such a technical project without having to change direction was just as rewarding to me as the work I did in sociology with Lausanne university, many years ago - but that’s for another blog post. We hear the word interdisciplinarity a lot these days. To be honest, I can hardly think of a concept that has been so worn out and misused. Whenever I hear it coming from the top of the hierarchy, I just switch off, as it usually means truckloads of money poured on the wrong spot, with instrumentalized scientists pressured into advertising their interdisciplinarity, with tattoos and piercings. It never works, and the reason is simple: purpose-built interdisciplinary work is usually lame for any of the disciplines it is made of. It falls between the cracks because nobody finds it truly exciting. On the contrary, incidental interdisciplinary work cannot fail. It comes out of chance and necessity. I was not especially interested in IT. In fact, I don’t even like computers, I like algorithms, and the closer I get to hardware and its plumbing, the more I feel that I am back to my teen years, having to tweak my moped again and again. Something I was never good at. But I needed Nextflow, my lab needed it and I had no choice but to hire the best IT person I could find and get to explore the IT world with them. I got very lucky to find similarly minded people in my lab. I would re-do the entire journey any time.

Wednesday 26 July 2017

Save Our Silos - A case against forced interdisciplinarity

More collaborations! We have to destroy data silos! Walls must be torn down! Barriers removed!!! Now that we have just entered the Trump era, my innate sense of opposition makes it hard to resist the open-field ideology. After all, the orange clown wants the contrary, does he not? If Batman’s nemesis wants to build walls, shall we not invest in Semtex? Still, my experience is that any rhetoric easy to flow is usually flawed. Especially when it is hard to disagree with. Think of Miss Universe sobbing her “peace in the world” usual wish - who would even think of disagreeing? And who would give it a second thought… Unfortunately, Trump does not hold the monopoly for stupid ideas. That would be too easy...

So, is full blown collaboration good for research? Well, given that you are reading this blog and that some of you have, hopefully, read some of my papers, and the other way round, I guess the answer is yes. Collaboration and exchanges are good, not only good, they are essential for good research. That does not need to be proven because it is a condition for survival. Period. Now the remaining question of course is, how much of it? Does every project have to be a collaboration, 50%, 10%? Is there a magic number, does it depend on people, places, time of the day, moon phases, Chinese horoscope year?

In a very successful southern institute that I was recently surveying, and whose three letter name I will keep to myself, five lines of Perl chomping over a Scopus dump revealed a staggering 80% level of collaboration between programs. Yet some of the researchers were complaining about the management's recurrent mantra on “insufficient collaborations, and drastic steps that should be implemented to increase interdisciplinarity and collaboration between programs”. When all the buzzwords pop up in the same sentence, I usually sense red warning lights sweeping through my neurones like measles patterns. I apologize for being so biased against the obvious. It’s part of my job.

The thing with collaborations is that they are easy to explain to politicians, and, let us be clear about it, collaborations lead to harmonization that leads to improved productivity, that leads to increased wealth. Politicians, and statesmen alike, understand this well and it is hard to curb their enthusiasm at the prospect of being re-elected for successful economic reforms. If you are in production mode you want everything to be harmonized, and walls to be destroyed. This is why big companies become even bigger - there is a big corpus of theory behind this and everybody agrees it makes a lot of sense for all the stakeholders even - sometimes - the consumers.

Only kidding… I have to admit that making better drugs, better phones and better everything is, indeed, a prospect I find hard to frown upon. I like engineers to do good stuff. Over the years, managers have used all their linguistic skills to have us calling this innovation. A new phone is innovation, a new drug is innovation, a new car color is innovation. You name it... No! Not even! You don’t have time to name it and it’s already innovation. Through an ingenious semantic shift these smart innovation leaders (i.e. those who name innovation faster than the rest) have even led us to believe in corporate research, the (in)famous R&D. Does it work? Does R&D really create novelty, and if it does, is it the result of increased collaboration and communication?

Well, let's take the simple analogy of cinema. On this page (http://www.filmsite.org/boxoffice2.html) are listed the top grossing movies across the whole 20th century. I have picked two decades, the 60s and ours:

We could argue for a long time about whether the old flicks were more creative, etc, etc, and anyone could handwave his or her own way, but there is something that simply cannot be denied. The 2010 decade is entirely dominated by sequels: Star Wars 8 (!), Iron Man 3, Toy Story 3, Furious 7, etc, etc. Overall, out of the 10 most successful movies of the current decade, all are sequels, 9 of them with an index higher than 2... Just have a look at the 60s for a comparison. Not one frame of repackaged stuff! All these movies were brand new bold ideas. Mary Poppins, My Fair Lady, Doctor Zhivago. Not a single one that most cinephiles would not consider a small jewel in its own category. Very good movies in which most characters have the elegance of dying in the end as both a token of respect for real life, and a gesture of defiance towards sequels.

Yes, not only have things become much more expensive, but the level of creativity has dramatically fallen. This sort of thing happens when you break all walls, get everybody in the same big swimming pool and start heating things up: you make soup, and not a good one. It seems that breaking walls works much better when it comes to manufacturing goods than when it comes to fostering diversity. Diversity and uniformisation do not really go together. If you want to convince yourself, just go on a field trip to your favorite tropical spot, put on a snorkel and look! Where are the rainbow fish, Nemo and his friends, or this weird seven-and-a-half-legged octopus munching on an anemone? In the sandy open space, waiting to be chomped? Or in the intricate collections of caves, tunnels and chambers carved by coral? Of course sharks have long argued that these divisions should disappear and give way to a more rational organization of their food supply, but... Where does diversity thrive? In the open or in the fragmented? Then again, where would you set out to cultivate your oysters? Here you go. Productivity versus variety.

Of course, politicians could have asked geneticists. They would have told them that in a population, the probability for a new neutral mutation to get fixed is inversely proportional to Ne, the effective population size (https://en.wikipedia.org/wiki/Fixation_(population_genetics)).

What this means is that the larger your population, the harder it becomes to create stable diversity. If you want diversity to arise and make its way, you need many small independent populations. Surprising but obvious if you think about it for a minute. Just take your five favorite small countries and compare their combined diversity - of any kind - with any larger country having their combined population size… Think Europe versus the US…
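For readers who would rather see this than take a geneticist's word for it, here is a minimal simulation sketch. It assumes the textbook Wright-Fisher model of a diploid population and a strictly neutral mutation that starts as a single copy, in which case theory predicts a fixation probability of 1/(2N). The function name and parameters are made up for this illustration, and the estimates will wobble a little around the theoretical values.

```python
import random

def fixation_fraction(pop_size, trials, seed=0):
    """Estimate how often one new neutral mutation takes over a diploid
    population of `pop_size` individuals (2 * pop_size gene copies)."""
    rng = random.Random(seed)
    copies = 2 * pop_size
    fixed = 0
    for _ in range(trials):
        count = 1  # the mutation starts as a single copy
        while 0 < count < copies:
            freq = count / copies
            # Wright-Fisher step: each gene copy of the next generation
            # picks its parent at random from the current generation.
            count = sum(rng.random() < freq for _ in range(copies))
        fixed += (count == copies)
    return fixed / trials

small = fixation_fraction(pop_size=10, trials=4000)
large = fixation_fraction(pop_size=50, trials=4000)
print(f"N=10: {small:.3f} (theory {1 / 20:.3f})")
print(f"N=50: {large:.3f} (theory {1 / 100:.3f})")
```

The small population should let the newcomer win roughly five times more often than the large one, which is the whole point of the argument.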

Of course, said this way, it is hard not to long for US style uniformity, especially when you have a computer to plug in... but now let us switch to your Google life. You wake up in the morning. You have had this weird idea in the shower about CRISPR re-engineering of the sabertooth kangaroo and your spine is shivering at the translational prospects of this innovative project. You are already putting together your address to the Nobel committee, with a tongue-in-cheek bundling of a few subliminal messages to Trump and Brexit. But wait!!! You google a bit and find that some second rate scientist, from some obscure university, has stolen your brainwave before you even had it! Adding misery to injury, the dude has made a mess of it and published it in the Annals of Improbable Marsupials. Thunder is gone, it’s only mist and steam from the shower you left running... Back to reality and well structured work, take off this bow tie.

Had you not known about it, you may have polished your idea, possibly re-inventing the wheel, possibly finding an alternative solution that would have resulted in new possibilities. At best you would have been eaten alive by your creature, but most likely would have wasted your time, or, with a very very low probability, you would have made an amazing breakthrough and changed the face of kangaroos forever.

Alas… Now that the law states that the tires of re-invented wheels must be punctured every second day of the week, working this way has become impossible. In our efficient world there is no room for redundant ideas. Yes, if brains were species, increased communication would increase their effective population size. For many things, like Wikipedia, this is just great. For others, like the emergence of novelty, it is the neuronal equivalent of the Cretaceous extinction, a perfect ecological wipeout.

So what does this have to do with inter-disciplinary collaborations? Let’s put it this way: when you collaborate, you exchange and harmonize ideas, and your community’s effective size becomes larger. By many measures this is great as it allows bolder projects - like the human genome - and brings in new ideas, like speech recognition HMMs chewing human genomes.

Such cross-fertilizations constitute one of the engines of scientific progress. Yet, at the same time, larger communities make it harder for new ideas to emerge. Journals, reviewers, community, Twitter. They all think they understand everything and make sure any attempt at novelty gets squashed in its early days. On average they are quite right… or are they? The thing with novelty is that it is not an average phenomenon, it is a spark that eludes any prediction. And who would care… Unfortunately, novelty turns out to be the other engine of scientific progress. And yes, with two engines going in opposite directions, well, you need a strong cable, and you need to make sure each engine keeps pulling. This is why we need a good healthy tension between globalization and fragmentation. One cannot go around claiming one of these is the solution.

So what shall we do? I think I have the right answer because my answer is not even an answer. In the Trump era, at a time when both the pro and the anti know what is good for everybody, one should be wary of simple solutions. The only thing that I remember from my history classes is that whenever some character with a mustache, a beard or a feather on her cap claims he or she knows what will make everybody happier, anything between one and a hundred million people die. With very convincing and charismatic leaders, one could probably go a bit over this figure, and the future looks really bright if you support massive primate extinction… only kidding, this is not a primate thing and there is no reason other species should miss out on the fun...

So what is good for basic research? Difficult to tell. Basic research is a very fragile eco-system. It produces little, in a highly unpredictable way, but when it does, it changes everything. In biology, the two groundbreaking shifts about to re-shape our lives, and probably the genetic makeup of our species, can be traced to very specific, not so collaborative, and hard to fund projects. One is the restriction enzymes that opened up the era of biotechnologies and the other one is the CRISPR mechanism. No matter the amount of story telling later built upon these things, the first one had to do with a scientist so obsessed with restricted growth in bacteria that he ended up studying it with money allocated for a different purpose (read Arber’s own, most likely not edulcorated, account of this)... The other was the brainchild of a Spanish microbiologist inhabited enough by his trade to escape from Almeria’s beach every once in a while and check his computer’s output. Neither of these things was really planned, neither was easy to fund, neither was branded as interdisciplinary, neither would be funded today. No milestones, no interdisciplinarity, no future. Only ultra self driven scientists.

When these stories are told at conferences, the big cheeses usually chuckle, implying with a laugh that these undeniable exceptions can safely be ignored and that originality should give way to the grand plans of these great men. We should resist, quietly and stubbornly, as we should resist any oversold idea. But what kind of resistance? I do not want inter-disciplinarity to go away. In fact, as a scientist, I cannot even imagine my life without the excitement of new adventures in fields unknown to me. I love these escapades because they are complicated to organize. My first interdisciplinary affair was with Lausanne’s Social Science department, studying life trajectories as if they had been genomic sequences. Later on, my group collaborated with Mara Dierssen teaching mice how to swim, and our recent Nextflow production results from an immersion in the IT world. In fact, looking back, just about everything I put together is made of the unnatural fitting of intellectual objects. None of these things were ever properly funded - you know, grants with milestones and deliverables.

So what’s wrong? Well it’s very simple: I do it because of the unstoppable urge to follow the few obsessions I have had for a long time as a scientist. I don’t care about the system, I only care about my internal drive. I have no major claim that anything useful will come out of my work, but I know that it is people like me who create novelty. Old style obsessed scientists with absolutely no interest in buzz words. I am not saying that this breed of mammal is the only thing we need - remember it is an eco-system. I am simply claiming that without them there will be no novelty, only engineering. Of course in a majority of cases these novelties will be useless, and sometimes their thunder will be stolen by more agile members of the community - or not. But who really cares? As Francisco Mojica puts it, finding the pattern of conserved sequences on his computer screen was the happiest day of his scientific life. A tiny, highly intimate emotion to him, a change to come for mankind.

Saturday 6 August 2016

T-Coffee Reloaded

The last time I looked it up - an hour ago - our original T-Coffee paper had 3602 citations on Scopus. I used to think this was a lot, until Nature ran this story on the Kilimanjaro height of scientific publications. The most cited paper is at 300,000... It gives all these numbers some kind of perspective I guess. I am not a huge fan of modern metrics and I usually find it difficult to stay awake in front of Hollywood blockbusters ranked by their box-office gross product, like Batman 25 or Superman 2^6, yet I tend to think popularity and quality are simply not correlated as opposed to being mutually exclusive. This rigidly self-enforced open mindedness allows me to consider my highly and poorly cited papers as equally good - or bad when I am not in the right mood…

How you get that kind of citation level for a method paper is not a straight road, in fact it is not a road at all, more of an accident and a fall or anything related to bumping your head in the dark and waking up in hospital. I thought it might be interesting for younger scientists to get an idea of how these things happen and why no time should be wasted on planning them. The truth is that even a story I should know perfectly, like T-Coffee, turns out to be riddled with speculative patches when tracing back how things seem to have really happened.

For those who have no clue what T-Coffee does, it is a multiple sequence aligner. It means that it takes a bunch of biological sequences - typically proteins - that have evolved from a common ancestor by accumulating mutations, insertions and deletions. Aligning them involves putting in the same column - aligning - the amino acids (represented as letters) that were already present in the common ancestor, as shown in the picture below. The rest of the positions - those not homologous across all the sequences - get padded with null symbols (-) we call gaps - just like the ‘mind the gap’ in London’s tube. Said this way it looks pretty simple, but it turns out that it is one of these computational problems that cannot be solved exactly in any reasonable amount of time - period. Computer people call them NP-Complete. These problems are good fun because as far as solutions are concerned, anything goes, just like Niagara Falls stunt contraptions. And trust me, over these last 20 years, anything has gone... It is hard to think of any optimisation algorithm - no matter how crazy it may sound - that has not been thrown in the face of the multiple alignment problem. From Simulated Annealing to Genetic Algorithms, Tabu Search and probably many more I have never heard of. T-Coffee is one of them. Why do we care so much about these multiple sequence alignments? Because they can be a useful starting point to infer most things that matter in Biology, from evolutionary trees down to enzyme active site analysis. This explains why the papers describing these methods are among the most cited - not only in Biology but in Science in general.

T-Coffee started with another multiple aligner named Dialign, or to be more precise an earlier paper by Burkhard Morgenstern, in PNAS, about gap penalty free alignments. It came out just when I was finishing my PhD at the European Bioinformatics Institute. I really liked Burkhard’s paper. I was especially impressed with the concept he named overlapping weights. I don’t want to go into anything technical here, but these weights were smart because they allowed all the sequences to talk together while being aligned, for a tiny extra computation cost. I liked that and spent a few unsuccessful nights re-implementing the concept in a quick-and-dirty way. I failed and moved on with my main project of the time, which was to get a program computing alignments through in-silico sexual activity (aka a genetic algorithm). But the idea - I mean the Dialign idea - lingered on and four years later it was still in my mind when I eventually implemented T-Coffee, and combined Burkhard’s weights with the ClustalW progressive algorithm. Said this way it looks pretty straightforward, but things are a bit more complicated and my take on this has been a major source of - friendly - disagreement with Burkhard, who insisted many times that the two approaches are very different.

If CRISPRism was to become a trend, this would be the exact opposite. Two scientists arguing to establish their non-paternity of a method - “we stole your ideas!” “No you did not and we will resist any attempt of you saying you did!”. Half kidding... the aligners’ world is very civilized. Of course, Burkhard has a few good points, especially when going down to the fine grain details, but it does not change the fact that I had the overlapping weights in mind when designing T-Coffee. I find this a great showcase of how alternative realities can coexist - even (especially?) in science. And no, I am not attacking Led Zep. I want to believe they were acting in good faith.

Another thing that makes T-Coffee a very average research project is that it did not start as a shiny clinking idea that I would have had in my bathtub, or, worse, while writing a grant. Quite the opposite: T-Coffee was originally a bug. At the time I was evaluating alignments by comparing them with other alignments and I somehow messed up the file names and ended up running unintended comparisons. Readouts were very good, the kind of very good I find very suspicious as a PI. With such results it was either instant fame or else. Taking care of the else factor resulted in the usual degraded performance and shattered dreams of fame and Science Magazine covers. I remember coming home that night on my red moped - registered with diplomatic plates thanks to EMBL’s international status - and sadly chewing on my midnight kebab. Scientific failure is never healthy - neither is success by the way, too many sulfites. On the following day I did the right thing. I insist on this because I do not recall doing the right thing very often in my life, but that day I did. I carefully traced back and figured out why it had looked so good for a while. It turned out that the suspiciously informative comparisons had been made against collections of pairwise alignments. It’s like taking all the sequences, aligning them two by two, and checking on the agreement against a full multiple sequence alignment. This is the precise moment T-Coffee was born and it has not changed much since then.
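For the curious, the "align them two by two and check the agreement" idea can be sketched in a few lines. This is a simplified, unweighted illustration written for this post, not the published COFFEE function (which may weight the pairs differently); the function names and the toy sequences are invented for the example. Each alignment is turned into the set of residue pairs it places in the same column, and the score is the fraction of pairs from the pairwise collection that the multiple alignment reproduces.

```python
def aligned_pairs(row_a, row_b):
    """Return the set of (i, j) residue indices placed in the same column."""
    pairs = set()
    i = j = 0  # positions in the ungapped sequences
    for a, b in zip(row_a, row_b):
        if a != "-" and b != "-":
            pairs.add((i, j))
        if a != "-":
            i += 1
        if b != "-":
            j += 1
    return pairs

def consistency_score(msa, pairwise_library):
    """Fraction of residue pairs in the pairwise library found in the MSA.

    msa: dict mapping sequence name -> gapped row of the multiple alignment
    pairwise_library: dict mapping (name_a, name_b) -> (gapped_a, gapped_b)
    """
    agree = total = 0
    for (name_a, name_b), (pw_a, pw_b) in pairwise_library.items():
        library_pairs = aligned_pairs(pw_a, pw_b)
        msa_pairs = aligned_pairs(msa[name_a], msa[name_b])
        agree += len(library_pairs & msa_pairs)
        total += len(library_pairs)
    return agree / total if total else 0.0

# Toy data: three short sequences and their pairwise alignments.
msa = {"s1": "AC-GT", "s2": "ACAGT", "s3": "A--GT"}
library = {
    ("s1", "s2"): ("AC-GT", "ACAGT"),
    ("s1", "s3"): ("ACGT", "A-GT"),
    ("s2", "s3"): ("ACAGT", "-A-GT"),  # disagrees with the MSA on one pair
}
score = consistency_score(msa, library)
print(score)  # 9 of the 10 library pairs are reproduced
```

A multiple alignment that agrees with most of what the pairwise alignments say scores close to 1. Used this way the score only evaluates an alignment; it does not say how to build one.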

At the time it was not T-Coffee. It was called something obscure. When searching for a name, I could not resist a libertarian quick fix by coming up with the silliest acronym I could fit. It became “Consistency Objective Function For alignmEnt Evaluation”; the E was a bit of a cheat - I know and I don’t apologize. I then met in Greece, at the ISMB, with Des Higgins, my PhD supervisor. I would usually come to Des with crazy ideas of neural networks coupled with genetic algorithms. Des took advantage of these discussions to teach me everything I needed to learn as a future PhD supervisor: “Do it!”, “Could be...”, “I don’t know”, “Have you looked it up in the literature?”. But on that special occasion his face lit up and he immediately liked it. Even though he may have forgotten about this, his encouragement on that specific day remained the driving force of what was to become a lonnnng and mostly unsuccessful project. This was the summer of 1997, T-Coffee was already a year old.

I had to finish my PhD in a rush because EMBL likes its PhD program to look efficient - as does the EU - and I wrote a big fuzzy paper that read more like the leftover of a not so smart but much luckier Galois. As one should have expected - I did not - it was smoothly rejected by NAR and eventually published in Bioinformatics - the journal formerly known as CABIOS. I like to think that the paper’s acceptance had nothing to do with me harassing Barbara Cox - the secretary, or Chris Sander - the editor - who were my office neighbours. Then again there are many other good things I like to think about myself. In any case, the COFFEE paper, which now has about 130 citations, is the ancestor of T-Coffee as acknowledged by 130 gourmet bioinformaticians. It contains most of the original ideas about the new ways of evaluating alignments we had come up with. If we all have a paper we are secretly more fond of - one that we find more personal - then COFFEE would be mine.

The time had come to turn this idea into a usable aligner, which COFFEE was not. COFFEE was a way of evaluating alignments, we call this an objective function, but it did not tell you how to build the alignment. For this, I had been using a genetic algorithm that was marginally faster than a return journey between Oxford and Cambridge after British Rail privatisation - in case all you know about British trains comes from Sherlock Holmes, well, times have changed... I really had to get something faster, and scavenging the ClustalW algorithm turned out to be the best option - sorry Julie. I spent my last months at the EBI coding that stuff. This was intense enough that most people I knew in Cambridge thought I was already gone, or dead, or something. A comment by a friend, on the day I was leaving England to defend my PhD in France, perfectly captured the whole process of wrapping up a PhD: “I wonder if this patch of hair on your forehead will ever grow back”. It did not.

The new T-Coffee was a highly sophisticated piece of code - euphemism for terrible - and I used most of my time in Switzerland in the group of Philipp Bucher to recode it. Philipp - and his funding agency - had been under the impression I would come to work on the Prosite database. Over the few months I spent there, not only did I not do anything on Prosite - except messing up a few hyperlinks - but I also cannibalised Philipp’s attention to get suggestions on T-Coffee. For instance, he is the one who came up with the idea of combining local and global alignments. Why is he not on the paper? The only reason I can think of is that the project took three more years, three more labs, and three more countries to finalize. By that time, I had entirely lost track of who had done what to whom and vice versa (what? lab book? Are we supposed to have one of these?). I regret it but find some consolation in the idea that no one will be dropped out of a Nobel because of this.

I left Switzerland for England with a first version of T-Coffee that was happily allocating the whole amount of RAM available in the UK at that time. This original version of T-Coffee is kept in a secret vault and considered a national security hazard. I will say no more. Fortunately, I then joined the group of Jaap Heringa in Willy Taylor’s program at the NIMR. There I got myself in the best possible environment to clean up T-Coffee. It has been a long time since I visited Mill Hill and I am not sure what is left of it - I know that Paul Nurse’s Cricky ambitions have shuffled things there quite a bit - but at the time the MRC was a place to drink a lot of coffee in the afternoon, beer on site in the local pub, talk a lot and be a scientist the way you had dreamt of becoming a scientist - in an absurdly unassuming way, entirely captured by Michael Green and his monkey-based protein models. It took a good year more, and there we were, Des, Jaap and myself, fiddling with the first manuscript draft. We had it very clear: give me Science Magazine or give me death.

As we all know, the cool thing with these big journals is that you get a fast rejection and move on, but if you don’t get a quick rejection, things get exciting. So we got excited, but for the wrong reason. Indeed I had left the MRC the day after submission and had taken a new appointment in France, in the lab of Jean Michel Claverie. Two months after submission we still had not received any rejection letter and I was beginning to browse Champagne millesimes. Unfortunately, the rejection letter had simply enjoyed a wet British summer, resting on my former desk... Yes, you know who you are if you read this, but thanks anyway for giving us so much hope and expectation over an entire summer...

This letter has long been lost and is not part of the file that I have posted here, nor is the Nature rejection. I seem to remember we tried Nature but I honestly could not find any trace of any failed attempt… Then again I do not have enough storage space for all my rejection letters... Our next best step was to be PNAS. And there things got hairy. It has been a long time since I submitted to PNAS, but in those days it was horrendous. You had to print things on American-sized paper - a rare commodity in Europe. Then you had to use an esoteric formula to estimate your word counts while measuring figure sizes and margins, using some ratio of transcendental numbers for the final correction. Nothing was electronic and a typical submission would take you a couple of days while wasting paper worth about an acre of rainforest in the dry season.

But finally it was gone. It stayed there a couple of months until early December when I received this cryptic fax from PNAS.

That was just before Xmas and not the best time of the year to start running complex analyses, but this was my chance, my break, my day, my year! I jumped into it with all the energy you have when below 30. I think we did a pretty good job at answering the reviews, but Xmas is a bad time and the editor almost immediately rejected our paper while inviting us to re-submit. Unfortunately I have lost that one as well. We re-submitted in the first days of February and the paper was smoothly and permanently rejected by PNAS in early March. Looking back with my current experience, I think we should have fought a bit harder… Still I got mad about it, decided to dump everything in some archive and forget about it. It is Jaap who managed to convince me to keep it alive and go for JMB. Janet Thornton handled the manuscript and that was the smoothest ride any of my papers ever had. And that was it. T-Coffee was published. It came out on the 8th of September 2000.

For most projects, that’s where things stop, and then you move on. The 192 hours of teaching that French assistant professors owe the state quickly got all this research nonsense out of my mind. The next big milestone came two weeks later. It was an email with “T-Coffee” as subject and “Lipman, David (NIH/NLM/NCBI) [E]” in the “from” line. Yes, the MAN himself. He was sending me a very polite e-mail asking things about T-Coffee. David Lipman was asking me questions about T-Coffee! I have to stress this one more time: Mr Blast was interested in half roasted T-Coffee. I have had other epiphanies but this one was intense enough to fry the hairpiece I had received as a PhD viva gift. We exchanged a bit more and David eventually invited me to visit the NCBI for a couple of weeks. I would need another long blog entry to describe the visit to the holy temple of bioinformatics.

Among the many things that happened while visiting the NCBI, one took place that entirely changed the T-Coffee citation fate. Eugene Koonin. Most young scientists looking for Eugene’s papers probably think that this must be a very common name among Russian biologists. They probably assume an army of Eugene Koonin-s. Well I have some news for you. There is only one and yes he has done everything and the rest and a bit more. And it gets even more confusing if you consider that he is also a pretty normal human being - that is as far as bioinformaticians go of course... Visiting Eugene was too good to be true but it got even better when I realized he had the same Silicon Graphics Indy workstation on which I was developing T-Coffee (nice blue boxes). This matters because T-Coffee was - poorly - written in C and was only stable on this precise machine. I installed it on his machine, and ran T-Coffee on a couple of datasets. Eugene liked the alignments. I then bumped into David on the stairs, who asked me how things were. I mentioned Eugene liking T-Coffee alignments. “He liked them!?” There was a mixture of suspicion, excitement and admiration in his tone.

Then I went back to my teaching in France and gradually forgot about all this. It took a good year for the first citation to come. It was Nick Grishin. Then Eugene and Aravind - the two most famous domain hunters - took to using T-Coffee on a regular basis, and everything started. If Koonin and Aravind are using an aligner, who would be crazy enough to use another one? Biology, especially wet lab, is pretty much a cooking exercise. You take a recipe and follow it line by line. If at line 5 the Chef says “Take T-Coffee”, then you take T-Coffee, not because you would blindly follow anyone, but because you know the Chef is good - you have eaten their food before.

One thing to know about methods papers is that they always start slowly and it often takes more than three years to get the first 30 citations. This is the reason why big journals don’t care about us - we don’t contribute much to the Holy Impact Factor. But once a method gets going, it can really rock, so be patient.

Well, I guess the time has come to wrap up this little piece with some elements for the edification of young scientists. Well, let me think... First, don’t eat too much kebab or at least stay away from the variety loaded with french fries; secondly, insist on getting fireproof hairpieces; thirdly, make sure influential people know about your work - even if they eat kebab with you. We may like it or not but things in biology are very relative - social networks matter as much as gels do. Yes, there are people you trust and those you don’t. This is probably even more true now that automated metrics compilation systems can be fooled by anyone. With fake papers, fake reviewers, fake results, the only thing that’s left to us is reputation as supported by those with whom we get drunk. The rest is just damp octopus, squid and squib.