PolITiGenomics

Politics, Information Technology, and Genomics

Bioinformatics and cloud computing

November 24th, 2009

Between the Using Clouds for Parallel Computations in Systems Biology workshop at the recent SC09 conference and a string of recent papers and posts, everyone in bioinformatics is talking about cloud computing these days. Last week Steven Salzberg’s group published a paper on their Crossbow tool entitled Searching for SNPs with cloud computing. In the paper the authors describe how they were able to analyze the human sequence data published last year by BGI using Amazon’s Elastic Compute Cloud (EC2). Specifically, they have developed an alignment (Bowtie) and SNP detection (SOAPsnp) pipeline that is executed in parallel across a cluster using the Hadoop framework (an open-source implementation of Google’s MapReduce framework). Using a 40-node, 320-core EC2 cluster, they were able to analyze 38× coverage sequence data in about three hours. The whole analysis, including data transfer and storage on Amazon S3, cost about $125. You can find a more detailed cost breakdown and comparison in Gary Stiehr’s HPCInfo post and more detail on the SNP detection in Dan Koboldt’s MassGenomics post.
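A quick back-of-the-envelope reduction of those numbers, sketched in Python (the ~$125 total and the 320-core, three-hour run are the figures quoted above; the split between compute and transfer/storage is discussed further in the comments below):

    # Figures quoted above for the Crossbow run on EC2.
    cores = 320
    hours = 3.0
    total_cost = 125.0   # USD, including data transfer and S3 storage
    coverage = 38        # fold coverage of the human genome

    core_hours = cores * hours                                # 960 core-hours
    print(f"core-hours used:           {core_hours:.0f}")
    print(f"effective cost/core-hour:  ${total_cost / core_hours:.2f}")  # ~$0.13
    print(f"cost per fold of coverage: ${total_cost / coverage:.2f}")    # ~$3.29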

For analyzing a single genome, you really can’t beat that price. Of course, at the rate next-generation sequencing instruments are generating data, most people are not going to want to analyze just one genome. So the question becomes, what is the break-even point? That is, how many genomes do you have to sequence to make buying compute resources cheaper than renting them from Amazon? We currently estimate that the fully loaded (node, chassis, rack, networking, etc.) cost of a single computational core is about $500. Thus, purchasing 320 cores would cost you about $160,000. It’s going to take a lot of genomes (1,280) to hit that break-even point.

But do you really need to analyze a genome in three hours? With the current per-run throughput of a single Illumina GA IIx, it would take about four ten-day runs (40 days) to generate 38× coverage of a human genome. After each run, you could align the sequence data from that run. Each lane of data would take 8-12 core·hours to align, so a whole run’s (eight lanes’) worth of data would take about 80 core·hours. Therefore, even if you had just one core, you could align all the data before the next run completed. The consensus calling and variant detection portions of the pipeline typically take a handful of core·hours and therefore do not change the economics; they too can be completed before the first run of the next genome finishes. Thus, with a $500 investment in computational resources, you can more than keep pace with the Illumina instrument. Note that I am completely excluding the cost of storage, as that will be needed for the data and results regardless of where the computation is done.

Of course, you probably wouldn’t buy just one core. Checking over at the Dell web site, you can get a quad-core Precision T3500n with 4 GiB of RAM (more RAM per core than the EC2 instances used in the paper) and 750 GB of local storage (about the same storage per core as the Extra Large Instance) for $1700. You would need less than one core’s worth (25%) of that workstation’s capacity dedicated to alignment and variant detection for data from a single Illumina GA IIx (thanks to Burrows-Wheeler Transform aligners like Bowtie and BWA). Using the single-core numbers, the break-even point for purchase versus cloud is less than five whole genomes. Using the entire cost of the Dell workstation (even though you require less than 25% of its computational capacity), the break-even point is about 14 genomes. It would take about 1.5 years (about half the expected life of IT hardware) at current throughput to sequence 14 genomes with a single Illumina GA IIx. At data rates expected in January 2010, it would take less than a year to break even.
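If you want to check that arithmetic, here is the whole break-even sketch in a few lines of Python; every input is a figure quoted above, so treat it as a restatement of the reasoning rather than a pricing model:

    # Break-even sketch using only the figures quoted in this post.
    cloud_cost_per_genome = 125.0   # USD per genome on EC2 (compute + transfer + S3)
    cost_per_core = 500.0           # USD, fully loaded local cost per core
    workstation_cost = 1700.0       # USD, quad-core Dell Precision T3500n

    # Matching the paper's 320-core cluster locally:
    print(320 * cost_per_core / cloud_cost_per_genome)    # 1280 genomes to break even

    # One core keeps pace with a single GA IIx:
    align_core_hr = 4 * 8 * 10      # 4 runs x 8 lanes x ~10 core-hours per lane = ~320
    wall_clock_hr = 40 * 24         # the 40 days it takes to sequence the genome
    print(align_core_hr < wall_clock_hr)                  # True

    # Break-even for the single-core and whole-workstation scenarios:
    print(cost_per_core / cloud_cost_per_genome)          # 4.0  -> fewer than five genomes
    print(workstation_cost / cloud_cost_per_genome)       # 13.6 -> about 14 genomes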

These numbers indicate that unless you are sequencing just a few genomes, you are probably better off purchasing a (possibly single-node) cluster. With the proliferation of sequencing applications and publications in the last couple of years, not many researchers will fall into the “few genomes” bin. Our experience has been that the more sequencing data people get, the more they want. Another way to look at this is that the entire computational hardware cost for the analysis (<$1700) is less than 1% of the sequencing instrument cost; or, the computational cost to analyze a whole genome (<$500) is less than 1% of the total data generation costs (reagents, flow cells, instrument depreciation, technician time, etc.). None of this is to say that there is no place for cloud and other distributed computing frameworks in bioinformatics, but that’s the topic of a future post.
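Read the other way around, those two “less than 1%” statements imply lower bounds on the costs they are being compared against; a two-line check using only the figures above:

    # Implied lower bounds from the "less than 1%" comparisons above.
    print(1700 / 0.01)   # 170000.0 -> the instrument must cost more than ~$170k
    print(500 / 0.01)    # 50000.0  -> generating one genome's data must cost more than ~$50k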

Posted in genomics, IT




9 Responses to “Bioinformatics and cloud computing”

  1. David, you beat me to the financial analysis post! Nice analysis. One thing that might change the calculation is taking into account the cost of using fewer EC2 cores. As you point out, we may not need to finish in 2.8 hours.

    As you may have seen in the last paragraph of my post (http://hpcinfo.com/2009/11/22/benchmarking-the-cloud-for-genomics/), a significant premium is paid to get this done in 2.8 hours using 320 cores instead of 6.5 hours using 80 cores. Due to the non-linear scaling, the cost per hour of elapsed time goes from around $8 to around $29 (this uses the EC2 costs only, whereas your $125 figure includes the storage costs as well).
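    A rough reconstruction of those per-elapsed-hour figures, assuming 8 cores per instance at the $0.68/instance-hour rate quoted later in this thread (the small gap to the $8 and $29 above presumably comes from EC2 line items not included here):

        # Rough per-elapsed-hour costs at the two cluster sizes discussed above,
        # assuming 8 cores per instance at an assumed $0.68 per instance-hour.
        rate_per_instance_hour = 0.68
        cores_per_instance = 8

        for cores, elapsed_hr in [(320, 2.8), (80, 6.5)]:
            instances = cores // cores_per_instance
            per_hour = instances * rate_per_instance_hour
            print(f"{cores} cores: ~${per_hour:.2f}/elapsed hour, "
                  f"~${per_hour * elapsed_hr:.2f} of instance time total")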

    Also, lacking an analysis of the CPU usage efficiency on the EC2 nodes, one cannot necessarily say that we’d need the same quantity of cores to complete the analysis in the same time frame.

  2. Ben Langmead Says:

    Hi David,

    Analyses like the above are, I think, a great way of advancing the field’s conversation about cloud computing. I’m really glad you’re taking it up.

    My main comment is that you’re comparing the cloud cost against only one type of cost: the one-time cost of buying new machines and adding them to your (already large, at least at Wash U) pool of computers. That isn’t the only relevant number for a lot of people, especially those in smaller institutions and academic departments, because (a) there are recurring costs for electricity, cooling, and space, and (b) there isn’t necessarily a huge pool of computers (and support staff, and space) to begin with, so the initial cost and effort barrier can be much larger than the cost of the machines per se.

    To someone facing larger barriers, the fact that computation costs so much less than the sequencing machine (or more importantly, the sequencing consumables) will probably push them toward the cloud rather than away. Your situation is relatively special; The Genome Center has an existing, large pool of computational power, steady work, and a lot of like-minded sequencing people under one roof. Academics don’t necessarily have any of those things.

    I’ve only heard anecdotal accounts of people calculating recurring-cost comparisons for local vs. cloud, and I’m told that cloud beats local by 2 or 3x. That’s secondhand, so I hope you’ll try it yourself.

    Thanks for the interest – I look forward to the future posts.

    Best,
    Ben

  3. If most biological applications could be done in the cloud, cloud computing would be really promising. Unfortunately, there are too few cloud applications. If we want to do some things on local computers (e.g. image analysis, base calling and post-alignment analysis) and other things in the cloud (e.g. alignment), why not run everything locally, since we do not need many more resources given the current advances in alignment algorithms? Furthermore, cloud computing greatly raises the barrier for software development. Most developers would be reluctant to spend a lot of time learning Hadoop when they already have their algorithms working locally, which makes the situation worse.

    In my view, cloud computing can only become popular when someone designs a generic, modular framework. In the simpler case of Crossbow, I think it would be essential for it to allow other aligners/SNP callers to be plugged in. Crossbow could define the interface, or the required command-line options, and any aligner/SNP caller that implements this interface could run in the cloud. It would be even better to define a more generic interface for other applications, such that any command-line tool could be used in the cloud. That would be harder, though.
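    As far as I know nothing like this exists in Crossbow itself; a minimal sketch of the kind of pluggable command-line interface being described might look like the following (every name here is hypothetical, and the command templates are illustrative rather than the tools’ exact invocations):

        # Hypothetical sketch of a generic plug-in interface for command-line
        # aligners/SNP callers; the names and command templates are illustrative
        # and do not come from Crossbow.
        import shlex
        import subprocess
        from dataclasses import dataclass

        @dataclass
        class Tool:
            """A command-line tool described by a template; the framework fills
            in {reads}, {ref}, {aln}, {snps} on each worker node."""
            name: str
            template: str

            def run(self, **params: str) -> None:
                subprocess.run(shlex.split(self.template.format(**params)), check=True)

        # Any aligner or SNP caller that fits the template contract could be dropped in.
        aligner = Tool("bowtie", "bowtie {ref} {reads} {aln}")
        caller = Tool("soapsnp", "soapsnp -i {aln} -d {ref} -o {snps}")

        def process_chunk(reads: str, ref: str) -> None:
            """What one map task might do with one chunk of reads."""
            aligner.run(reads=reads, ref=ref, aln=reads + ".aln")
            caller.run(aln=reads + ".aln", ref=ref, snps=reads + ".snps")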

  4. Clive G. Brown Says:

    Hi David,

    An excellent analysis – I am also a bit of a cloud sceptic (somebody has to be). You are absolutely right: there is enough time to analyse a GA run on a pretty cheap computer. Even with storage, as long as you don’t keep more than a couple of runs’ worth of raw data, you are looking at a relatively cheap system. Amortise that over the lifetime of the instrument (or number of runs) and it comes out pretty competitive at around $150 per run all in (yes, with heat and power etc.). (This, like yours, is a class of calculation that goes right back to the early days of Solexa – nothing new.)

    Even if cloud is a bit cheaper, we’re clearly not talking 10-100× – there’s a lot of wastage elsewhere in most operations that can drown out such small cost savings, like failed runs, libraries and reagent kits, which can easily add up to many thousands (not to mention employee costs, which are always the largest).

    (There are also benefits to owning the hardware, like control – and I’m not convinced that all of that software can be run concurrently, with many different resource usage profiles, in a way that maintains the low costs and short execution times, at least not without a lot of re-writing of software. A lot of next-gen sequencing bioinformatics apps have heavy and demanding I/O requirements, and I’m not sure that can be generically abstracted in a way that ensures every user gets what they want all of the time.)

    c

  5. Won’t it be great when we can just type this:

    cat *.fq | bwa | sam2bam | ./findsnps | ./filter4reality | sort | uniq -c | sort -n | head

    then go check your fave blog for a minute or two, come back and see the results.

    someday.

  6. Sucha Sudarsanam Says:

    One important aspect of cloud computing is not technical at all (how much CPU, memory, storage, etc.); there is an important social aspect. Anyone who wants to try a novel computational idea, or start a business based even on an existing method, can now do so with minimal friction. If the idea works, great; otherwise you turn off the server in the cloud and walk away.

    My hope is that cloud computing will help generate innovative ideas and help to implement novel business plans.

  7. [...] I’m working with Eucalyptus learning how to set it up, and then configure a slim Linux image that could be scaled out. From there, add the useful applications to it, make it a template others could use on their own Euca setups, or EC2, or both, to do map/reduce, or whatever work they want. This is where my expertise ends, I just want to facilitate the community to be able to get to that point. But, to address that point – I sent an email out to the group: “All — Nick posted this to Twitter, but I wanted to highlight it for everyone here https://politigenomics.com/2009/11/bioinformatics-and-cloud-computing.html [...]

  8. I had the opposite reaction — the cluster pricing from Amazon seems like a bargain.

    To repeat what Ben Langmead said above, the total cost of ownership of a computer, even for a university, is much higher than its purchase cost. For instance, how many computers does each sysadmin manage (or how much time does it take to manage new operating system patches, software installs, etc.)? How much space do they take up? The power for these beasts is not inconsiderable.

    When you run machines hammer and tongs with disks flying and memory working full tilt, they tend to wear out pretty quickly.

    My wife’s having trouble with her cluster at NYU because the building’s heating and cooling are both tied to the same faulty plumbing system; so even though it’s winter here in NYC, when the heat went out, so did the machine room cooling, so they had to shut down all the machines for a day or two. Just like when the AC went out in the summer.

    NYU’s machines are also prone to infection by viruses. They had to completely rebuild their SOLiD cluster for that reason, which also set them back in time and money. It’s such a huge problem that SOLiD service reps just show up with giveaway 1GB thumb drives they won’t even take back.

    Amazon’s pricing seems to have gone down from what you’re quoting. Amazon’s EC2 extra-large instance gives you a four-core, 15GB machine for US$0.68/hour, or a two-core, 8GB machine for half that. If you can get away with a 32-bit OS on a single core, it’s only US$0.085/hour. That translates into roughly 100 days (about 400 core-days), 200 days (again about 400 core-days), and 800 days, respectively, for $1700.
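    Those day counts follow directly from the quoted rates; for example (the rates are the ones given above, the labels are just shorthand, and the figures above are rounded versions of these):

        # How long $1700 lasts at the hourly rates quoted above.
        budget = 1700.0
        for label, rate, cores in [("4-core / 15 GB", 0.68, 4),
                                   ("2-core / 8 GB", 0.34, 2),
                                   ("1-core / 32-bit", 0.085, 1)]:
            hours = budget / rate
            print(f"{label}: ~{hours / 24:.0f} days (~{hours * cores / 24:.0f} core-days)")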

    Do you really only run an analysis once? I see people continually rerunning with different settings, different software, against different assemblies, with public (e.g. GEO) data, etc. All of which requires more overhead at particular times, not necessarily a huge cluster all the time.

  9. Bob, regarding the costs, I used the same numbers you did. The analysis in the paper took 320 cores × 3 hr = 960 core·hr and did require the larger instance type (because of the memory requirements of Bowtie). Those 320 cores were spread across 40 instances, so the computational cost was 40 instances × 3 hr × $0.68/instance·hr = $81.60. The additional cost was for data transfer and storage in S3, bringing the total up to about $125.
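    The same arithmetic in a few lines, backing out what is left of the ~$125 for transfer and storage (how that remainder splits between the two is not something I have broken down):

        # Verify the compute figure and back out the transfer/storage remainder.
        instances, hours, rate = 40, 3, 0.68   # 40 instances x 8 cores = 320 cores
        compute = instances * hours * rate
        print(f"EC2 compute:            ${compute:.2f}")         # $81.60
        print(f"transfer + S3 storage:  ~${125 - compute:.2f}")  # ~$43.40 of the ~$125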

    For responses to other parts of your comment (and other people’s comments), see my subsequent post, Head in the clouds.
