August 30th, 2010 dd Posted in genomics, IT, politics No Comments »
I just added the WordPress Mobile Pack plugin to the site. When browsing from a mobile device, you should get a small-screen friendly view (and you won’t see the video below).
July 1st, 2010 dd Posted in genomics, IT 1 Comment »
Monya Baker recently published a Technology Feature in Nature Methods that discusses the use of cloud computing in genomics. I, along with several other people in the genome informatics community, was interviewed for the article. Until I saw the picture of Vivien Bonazzi in the article, I did not know she played guitar. I guess next time I am in DC I’ll have to challenge her to a guitar duel.
(Note: the video above is just an amusing example of a guitar duel, it is not intended in any way as a comment on Vivien’s or my personality. Vivien is great and me… well, you may have a point there. It is also worth noting that Bobby is not actually playing the “Holy Trinity of Rock ‘n’ Roll”, E-A-B. The chords being played are E-G-A E-G-B♭-A E-G-A-G-E. The older among us will recognize that as the same progression as the main riff from Deep Purple’s classic rock anthem Smoke On The Water.)
In his post, Matthew Dublin states that I want to bring everyone “back to square one” because I say that the solution to the computing challenges in genomics will likely involve a mixture of internal and external resources. The reality is that most people are currently using local resources and, as those resources become more and more underpowered compared to their needs, they will extend their workflows to leverage external resources as well. In other words, researchers are not likely to scrap their current computing infrastructures and migrate entirely to the cloud when their computing needs grow beyond their existing resources. Hopefully, by the time most people need to spill over into external resources, middleware systems will exist that intelligently schedule jobs to appropriate computational resources, internal or external, with a minimal amount of job metadata from the bioinformatician submitting the job.
Here’s a video hint for those who do not understand the reference in the title of this post.
June 23rd, 2010 dd Posted in genomics, IT 12 Comments »
Whenever you get asked about a recent genome publication or the latest sequencing technology, the conversation invariably turns to cost. It turns out, cost is a tricky thing. When people talk of the “cost” of the Human Genome Project, they typically quote the cost for the entire project, a cost that includes sequencing instruments (several revisions), personnel, overhead, consumables, informatics, and IT. They contrast this rather large cost with the much lower cost of the $10,000 or $1,000 genome. However, in reality that “$10,000 genome” costs more than $10,000 (same goes for the $1,000 genome). You see, when people talk about the $10,000 genome, they are only accounting for the cost of consumables: flow cells and reagents. Perhaps this focus on consumables has its roots in the days of the Human Genome Project, when reagent (BigDye®) costs dominated sequencing costs. Perhaps the focus is driven by marketers at the sequencing instrument companies who want to draw attention away from the six-figure sequencing instrument costs. Perhaps this focus is driven by the $10,000 recurring cost number specified by the Archon X Prize, which receives much more attention than the $1 million direct cost cap. Regardless of the reason for the focus on consumables (likely some combination of all of the above), the reality is that consumable costs have fallen much more rapidly than any other cost associated with genome sequencing and can no longer be the only number quoted when stating the cost of a genome, at least if you want that number to actually mean anything.
So, what other costs should be considered? Well, the types of costs and actual values will depend greatly on your situation. Will you be doing the sequencing or will you be contracting with a core facility or sequencing-as-a-service company? Will you be doing the analysis or relying on a third party? How will you be validating your results? How many people will be working on the project and at what percent effort? Will you buy everyone a Pet Rock when the project reaches one exabase of sequence?
Here I’ll run through a standard cost calculation for a typical academic sequencing and analysis center to sequence and analyze a human genome. The names and costs have been changed to protect the innocent (this means I chose nice, round numbers that are the right order of magnitude). Why not use real numbers? Read the previous paragraph (I’ll wait …): your cost factors and numbers will not be the same as anyone else’s. So you’re going to have to do the calculation for yourself, not just lift the numbers from this post.
First we can consider the consumables (e.g., flow cells and reagents) costs. Let’s say those are $10,000. Then there is the instrument depreciation. Let’s say the instrument costs $600,000, has an expected life of three years, and can do 40 runs per year. Assuming straight-line depreciation, the instrument depreciation per run is $5,000 (= $600,000 / (3 × 40)). If the instrument supports two flow cells, you would cut that number in half to get $2,500. Now, the DNA doesn’t just hop on the sequencer by itself. DNA has to be acquired, consents signed and approved by institutional review boards (IRBs), and sequencing libraries have to be made. Let’s say sample acquisition costs $100,000 for 50 samples; that’s $2,000 per sample. Shepherding the project and consents through the IRB takes one full-time employee (FTE) at 10% effort for one month. We’ll say the cost of one FTE (salary, benefits, etc.) is $60,000 per year. So getting the project through IRB approval costs $500. If the project is able to use all 50 samples, that’s only $10 per sample! If the consumables and personnel time to make a sequencing library cost $200, then the total production cost for sequencing our human genome is $14,710. Wait, I forgot the IT and LIMS support! In this scenario we’ll say that each instrument needs one IT FTE and one LIMS FTE, each at 25% effort ($750). And you need disk space for the data ($1,000; you can cut that in half if you throw away everything but the sequence, qualities, and alignments) and compute time ($100) to run alignments and QC. Add to that the 50% overhead charge that your institution takes to cover administration, utilities, lab space, etc. (a company would need to determine each of these costs and add them in rather than use this overhead multiplier), and your $10,000 genome costs you nearly $25,000. And you haven’t even called a variant yet.
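To make the arithmetic easier to follow (and to re-run with your own numbers), here is a minimal Python sketch of the production-side calculation. All of the dollar figures are the same illustrative, made-up round numbers used above, not real costs.

```python
# Illustrative production cost per genome, using the made-up round numbers from the text.

FTE_COST_PER_YEAR = 60_000  # salary + benefits, hypothetical
OVERHEAD = 1.5              # 50% institutional overhead multiplier

consumables = 10_000                              # flow cells and reagents
instrument_depreciation = 600_000 / (3 * 40) / 2  # 3-year life, 40 runs/year, 2 flow cells/run
sample_acquisition = 100_000 / 50                 # $100,000 for 50 samples
irb = (FTE_COST_PER_YEAR / 12) * 0.10 / 50        # 1 FTE, 10% effort, 1 month, spread over 50 samples
library = 200                                     # library construction consumables and labor
it_and_lims = 750                                 # 1 IT FTE + 1 LIMS FTE at 25% effort, per run
disk = 1_000                                      # storage for the run's data
compute = 100                                     # alignment and QC compute

subtotal = (consumables + instrument_depreciation + sample_acquisition
            + irb + library + it_and_lims + disk + compute)
total = subtotal * OVERHEAD

print(f"production subtotal: ${subtotal:,.0f}")  # about $16,560
print(f"with 50% overhead:   ${total:,.0f}")     # about $24,840, i.e., "nearly $25,000"
```

Swap in your own depreciation schedule, effort levels, and overhead rate; the structure of the calculation is the point, not the specific numbers.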
Speaking of variants, let’s assume you want to call SNPs, indels, and structural variations. The first thing you will have to do is align your reads. Let’s say you are efficient and simply use the alignments from the production QC step. Above we assumed $100 for these alignments, but what goes into that number? First you have to determine an average alignment time per genome. Let’s say 90 Gb of sequence (30× coverage of a human genome) in 2×100 base read pairs takes 1,000 core×hr to align to the human reference genome. If you did this on Amazon EC2 ($0.17/core×hr), it would cost you $170 (plus data transfer and storage costs). If you have your own cluster, you need to amortize the cost of your cluster (compute nodes, racks, networking equipment and cabling, PDUs, etc.) per core×hr, add in the cost of your administrators per core×hr, and utilities or overhead per core×hr to get your cost. When you do that calculation, let’s say you get $0.10 per core×hr, so the alignment costs you $100 (but you already paid it above). Merging the BAM files from each lane’s worth of data and marking duplicates takes 50 hours, costing $5. Calling SNPs and indels (including reassembly) takes 100 hours, costing $10. Detecting structural variation using aberrant read pairs takes 200 hours, costing $20. Annotating all the variants across an entire genome takes 100 hours, costing $10. The disk space for all of this costs you $1,000 (again, you’ll need to calculate a cost per GB, factoring in storage, racks, switches, servers, personnel, etc., to get this number). Finally, somebody needs to run (or automate) this analysis pipeline. Figure that one analyst and one developer, each at 10% effort, can accomplish this over the course of two weeks: $480. Add all this up and your analysis with overhead runs you about $2,300, or about 10% of the cost of generating the data. Of course, human resequencing for variant detection is not the only application of sequencing data. Other types of analysis, e.g., de novo assembly and metagenomic analysis, can have significantly higher costs per base. For example, in metagenomic analysis you may want to classify reads that do not align to known sequences by aligning them in protein space against a database like NCBI nr. If you generate 10 Gb of sequence per sample and 25% of the read pairs do not align to anything else, you will need to align 12.5 million reads. If you use the most common tool for this sort of alignment, NCBI BLAST+ blastx, it would take over 5,500 core×hr, costing about $550 by itself.
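Purely as a sketch with the same made-up rates (a fully amortized local core-hour at $0.10, EC2 on-demand at the $0.17 figure quoted above), the analysis-side arithmetic looks like this:

```python
# Illustrative downstream analysis cost, using the made-up rates from the text.

LOCAL_RATE = 0.10   # $/core-hour, fully amortized local cluster (hypothetical)
EC2_RATE = 0.17     # $/core-hour, on-demand figure quoted in the text
OVERHEAD = 1.5

core_hours = {
    "merge + duplicate marking": 50,
    "SNP/indel calling": 100,
    "structural variation detection": 200,
    "variant annotation": 100,
}
compute = sum(core_hours.values()) * LOCAL_RATE  # $45 (alignment already paid in production)
disk = 1_000                                     # storage for analysis products
people = 2 * 0.10 * 2 * (60_000 / 50)            # analyst + developer, 10% effort, 2 weeks

analysis_total = (compute + disk + people) * OVERHEAD
print(f"analysis with overhead: ${analysis_total:,.0f}")  # about $2,300

# The 1,000 core-hour alignment itself, local versus EC2 (ignoring transfer and storage):
print(f"alignment locally: ${1_000 * LOCAL_RATE:,.0f}, on EC2: ${1_000 * EC2_RATE:,.0f}")
```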
Now that you have your sequence data and list of variants, you are going to need to validate them. There are a lot of different ways to validate variants (e.g., PCR, pool-and-sequence, or Sequenom), so I am not going to go through a detailed cost calculation. Suffice it to say that, depending on the number of variants you want to validate, the cost can rise into the thousands of dollars. Whatever platform you choose, you will need to go through a thorough cost calculation (like the one done above for the original sequencing and analysis). For the sake of this post, which is already too long, we’ll say the validation cost is $2,000.
Finally, somebody has to be running this show. Let’s say project management personnel cost $20,000, or $400 per sample. Put this all together and your $10,000 genome costs about $30,000. In other words, the often-quoted consumables number only accounts for about 50% of the total cost (note: overhead applies to consumables also, so while $10,000 looks like one third of $30,000, it is actually half). Again, none of the numbers I use above are real (but they are in the ballpark), and all sequencing and analysis facilities are going to have different contributors to their costs, resulting in varying contributions from consumables. However, regardless of the current cost contribution of consumables, the cost of consumables is projected to fall below $5,000 by the end of this year, and it won’t stop there. As such, it is already meaningless to quote only consumable costs when stating the price of sequencing a genome. By the end of the year, it will be ridiculous.
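The parenthetical note is just saying that overhead applies to the consumables too, so the quoted $10,000 actually covers about half of the total, not a third. A quick sanity check with the same illustrative numbers (it glosses over whether overhead applies to validation and simply uses the $2,000 figure as-is):

```python
OVERHEAD = 1.5
production = 24_840                              # from the production sketch above
analysis = 2_290                                 # from the analysis sketch above
validation = 2_000                               # ballpark figure from the text
project_management = (20_000 / 50) * OVERHEAD    # $400 per sample, plus overhead

total = production + analysis + validation + project_management
consumables_loaded = 10_000 * OVERHEAD           # consumables carry overhead too

print(f"total per genome: ${total:,.0f}")                      # about $30,000
print(f"consumables share: {consumables_loaded / total:.0%}")  # about 50%, not one third
```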
Update: Clarified Archon X Prize cost accounting.
BigDye is a registered trademark of Life Technologies.
April 21st, 2010 dd Posted in genomics, IT No Comments »
A previous cloud post, Puff piece, has gotten a bit of attention from Informatics Iron and from Jason Stowe. While the Informatics Iron piece was positive, Mr. Stowe took issue with some of the points I made. First, he says that my claim that IT and software engineering are needed to get things running on the cloud is inaccurate.
You are implying that to get running in the cloud, an end user must worry about the “IT expertise” and “software engineering” needed to get applications up and running. I believe this is a straw-man, an incorrect assertion to begin with. One of the major benefits of virtualized infrastructure and service oriented architectures is that they are repeatable and decouple the knowledge of building the service from the users consuming it. This means that one person, who creates the virtual machine images or the server code running the service, does need the expertise to get an application running properly in the cloud. But after that engineering is done once, a whole community of end-users of that service can benefit without knowledge of the specifics of getting the application to scale.
For example, does everyone that uses GMail/Yahoo/Hotmail know every line of software code to make it run? Do they know every operational aspect of how to make mail scale to tens of thousands of processors across many data centers?
Definitely not, and the point is they don’t have to. The same is true for high performance and high throughput computing. To give examples of free services that don’t require end user software engineering or IT expertise to do bioinformatics/proteomics/etc.:
- The NIH Website for BLAST has, for years, been running BLAST as a service so that researchers can use GUIs to run queries on parallel back-end infrastructure (see http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=9606). This requires no complicated knowledge or software engineering for scientists to run BLAST as a Service.
- Tools like ViPDAC have 2-minute tutorial videos to run proteomics on Amazon Web Service.
His argument is absolutely correct when dealing with established systems, applications, and work flows. For use cases like email and running BLAST, there is no need for additional software engineering or IT expertise (other than getting on the internet). In fact, The Genome Center has long offered a BLAST service for anyone to use. Further, over the past few weeks, several prepackaged bioinformatics work flows that run on the cloud (or some approximation thereof) have been announced: CycleCloud for Life Sciences from Mr. Stowe’s company Cycle Computing, an offering from Bio-Team, ChIP-seq and RNA-seq analysis pipelines from DNAnexus, the work flows available in Galaxy, and of course the previously published Crossbow. Unfortunately, canned analyses are not the norm in bioinformatics. Bioinformaticians love to tinker, trying to get just a little more biological information out of their data sets. The result is that bioinformatics applications and work flows are constantly being tweaked, updated, and improved. Because of this, maintenance of these pipelines is a huge burden. The supporters of these generic pipelines must work constantly to update and verify software, or the users will constantly be waiting for the latest fix to be applied or the latest feature to be available (anyone who installs each new version of velvet can attest to this). The saving grace in all of this is that as the use of sequencing becomes more widespread, the percentage of people doing the analysis who classify as bioinformaticians will decrease (greatly). This means that a larger and larger percentage of people with sequence data to analyze will likely not be interested in tweaking analysis pipelines but will just want to run something and get an answer. It is this ever-growing group of people that will greatly benefit from easy-to-use analysis tools, whether they are deployed on the cloud or not. Both Mr. Stowe and I agree that creating easy-to-use tools for non-bioinformaticians is a very worthwhile goal. Unfortunately, the proliferation of existing tool options (e.g., maq, bwa, bowtie, bfast, soap, novoalign, etc.), now layered with a proliferation of cloud offerings, will make it even more difficult for non-experts to choose which pipeline is best to use. Therefore, approaches like those taken by Cycle Computing and GenomeQuest that provide default analysis pipelines and the ability for bioinformaticians to create and share their own work flows are the most likely to be successful. The development of these generic, distributed analysis frameworks that also provide useful defaults is an even more worthwhile goal because it achieves two important ends: ease of use for non-experts and the ability for bioinformaticians to tinker. Bioinformaticians are more likely to find tools like these useful and therefore will be early adopters, choose the best platforms, establish best practices on those platforms, and publish results using them; the non-experts will then follow.
Mr. Stowe’s other objection relates to my point that no process scales linearly with the number of cores. He concedes that point but points out:
In fact, regardless of whether the job is linearly scalable, most companies and research institutions don’t have 1 cluster to 1 user scenarios. There are multiple users with multiple jobs each. What if you have 10 crossbow users with 10 runs to do on various genomes? Then you can get 100x performance on the *workflow as a whole*.
Again, this is true, but, to be fair, that is not the same point he made in his original article. His original point was that if you needed your analysis to run faster you could just provision more nodes. I just pointed out that this is true, but you would likely pay a premium for that because nothing scales linearly. It may seem like a fine distinction, but with all the misinformation around clouds nowadays, it’s an important one to make. It should also be noted that without good software engineering and system administration, even algorithms that should scale nearly linearly might not. The take-home message is that if someone has done that software engineering and systems administration work to make a program scale well and run well in a cloud environment and made it available to you, great. If not, someone is going to have to do it.
I had the opportunity to meet Mr. Stowe at the XGen Congress and have talked more with him this week at Bio-IT World Conference and Expo (my talk is tomorrow at 11 a.m. EDT in Track 3: Bioinformatics and Next-Gen Data). We had a good discussion about cloud computing and its role in bioinformatics (they’ve got a cool solution to the Amazon storage problem). As you can hopefully tell from this post, we are largely in agreement: engineering is needed, but once it is done, everyone benefits. Cycle Computing certainly has a lot of good expertise in the cloud, so if you need some engineering done, shoot him an email. Unfortunately, they probably will not be able to help you access the .
March 10th, 2010 dd Posted in genomics, IT No Comments »
If you are going to be at XGen next week and you are interested in cloud computing and its application to bioinformatics, be sure to stop and participate in the Cloud Computing in Bioinformatics discussion I will be “facilitating” on Wednesday morning (March 17). My talk is at 3:05 p.m. PT on Tuesday and I will be chairing the first session on Monday (if my plane is on time and the taxi is fast enough).
March 10th, 2010 dd Posted in genomics, IT 1 Comment »
The Genome Center recently received word that its grant proposal for a data center was approved (St. Louis Business Journal). The $14.3 million grant, along with about $8 million from Washington University, will allow us to essentially duplicate our current data center capacity. We took possession of our current data center in May 2008 and it is already 80-90% full, so this new data center will greatly help us keep pace with all of the sequencing we are undertaking.
February 24th, 2010 dd Posted in genomics, IT No Comments »
I recently did an interview in advance of my talk at the XGen Congress next month in San Diego. The interview is about 14 minutes long and discusses our work at The Genome Center in general and, more specifically, the software and IT infrastructure we have created to enable the analysis of the massive amounts of sequence data we generate. The interview is available to download as part of the XGen Congress podcast series.
February 19th, 2010 dd Posted in genomics, IT 6 Comments »
I updated the Next-Generation Sequencing Informatics table a few weeks ago but forgot to mention it on the blog. The main update was the 50G configuration of the Illumina GA IIx. Also, the Sides & Associates blog linked to my table but described it inaccurately. Just to clarify, this table represents average throughput for production systems: not vendor claims about throughput, not future vaporware (and Alejandro Gutierrez corrected his description in the post once I pointed this out). As new systems come online and further improvements are made to existing platforms, the table will be updated.
February 16th, 2010 dd Posted in genomics, IT 1 Comment »
Why should one be skeptical of all the information touting the wonders of cloud computing? This older, in-depth piece by Gartner, Hype Cycle for Cloud Computing, 2009, lays out the reasons pretty well. But one need not spend that much time reading about it. You can simply read this much shorter piece by Jason Stowe: Is the Future of High-Performance Computing for Life Sciences Cloudy? Reading that story, one can only get the impression that the cloud is some panacea where all computational problems are solved. In fact, the picture is so rosy that one may become suspicious. So suspicious that one may read the About the Author section at the bottom of the piece and see that Mr. Stowe happens to be the CEO of a company selling cloud computing services.
Jason Stowe is the founder and CEO of Cycle Computing, a provider of high-performance computing (HPC) and open source technology in the cloud. A seasoned entrepreneur and experienced technologist, Jason attended Carnegie Mellon and Cornell Universities.
No wonder he makes cloud computing sound so attractive. No mention of the IT expertise needed to get up and running on the cloud. No mention of the software engineering needed to ensure your programs run efficiently on the cloud. It may not be apparent from his article, but a program that runs well on one or ten computers does not necessarily run well on hundreds of computers. In fact, he implies the exact opposite.
For compute clusters as a service, the math is different: Having 40 processors work for 100 hours costs the same as having 1,000 processors run for 4 hours.
It may cost the same under that scenario, but not everything scales linearly. In fact, most things don’t, and that less-than-linear scaling actually ends up making it cost more to get a shorter turnaround. This was clearly evident in the Crossbow paper, where it cost $52 to complete the analysis in 6.5 hours but $84 to finish it in under 3 hours (Table 4). The article fails to mention this, a marvel given that the lack of good, scalable bioinformatics tools that run well in highly parallel environments is perhaps the largest impediment to the adoption of cloud computing in bioinformatics. Of course, I am sure he will gladly sell you consulting services to get you up and running on the cloud. In short, this looks like a shill.
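To see why less-than-linear scaling makes a faster turnaround cost more under pay-per-node-hour pricing, here is a toy model in the Amdahl’s law spirit. The serial fraction and prices are assumptions for illustration only, not the measured values from the paper cited above.

```python
# Toy model: cost of a job on N rented nodes when only part of it parallelizes.
# The serial fraction and price are assumptions for illustration, not measured values.

def cost(n_nodes, serial_hours=0.5, parallel_hours=24.0, price_per_node_hour=0.68):
    """Return (wall-clock hours, total cost) when the parallel portion splits across n_nodes."""
    wall_clock = serial_hours + parallel_hours / n_nodes
    return wall_clock, n_nodes * wall_clock * price_per_node_hour

for n in (4, 8, 16, 32):
    hours, dollars = cost(n)
    print(f"{n:>2} nodes: {hours:5.2f} h  ${dollars:6.2f}")

# With perfect (linear) scaling the bill would be the same at every node count;
# with any serial work or per-node overhead, buying a shorter wall clock costs more,
# the same pattern as the $52 / 6.5 h versus $84 / under 3 h example above.
```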
Unfortunately, omitting information is not the only problem with many of the stories about cloud computing; many also contain misinformation. For example, the story Gathering clouds and a sequencing storm in Nature Biotechnology mentions the software engineering challenges but erroneously states
…bioinformaticians might not be willing to spend the time to familiarize themselves with hadoop, the open source program needed to process large data sets on a cloud
What?!? You do not have to develop tools using Hadoop. Sure, it is a nice platform that provides fault-tolerant parallelism, but it is by no means required by any cloud provider that I know of (not even Google, whose MapReduce framework provided the model for Hadoop!), nor is it the only way to achieve parallel processing (far from it). Amazon EC2 just provides you with a virtual machine with a basic operating system installed on it and remote access. You can do whatever you want with it after that. Google and Microsoft do require that you develop your code in their cloud frameworks, but you do not have to use Hadoop. For information on what you do have to do to run jobs on the major cloud providers, check out this article by Udayan Banerjee and each provider’s web site: Amazon EC2, Google App Engine, and Microsoft Windows Azure.
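As a concrete (and entirely hypothetical) illustration that parallelism on a plain EC2-style virtual machine does not require Hadoop, a few lines of standard-library Python are enough to fan a batch of per-sample jobs out across the cores of a rented machine. The command name and file names below are placeholders, not a specific product.

```python
# Minimal sketch: run an existing command-line tool over many inputs in parallel
# on an ordinary virtual machine, using only the Python standard library (no Hadoop).
# "align_tool" and the file names are placeholders for whatever you actually run.

import glob
import subprocess
from concurrent.futures import ProcessPoolExecutor

def run_one(fastq_path):
    """Run the (hypothetical) aligner on a single input file and return its exit code."""
    out_path = fastq_path + ".bam"
    result = subprocess.run(
        ["align_tool", "--reference", "human_ref.fa",
         "--input", fastq_path, "--output", out_path],
        capture_output=True,
    )
    return fastq_path, result.returncode

if __name__ == "__main__":
    inputs = sorted(glob.glob("reads/*.fastq"))
    # One worker per core by default; on a cloud VM you simply pick a larger instance.
    with ProcessPoolExecutor() as pool:
        for path, code in pool.map(run_one, inputs):
            status = "ok" if code == 0 else f"failed ({code})"
            print(f"{path}: {status}")
```

The same pattern generalizes: batch schedulers, GNU parallel, MPI, or a simple work queue all run on cloud machines just as they do locally; Hadoop is one option, not a prerequisite.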
(How many bad cloud puns can I work into post titles? Stay tuned.)
January 28th, 2010 dd Posted in genomics, IT 1 Comment »
There is a story about regional data centers in the Jan/Feb 2010 issue of St. Louis Commerce Magazine that includes a section on our Genome Data Center, the only regional data center to achieve LEED certification (and gold at that!). Unfortunately, the issue seems to be available only as part of a Flash application, so I cannot link to the story, only to the issue, and tell you that the data center story starts on page 62 and the Genome Data Center section is on page 64. This issue of the magazine also includes stories on cloud computing and Washington University in St. Louis Chancellor Mark Wrighton (and high-speed rail, of course).