PolITiGenomics

Politics, Information Technology, and Genomics

Congratulations on reaching the $1000 genome

August 3rd, 2012

Recently there has been a spate of talks, press releases, and articles about the absurdity of the $1000 human genome, e.g., Cancer, Data and the Fallacy of the $1000 Genome. No doubt this has contributed to the somewhat muted response to Life Technologies’ announcement that they will attempt to win the using the Ion Torrent Proton platform. While I agree that talk of the $1000 human genome is irrelevant, it’s not for the same reason as everyone else. Most people cite sequence analysis costs, not typically included in the $1000 per genome estimate, as the reason that talk of a consumables-only $1000 genome is not relevant. That is a red herring (but more on that later). The real reason that the $1000 human genome is no longer interesting is that, for all intents and purposes, we have already achieved the $1000 human genome. “What?!?” you say, “a human genome costs $5000 to sequence!” Sure, you’re right, but those are just details. Compared to $1 billion (the approximate cost of the first human genome), the difference between $1000 and $5000 is a rounding error. The reality is that the current cost of sequencing a human genome is well within the cost of diagnostic tests in common use in health care. From another perspective, the cost of sequencing a human genome has fallen into the range of an expensive vacation, i.e., there are people who at present are getting their genomes sequenced for recreational purposes. So, congratulations, we did it!

Now, what of all this talk of $1 million or $100,000 to analyze the sequence data from a human genome? Does analyzing sequence data from a single human genome cost that much? Well, it certainly can, but it need not. And particularly for the clinical market, it certainly won’t and it is preposterous to posit that it will. The confusion arises from people failing to distinguish between research and clinical analysis. While research projects on cancer are trying to better understand cancer, to expand our knowledge of the disease, the clinical application of genome sequencing to cancer, or any other disease for that matter, will be focused on improving diagnosis and treatment. The current reality is that we know precious little about how the genome works, and we are even less able to translate that knowledge into an improved diagnosis or to connect that diagnosis to a treatment. In other words, the amount of actionable information one gets from the genome or transcriptome sequence is relatively small compared to the massive amount of presently uninterpretable information. Cancer research and more fundamental investigations into how the genome works (e.g., ENCODE) will expand our knowledge and potential actions over time, but for now there are only dozens to hundreds of possible actionable outcomes. Bottom line, the analysis needed to convert a genome sequence into these possible actions should not cost more than $100. So if you are a real stickler on getting the total cost of sequencing and analysis of a clinical human genome below $1000 (or less than one-third what Myriad charges to assay a few genes in their BRCA test), we just need to get the cost of sequencing below $900 and we’re there.

None of this is to say that the cost of sequencing and analyzing a human genome does not matter; it most certainly does. In research, reducing the cost by a factor of two means you can double the number of samples you are able to sequence, potentially greatly increasing the power of your study. So we must continue to strive to reduce costs, but like I said, for all intents and purposes we have already hit an amazing milestone. So enjoy your weekend, you deserve it.


Pretty Vacant

July 13th, 2012

It’s been a long while since I posted, undoubtedly a bit too long for many to even be paying attention any more. I have a few ideas kicking around in my head where a blog post is likely a better forum than or . Until I wrestle them out, check out this article about some rewarding work we did last year to help treat a colleague’s cancer recurrence. Below is a short video about the effort.


Going Mobile

August 30th, 2010

I just added the WordPress Mobile Pack plugin to the site. When browsing from a mobile device, you should get a small-screen friendly view (and you won’t see the video below).


Striking at the root

August 3rd, 2010

If you are at all interested in a government of the people, by the people, and for the people, the presentation below by Lawrence Lessig is well worth the 18 minutes.


Brain Cloud

July 1st, 2010

Monya Baker recently published a Technology Feature in Nature Methods that discusses the use of cloud computing in genomics. I, along with several other people in the genome informatics community, was interviewed for the article. Until I saw the picture of Vivien Bonazzi in the article, I did not know she played guitar. I guess next time I am in DC I’ll have to challenge her to a guitar duel.

(Note: the video above is just an amusing example of a guitar duel, it is not intended in any way as a comment on Vivien’s or my personality. Vivien is great and me… well, you may have a point there. It is also worth noting that Bobby is not actually playing the “Holy Trinity of Rock ‘n’ Roll”, E-A-B. The chords being played are E-G-A E-G-B♭-A E-G-A-G-E. The older among us will recognize that as the same progression as the main riff from Deep Purple’s classic rock anthem Smoke On The Water.)

Over at , Matthew Dublin states in his that I want to bring everyone “back to square one” because I say that the solution to the computing challenges in genomics will likely involve a mixture of internal and external resources. The current reality is that most people are using local resources and, as those resources become more and more underpowered compared to their needs, they will extend their workflows to leverage external resources as well. In other words, researchers are not likely to scrap their current computing infrastructures and migrate entirely to the cloud when their computing needs grow beyond their existing resources. Hopefully, by the time most people need to spill over into external resources, middleware systems will exist that intelligently schedule jobs to appropriate computational resources, internal or external, with a minimal amount of job metadata from the bioinformatician submitting the job.

Here’s a video hint for those who do not understand the reference in the title of this post.


Blame the predecessor

June 30th, 2010

Political commentators play the blame game. Don’t worry, it’s really no one’s fault.


Transcription and Translation

June 25th, 2010

Here is a cool video from DNA: The Secret of Life detailing (in real time, just like the PacBio RS!) the central dogma of molecular biology.


The cost of doing sequencing

June 23rd, 2010

Whenever you get asked about a recent genome publication or the latest sequencing technology, the conversation invariably turns to cost. It turns out, cost is a tricky thing. When people talk of the “cost” of the Human Genome Project, they typically quote the cost for the entire project, a cost that includes sequencing instruments (several revisions), personnel, overhead, consumables, informatics, and IT. They contrast this rather large cost to the much lower cost of the $10,000 or $1,000 genome. However, in reality that “$10,000 genome” costs more than $10,000 (same goes for the $1,000 genome). You see, when people talk about the $10,000 genome, they are only accounting for the cost of consumables: flow cells and reagents. Perhaps this focus on consumables has its roots in the days of the Human Genome Project when reagent (BigDye®) costs dominated sequencing costs. Perhaps the focus is driven by marketers at the sequencing instrument companies who want to draw attention away from the six-figure sequencing instrument costs. Perhaps this focus is driven by the $10,000 recurring cost number specified by the Archon X Prize, which receives much more attention than the $1 million direct cost cap. Regardless of the reason for the focus on consumables (likely some combination of all of the above), the reality is that consumable costs have fallen much more rapidly than any other cost associated with genome sequencing and can no longer be the only number quoted when stating the cost of a genome, at least if you want that number to actually mean anything.

So, what other costs should be considered? Well, the types of costs and actual values will depend greatly on your situation. Will you be doing the sequencing or will you be contracting with a core facility or sequencing-as-a-service company? Will you be doing the analysis or relying on a third party? How will you be validating your results? How many people will be working on the project, and at what percentage of their effort? Will you buy everyone a Pet Rock when the project reaches 1 exabase of sequence?

Here I’ll run through a standard cost calculation for a typical academic sequencing and analysis center to sequence and analyze a human genome. The names and costs have been changed to protect the innocent (this means I chose nice, round numbers that are the right order of magnitude). Why not use real numbers? Read the previous paragraph (I’ll wait …): your cost factors and numbers will not be the same as anyone else’s. So you’re going to have to do the calculation for yourself, not just lift the numbers from this post.

First we can consider the cost of consumables (e.g., flow cells and reagents). Let’s say those are $10,000. Then there is the instrument depreciation. Let’s say the instrument costs $600,000, has an expected life of three years, and can do 40 runs per year. Assuming a straight-line depreciation, the instrument depreciation per run is $5,000 (= $600,000 / (3 × 40)). If the instrument supports two flow cells, you would divide the number in half to get $2,500. Now, the DNA doesn’t just hop on the sequencer by itself. DNA has to be acquired, consents signed and approved by institutional review boards (IRBs), and sequencing libraries have to be made. Let’s say sample acquisition costs $100,000 for 50 samples; that’s $2,000 per sample. Shepherding the project and consents through the IRB takes one full-time employee (FTE) at 10% effort for one month. We’ll say the cost of one FTE (salary, benefits, etc.) is $60,000 per year. So getting the project through IRB approval costs $500. If the project is able to use all 50 samples, that’s only $10 per sample! If the consumables and personnel time to make a sequencing library cost $200, then the total production cost for sequencing our human genome is $14,710. Wait, I forgot the IT and LIMS support! In this scenario we’ll say that each instrument needs one IT FTE and one LIMS FTE, each at 25% effort ($750). And you need disk space for the data ($1,000, you can cut that in half if you throw away everything but the sequence, qualities, and alignments) and compute time ($100) to run alignments and QC. Add to that the 50% overhead charge that your institution takes to cover administration, utilities, lab space, etc. (a company would need to determine each of these costs and add them in rather than use this overhead multiplier) and your $10,000 genome costs you nearly $25,000. And you haven’t even called a variant yet.
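The arithmetic in the paragraph above is easy to lose track of, so here is a minimal Python sketch of it. The variable names are mine, the numbers are the same made-up round figures used in this post, and the $750 IT/LIMS figure is taken as stated rather than derived. Swap in your own values.

```python
# Illustrative round numbers from the post; none of these are real costs.
FTE_COST_PER_YEAR = 60_000   # salary, benefits, etc.
OVERHEAD_RATE = 0.50         # institutional overhead multiplier
SAMPLES = 50

consumables = 10_000                          # flow cells and reagents
depreciation_per_run = 600_000 / (3 * 40)     # straight-line: 3 years, 40 runs per year
depreciation = depreciation_per_run / 2       # instrument runs two flow cells at once
sample_acquisition = 100_000 / SAMPLES
irb = (FTE_COST_PER_YEAR / 12) * 0.10 / SAMPLES   # one FTE, 10% effort, one month
library_prep = 200

sequencing_subtotal = (
    consumables + depreciation + sample_acquisition + irb + library_prep
)

it_and_lims = 750   # taken directly from the post
disk = 1_000
compute = 100       # alignment and QC

production_before_overhead = sequencing_subtotal + it_and_lims + disk + compute
production_total = production_before_overhead * (1 + OVERHEAD_RATE)

print(f"Sequencing subtotal:        ${sequencing_subtotal:,.0f}")
print(f"Production before overhead: ${production_before_overhead:,.0f}")
print(f"Production with overhead:   ${production_total:,.0f}")
```

With these inputs the subtotal matches the $14,710 quoted above and the total with overhead lands just under $25,000.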

Speaking of variants, let’s assume you want to call SNPs, indels, and structural variations. The first thing you will have to do is align your reads. Let’s say you are efficient and simply use the alignments from the production QC step. Above we assumed $100 for these alignments, but what goes into that number? First you have to determine an average alignment time per genome. Let’s say 90 Gb of sequence (30× coverage of a human genome) in 2×100 base read pairs takes 1,000 core×hr to align to the human reference genome. If you did this on Amazon EC2 ($0.17/core×hr), it would cost you $170 (plus data transfer and storage costs). If you have your own cluster, you need to amortize the cost of your cluster (compute nodes, racks, networking equipment and cabling, PDUs, etc.) per core×hr, add in the cost of your administrators per core×hr, and utilities or overhead per core×hr to get your cost. When you do that calculation, let’s say you get $0.10 per core×hr, so the alignment costs you $100 (but you already paid it above). Merging the BAM files from each lane’s worth of data and marking duplicates takes 50 core×hr, costing $5. Calling SNPs and indels (including reassembly) takes 100 core×hr, costing $10. Detecting structural variation using aberrant read pairs takes 200 core×hr, costing $20. Annotating all the variants across an entire genome takes 100 core×hr, costing $10. The disk space for all of this costs you $1,000 (again, you’ll need to calculate a cost per GB factoring storage, racks, switches, servers, personnel, etc. to get this number). Finally, somebody needs to run (or automate) this analysis pipeline. Figure that one analyst and one developer each at 10% effort can accomplish this over the course of two weeks; $480. Add all this up and your analysis with overhead runs you about $2,300, or about 10% of the cost of generating the data. Of course, human resequencing for variant detection is not the only application of sequencing data. Other types of analysis, e.g., de novo assembly and metagenomic analysis, can have significantly higher costs per base. For example, in metagenomic analysis you may want to classify reads that do not align to known sequences by aligning them in protein space against a database like NCBI nr. If you generate 10 Gb of sequence per sample in 2×100 base read pairs and 25% of the read pairs do not align to anything else, you will need to align 12.5 million read pairs. If you use the most common tool for this sort of alignment, NCBI BLAST+ blastx, it would take over 5,500 core×hr, costing about $550 by itself.
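For those who want to plug in their own numbers, the analysis costs above can be sketched in a few lines of Python. Everything here is illustrative: the names are mine, the rates are the round figures from this post, and the 50 working weeks per year is my assumption (it is what makes the personnel figure come out to $480).

```python
# Illustrative figures from the post; replace with your own.
CLUSTER_RATE = 0.10      # dollars per core-hour on an in-house cluster
EC2_RATE = 0.17          # dollars per core-hour quoted for Amazon EC2
FTE_COST_PER_YEAR = 60_000
WORK_WEEKS_PER_YEAR = 50  # assumption, chosen to reproduce the $480 figure
OVERHEAD_RATE = 0.50

# Alignment (already paid for in the production QC step).
alignment_core_hr = 1_000
alignment_ec2 = alignment_core_hr * EC2_RATE
alignment_cluster = alignment_core_hr * CLUSTER_RATE

# Downstream analysis steps, in core-hours.
steps_core_hr = {
    "merge BAMs and mark duplicates": 50,
    "call SNPs and indels": 100,
    "detect structural variation": 200,
    "annotate variants": 100,
}
compute = sum(steps_core_hr.values()) * CLUSTER_RATE
disk = 1_000
# One analyst and one developer, each at 10% effort, for two weeks.
personnel = 2 * 0.10 * FTE_COST_PER_YEAR * (2 / WORK_WEEKS_PER_YEAR)

analysis_total = (compute + disk + personnel) * (1 + OVERHEAD_RATE)

# Metagenomic example: unaligned read pairs sent to blastx.
bases = 10e9
read_pairs = bases / (2 * 100)       # 2x100 base read pairs
unaligned_pairs = read_pairs * 0.25
blastx_cost = 5_500 * CLUSTER_RATE

print(f"Alignment: ${alignment_cluster:,.0f} in-house, ${alignment_ec2:,.0f} on EC2")
print(f"Analysis with overhead: ${analysis_total:,.2f}")
print(f"Unaligned read pairs: {unaligned_pairs:,.0f}; blastx: ${blastx_cost:,.0f}")
```

Note how little of the total is compute time: disk and people dominate in this scenario.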

Now that you have your sequence data and list of variants, you are going to need to validate them. There are a lot of different ways to validate variants, e.g., PCR, pool, and sequence or Sequenom, so I am not going to go through a detailed cost calculation. It suffices to say that, depending on the number of variants you want to validate, the cost can rise into the thousands of dollars. Whatever platform you choose, you will need to go through a thorough cost calculation (like the one done above for the original sequencing and analysis). For the sake of this post, which is already too long, we’ll say the validation cost is $2,000.

Finally, somebody has to be running this show. Let’s say project management personnel cost $20,000, or $400 per sample. Put this all together and your $10,000 genome costs about $30,000. In other words, the often quoted consumables number only accounts for about 50% of the total cost (Note: overhead applies to consumables also, so while $10,000 looks like 1/3 of $30,000, it is actually half). Again, none of the numbers I use above are real (but they are in the ballpark) and all sequencing and analysis facilities are going to have different contributors to their costs, resulting in varying contributions from consumables. However, regardless of the cost contribution of consumables at present, the cost of consumables is projected to fall below $5,000 by the end of this year, and it won’t stop there. As such, it is already meaningless to only quote consumable costs when stating the price of sequencing a genome. By the end of the year, it will be ridiculous.
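Here is a rough sketch of that final tally in Python, using the rounded figures quoted in this post. One caveat: the post does not spell out whether overhead is charged on validation and project management, so the sketch computes the total both ways. That choice, like the variable names, is an assumption for illustration.

```python
# Rounded, illustrative figures quoted in the post (not real costs).
OVERHEAD_RATE = 0.50
production = 25_000                  # "nearly $25,000", overhead included
analysis = 2_300                     # "about $2,300", overhead included
validation = 2_000
project_management = 20_000 / 50     # $400 per sample
consumables_with_overhead = 10_000 * (1 + OVERHEAD_RATE)

# Assumption: compute the total with and without overhead on the last two items.
total_without = production + analysis + validation + project_management
total_with = production + analysis + (
    (validation + project_management) * (1 + OVERHEAD_RATE)
)

for label, total in (
    ("no overhead on validation/management", total_without),
    ("overhead on validation/management", total_with),
):
    share = consumables_with_overhead / total
    print(f"{label}: total ${total:,.0f}, consumables share {share:.0%}")
```

Either way the total sits near $30,000 and consumables, once their share of overhead is counted, account for roughly half of it.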

Update: Clarified Archon X Prize cost accounting.

BigDye is a registered trademark of Life Technologies.


Internally inconsistent

May 13th, 2010

The same news commentators who defended the previous administration’s shortcomings now use those same incidents to label the current administration’s difficulties (starting at about 5:00). So are they now saying that the previous administration had failures, or that the current administration is handling them well? I suppose what they are really saying is that there are limits to their powers of persuasion (“Rosebud!”).


Rationale of a human-truck hybrid

April 28th, 2010

Lawrence Lessig uses Sen. Scott Brown’s (R-MA) inability to explain why he opposes the financial reform bill as further reason that .

Scott Brown, Massachusetts’ new senator, opposes legislation in Congress that would strengthen regulations for Wall Street.

But when a reporter recently asked him why he’s against this bill, Brown couldn’t give an answer. He’s against financial reform, but he has no idea why.

Let me help Senator Brown: During his campaign last year, Brown received half of his campaign contributions from Wall Street and business executives. He benefited from another million dollars in issue ads by the U.S. Chamber of Commerce. They oppose the bill, so Senator Brown opposes the bill. It’s no wonder Pew recently found that trust in Congress is at its lowest point ever.

I focused on Scott Brown, but the influence of special interest money pervades both parties in both chambers. Americans are right to suspect that their representatives are merely doing the bidding of those funding their campaigns.

Last week, I recorded a new episode of the Change Congress Chronicles, talking about Scott Brown, the economy of influence in Washington, and the path to reform. Take a few minutes to watch, and then please share it with anyone you know who is fed up with our electoral system:

Congress can fix our campaign finance system right now by passing the Fair Elections Now Act, which would create an opt-in system of citizen-funded elections. But to get this bill written into law, we must build enough grassroots support so that Congress has no choice but to listen.

Whatever your party affiliation, whatever change you seek — it won’t happen until we Change Congress.

So head over to and take action.