Computational challenges in the future of large-scale sequencing

I'm preparing a plenary talk on Computational Challenges in the Future of Large-Scale Sequencing that I'll give at a workshop on The Future of the Large-Scale Sequencing Program, which the NIH National Human Genome Research Institute (NHGRI) is convening in two weeks (Bethesda, 23-24 March 2009). The workshop is part of a long-range planning process for NHGRI. Mine will be the one talk that focuses on computational biology's future needs, so I'm aiming to present a community consensus view as guidance to NHGRI, not just my own view.

What do you think NHGRI should hear?

The main issues will be:

  • Data quality: Will we have adequate means to capture and encode quality measures on large-scale datasets?
  • Data storage: Will labs have the infrastructure to store the data that they will generate in the next ten years? Do we need new methods and best practices for deciding what data are kept and what are thrown away?
  • Data communication: Will labs have the necessary network infrastructure to transfer data between labs, or between labs and public databases? Will we need new standards for datatypes, or even for data compression? Do we have the right infrastructure for public databases (at NCBI and elsewhere)?
  • Data analysis: Will labs have access to the computing hardware they need? To the analysis software they need? To the human expertise they need?
  • Data integration: Especially as NHGRI moves into large-scale human clinical sequencing, we will need to integrate sequence and medical phenotype data. What infrastructure will we need?

One thing I'll be trying to do is to project, at a high level, the exponential growth curves for various kinds of large-scale data (primarily sequence data, but also image and phenotype data associated with sequence-based assays, and curation and literature in the public sequence databases), versus exponential growth curves for storage, network, and processors, to identify the most important choke points in the ability of computational infrastructure to handle the job.
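To make the choke-point argument concrete, here is a minimal back-of-envelope sketch (in Python) of the kind of comparison I have in mind. The starting values and doubling times below are placeholders I've invented purely for illustration, not measured figures; the point is only that when sequence output doubles faster than storage per dollar, the gap between them opens exponentially rather than gradually.

```python
# Back-of-envelope sketch of the growth-curve comparison described above.
# All starting values and doubling times are illustrative assumptions, not
# measured data; the exercise shows how a choke point emerges when data
# generation doubles faster than the infrastructure that must absorb it.

def growth(start, doubling_time_months, months):
    """Value of an exponentially growing quantity after `months` months."""
    return start * 2 ** (months / doubling_time_months)

# Assumed parameters (hypothetical, for illustration only):
seq_tb_per_year_now  = 10.0   # TB/year of sequence-related data a lab generates today
seq_doubling_months  = 6.0    # assumed doubling time of sequencing output
disk_tb_affordable   = 10.0   # TB of storage a lab can afford today
disk_doubling_months = 18.0   # assumed doubling time of storage per dollar

for year in range(0, 11):
    m = 12 * year
    data = growth(seq_tb_per_year_now, seq_doubling_months, m)
    disk = growth(disk_tb_affordable, disk_doubling_months, m)
    flag = "  <-- data outruns affordable storage" if data > disk else ""
    print(f"year {year:2d}: data {data:12.1f} TB/yr  vs  storage {disk:10.1f} TB{flag}")
```

With these made-up numbers the gap opens within a year or two; the real exercise is to plug in measured doubling times for sequencing output, disk, network bandwidth, and CPU, and to do the same for image, phenotype, curation, and literature data.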

From that base, I'd like to identify opportunities for high impact investment by NHGRI in the general computational biology infrastructure. Should we invest in research in data storage and transmission algorithms, or is that all pretty obvious computer science that can just be applied? Should we invest in development of top-down data standards (including quality measures), or in ways to facilitate evolution and propagation of grassroots community standards? Should NHGRI take more of a role in investing in public software and algorithm development, or are there ways to facilitate more effective private investment? Should NHGRI fund better computational infrastructure (clusters and networks), and if so, at what level -- individual labs, departments, institutions, or NSF-like national computing centers?

The angle I'm likely to focus the most on is the democratization of large-scale sequencing. I believe that in the next ten years, it will (must!) become commonplace for the average R01-funded individual lab to use next-generation sequencing technology and collect large-scale sequence information -- assembling a new species' genome for comparative analysis, mapping interesting traits by complete sequencing of many individuals in a population, mapping new mutants simply by complete genome sequencing. And not just genomic DNA sequence, but also all the applications of sequencing technology as a digital replacement for microarrays and gels: ChIP-Seq, RNA-Seq, and other such sequence tagging strategies. I don't think the infrastructure is ready for this at any level: we don't have best practices established for data storage or transmission, nor for public data archives, nor for journal publication; individual labs don't have sufficiently powerful hardware, robust software, or knowledgeable humans to make the best use of the data. We're at a crossroads, where we are already pushing the technologies out of the genome centers and into individual labs, but that means that biology is going to have to establish the same sort of high-quality computational infrastructure that other fields, such as high-energy physics, have had to develop.

Do you agree? Am I missing key issues? I'd like to hear feedback that I can incorporate into my talk, either in email or in comments here.

I would particularly like specific examples or specific data to back up any given point. As a disciple of Tufte, I'd rather assemble substance, not bullet points.