The next five years of computational genomics at NHGRI


The NIH National Human Genome Research Institute is in the process of planning its next five years of research in genomics, including computational genomics. In the past couple of months I’ve been at two planning meetings – the Cloud Computing workshop (31 March – 1 April) and most recently the NHGRI Informatics and Analysis Planning Meeting (21-22 April). Goncalo Abecasis and I got the job of trying to summarize the consensus of the Informatics and Analysis meeting. I’m not sure how good I am at identifying consensus, but I just sent off four pages of notes to Vivien Bonazzi, one of NHGRI’s informatics program officers, describing some of my personal views of the future, strongly colored by the discussions at the planning meetings. I thought I’d share the same comments here. Transparency in government and all that. Just imagine all the potential conflicts of interest here; good thing I’m paid by a dead billionaire, not so much by federal tax dollars.

Anyone with an interest in how the sausage is made at NIH might want to peek under the hood, below.

Plan explicitly for sustainable exponential growth. We keep using metaphors of data “tsunamis” or “explosions”, but these metaphors are misleading. Big data in biology is not an unexpected disastrous event that we have to clean up after. The volume of data will continue to increase exponentially for the foreseeable future. We must make sober plans for sustainable exponential growth (this is not an oxymoron).
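As a back-of-the-envelope illustration of why sustainable exponential growth is not an oxymoron, here is a small sketch; the starting sizes, growth rate, and price-decline rate are hypothetical placeholders, not NHGRI projections.

```python
# Toy projection of archive size vs. annual storage cost.
# All numbers below are hypothetical placeholders, not NHGRI figures.

data_tb = 100.0        # archive size today, in terabytes (assumed)
cost_per_tb = 100.0    # storage cost today, in $/TB/year (assumed)
data_growth = 2.0      # data volume doubles each year (assumed)
price_decline = 0.5    # cost per TB halves each year (assumed)

for year in range(1, 6):
    data_tb *= data_growth
    cost_per_tb *= price_decline
    print(f"year {year}: {data_tb:7.0f} TB at ${cost_per_tb:6.2f}/TB/yr "
          f"= ${data_tb * cost_per_tb:7.0f}/yr total")

# If the price per TB falls at least as fast as the data volume grows, the
# annual storage budget stays flat even though the archive itself grows
# exponentially. If data grows faster than prices fall, the budget grows
# exponentially too, and that is the case that demands explicit planning.
```

The planning question is therefore not how to stop the growth, but how to keep data growth at or below the rate at which storage and computing get cheaper, and to decide deliberately what to triage when it doesn't.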

Use convening power to connect to outside expertise. The wider world of computing research is now well aware that biology is a major growth area. It was great to see organizations like Amazon, Google, and Microsoft represented at this meeting, as well as several of the DOE national laboratories and other academic HPC experts. NHGRI should aggressively expand its efforts to build connections not only cross-institute, but cross-agency (especially DOE, but also DoD, Agriculture, and other agencies that do high performance computing in biology), and into the parts of the commercial sector (Microsoft, Google, Amazon) where there is substantial expertise and interest in large scale data analysis. NHGRI’s “convening power” is a way to effect change while making the most of limited funding resources.

Focus on democratization. A major challenge facing us is that large-scale data generation methods — sequencing, mass spec, imaging, and more — are increasingly available to individual investigators, not just to large NHGRI-funded genome centers. To enable individual investigators to make effective use of large datasets, we must create an effective infrastructure of data, hardware, and software. NHGRI has extensive experience in big data, and can lead and catalyze across the NIH.

Datasets should be “tiered” into different representations appropriate to different data volumes. Currently we tend to default to archiving raw data – which not only maximizes storage/communication challenges, but also impedes downstream analyses that require processed datasets (genome assemblies, sequence alignments, or short reads mapped to a reference genome, rather than raw reads). We should push to develop different levels of standardized representations of sequence data, including:

  1. Data structure(s) for lossless representation of a large number of genome sequences from a population (such as thousands of human genomes) by differences amongst them (“ancestral variation graphs”, AVGs, were mentioned at the meeting).
  2. Data structure(s) for lossless representation of a large number of short reads from a ChIP-seq or RNA-seq type experiment by position/difference against a stable reference genome.
  3. Data structures for histograms of short reads mapped to a reference genome coordinate system for a particular ChIP-seq or RNA-seq experiment. Many analyses of ChIP-seq and RNA-seq data don’t need the actual reads, only the histogram.

A principal challenge in creating these tiers is the appropriate representation of uncertainty and data quality as data move from a rawer tier to a more processed tier.
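To make the third tier concrete, here is a minimal sketch of my own (not a data structure specified at the meeting) that collapses mapped reads into a per-base coverage histogram on reference coordinates; the read tuples and the mapping-quality cutoff are hypothetical, and the cutoff stands in crudely for the uncertainty/quality bookkeeping just described.

```python
# Minimal sketch: build a per-base coverage histogram (tier 3) from mapped
# reads, discarding low-confidence mappings. Inputs are hypothetical.

from collections import defaultdict

def coverage_histogram(mapped_reads, min_mapq=10):
    """Collapse mapped reads into per-base coverage per reference chromosome.

    mapped_reads: iterable of (chrom, start, length, mapq) tuples, 0-based.
    Returns {chrom: {position: depth}}. Reads below min_mapq are dropped,
    one crude way a processed tier can record a quality decision that was
    made before the raw reads were discarded.
    """
    cov = defaultdict(lambda: defaultdict(int))
    for chrom, start, length, mapq in mapped_reads:
        if mapq < min_mapq:
            continue
        for pos in range(start, start + length):
            cov[chrom][pos] += 1
    return cov

# Toy example: three 5 bp reads on a hypothetical "chr1"; the third is
# filtered out by its low mapping quality.
reads = [("chr1", 100, 5, 60), ("chr1", 102, 5, 60), ("chr1", 103, 5, 5)]
print(dict(coverage_histogram(reads)["chr1"]))
# {100: 1, 101: 1, 102: 2, 103: 2, 104: 2, 105: 1, 106: 1}
```

A real implementation would use run-length or binned encodings (as wiggle-style formats do) rather than a per-position dictionary, but the point stands: the histogram tier is compact, stable, and sufficient for many ChIP-seq and RNA-seq analyses.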

Plan for more kinds of data than genome sequence. Mass spectrometry and imaging also generate large datasets. NHGRI is naturally most focused on sequence data, but the same lessons we learn in handling sequence data need to be applied in other areas.

Reduce waste on subscale computing infrastructure. NIH-funded investigators with large computing needs are typically struggling. Many are building inefficient computing clusters in their individual labs. This is the wrong scale of computing, and it is wasting NIH money. Several competing forces are at work:

  1. Racked clusters are the current state of the art for storage and computing infrastructure. They require specialized space, power, and cooling, and highly skilled people to design, build, and manage them. Few individual investigators have the critical mass to justify the expense or to do this well; well-run clusters operating at reasonably high load are, at minimum, department- or institute-level resources.
  2. At the same time, overly centralized (regional or national) supercomputing resources don’t work well in this field. High performance computing in computational biology has a fundamentally different workflow from HPC in other fields. We face problems of integrating numerous unreliable, complex, rapidly changing datasets. Analyses often depend on custom programs that are used for a day and thrown away. We need to be intimately, interactively close to our data analyses. The latency of pushing individual commands and data to and from a remote computing facility is too great a burden; time and again, we find in practice that computational biologists simply do not use remote computing facilities, preferring local ones. HPC experts in other fields tend to assume that this reflects a lack of education in HPC, but many computational genomicists have extensive experience in HPC. I assert that it is in fact an inherent structural issue: computing in biology simply has a different workflow in how it manipulates datasets. For example, a strikingly successful design decision in our HHMI Janelia Farm HPC resource was to make the HPC filesystem the same as the desktop filesystem, minimizing internal data transfer operations.
  3. Clusters have short lives: typically about three years. They are better accounted for as recurring consumables, not as a one-time capital equipment expense. There are many stories of institutions being surprised when an expensive cluster needs to be thrown away and the same large sum must be spent on a new one.
  4. NIH funding mechanisms are good at funding individual investigators, or at one-time capital equipment expenses, or at charging for services (there are line items on grant application forms that seem to have originated in the days of mainframe computing!), or at large (regional or national) technology resources. Institutions have generally struggled to find appropriate ways to fund equipment, facilities, and personnel for mid-scale technology cores at the department or institute level, and computing clusters are probably the most dysfunctional example.

So the problem here is that NIH funding mechanisms are not well-matched to the state of the art in computing infrastructure. The result is a lot of subscale, inefficient computing in individual labs. Many individual investigators would be better off with a shared, well-run departmental resource. This would also help with standardization and reliability in other areas, such as implementing secure access to restricted human datasets.

NHGRI should identify creative ways to sustainably fund department-scale clustered computing resources.

Cloud computing deserves development, but is not yet a substitute for local infrastructure. Cloud computing will have an increasing impact. It offers the prospect of offloading the difficult infrastructural critical mass issues of clustered computing to very large facilities — including commercial clouds such as Amazon EC2 and Microsoft Azure, but also academic clouds, perhaps even clouds custom-built for genomics applications.

Cloud computing also offers the prospect of “moving the compute to the data” — allowing arbitrary computations on centrally hosted large datasets — which is another technique we can use to alleviate data volume challenges.

At the same time, remote cloud resources will be initially difficult to use for the same reasons that regional or national supercomputing centers have had little impact on computational genomics: our field puts a premium on high interactivity with the datasets and low latency to a command-line prompt.

The success of cloud computing in computational genomics will depend on the development of ways to hide the fact that the cloud is a remote resource, and have the resource behave as if it were on the desktop. Improved networks and network-based computing models will increasingly enable this.

This necessarily means that making cloud resources useful to individual investigators will require a layer of new, user-friendly software development; but our field is already critically challenged by software development.
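To make concrete what such a layer might look like, here is a minimal sketch of a thin client that submits a job to a remote compute service and pulls back only the processed result, leaving the bulky intermediate data on the remote side. The service URL, endpoints, and JSON fields are hypothetical, purely for illustration; no particular provider or API is implied.

```python
# Hypothetical thin client: submit data to a remote compute service, poll
# for completion, and retrieve only the small processed result. The URL,
# endpoints, and field names are invented for illustration.

import time
import requests

BASE = "https://compute.example.org/api/v1"   # hypothetical service

# Submit a FASTA file for analysis; the server returns a job identifier.
with open("reads.fasta", "rb") as fh:
    job = requests.post(f"{BASE}/jobs",
                        files={"input": fh},
                        data={"tool": "mapper"}).json()

# Poll until the job finishes; the heavy lifting happens remotely.
while True:
    status = requests.get(f"{BASE}/jobs/{job['id']}").json()
    if status["state"] in ("done", "failed"):
        break
    time.sleep(10)

# Fetch the processed result (e.g., a coverage histogram), not the raw data.
if status["state"] == "done":
    print(requests.get(f"{BASE}/jobs/{job['id']}/result").json())
```

From the investigator's point of view, the remote resource behaves like a local tool; whether that layer is built as web services, as provider tooling, or as genomics-specific middleware is exactly the kind of question the pilot projects discussed below should answer.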

We should invest in pilot projects in cloud computing to spur software development and eventual adoption of cloud-based models. We should not, however, view cloud computing as a near-term solution for biological analysis; we should view these pilot projects as technology research, while we simultaneously enable stable current technology (racked clusters at departmental level) to be deployed for current biological analysis needs.

NHGRI should also take an active role in liaison, communication, and collaboration with large cloud providers. The main issues (making cloud resources appear to be on the desktop) will be more cost-effectively addressed by general solutions implemented by large-scale cloud providers, with NHGRI keeping a hand in the game to make sure those solutions also address our needs, rather than by NHGRI funding development of a lot of expensive, specialized cloud software on its own.

Spur better software development. Traditional academia and funding mechanisms do not reward the development of robust, well-documented research software; at the same time, the history of commercial software viability in a narrow, rapidly-moving research area like computational genomics is not at all encouraging. It is a fundamental ethical principle that research tools published in the scientific literature must be freely and openly available for reproduction and extension; this necessarily hinders commercialization of such tools.

NHGRI should identify a number of overlapping creative solutions to the software development problem. Some ideas include:

  1. A program that funds developers in much the same way that HHMI or the NIH Pioneer Awards fund people, not projects. NHGRI could allocate stable long-term funding to a small but influential number of individual developers. The history of the field is that the best software is often an unplanned labor of love from a single investigator; the history of software development shows that the disparity between the best developers and average ones is enormous, so business studies recommend models that enable highly skilled developers to focus on what they do best; and the best developers are often quirky people who don’t write grants well.
  2. A small number of “centers of software excellence”, similar to or building upon the National Centers for Biomedical Computing, where the task is not to invent new analysis methodologies, but rather to identify newly published methods in the literature, and integrate and/or implement them into robust and usable software packages — essentially a nonprofit model of what commercial companies would be doing, if a large stable market existed in our space. A lot of computational biology research right now “talks to itself”, sinking and disappearing into specialized journals, without ever being integrated into widely useful tools.
  3. Use SBIRs to catalyze more commercial development. Although the past history of commercial software development in bioinformatics is discouraging, some new trends may change the game somewhat. We may never be able to effectively commercialize the development of cutting-edge analysis methods, where the methods change rapidly and must be openly published. However, there are areas where closed, stable solutions would be fine. The advent of next-generation sequencing is creating a market for “black box” computation where the individual lab just wants certain jobs (like sequence assembly and mapping) done well, rather than wanting their own hands on the analysis. The advent of cloud computing is creating a need for robust “middleware” to connect an individual desktop to analysis methods in the cloud.

Make better integrated informatics and analysis plans in NHGRI big science projects. NHGRI planning of big science projects is generally a top-down, committee-driven process that excels at the big-picture goal but is less well-suited to arriving at a fully detailed and internally consistent experimental design. This is becoming a weakness now that NHGRI is moving beyond data sets of simple structure and enduring utility (such as the Human Genome Project), and into large science projects that ask more focused, timely questions. Without more detailed planning up front, informatics and analysis becomes reactive and defensive, rather than proactive and efficient. The result tends to be a default to “store everything, who knows what we’ll need to do!”, which of course exacerbates all our data problems. Three suggestions:

  1. A large project should have a single “lead project scientist” who takes responsibility for overall direction and planning. This person takes input from the committee-driven planning process, but is responsible for synthesizing a well-defined plan.
  2. That plan should be written down, at a level of detail comparable to an R01 application. Forcing a consensus written plan will enable better cost/benefit analysis and more advance planning for informatics and analysis.
  3. That plan should be peer reviewed by outside experts, including informatics and analysis experts, before NHGRI approves funding.

10 thoughts on “The next five years of computational genomics at NHGRI”

  1. Thanks for a very interesting snapshot of the current thinking in the US. We at SciLifeLab (Sweden) are currently setting up our operations, and this is very useful for our discussions.

    A note on how to make Cloud Computing useful: It is my working hypothesis that well-designed web services (translation: REST, not SOAP) are the key to making Cloud Computing resources (including academic regional compute centers) practically useful to biologists. Having the compute resources available in your local browser is as good as having it on your local desktop, to a first approximation.


  2. Sean, I agree with Per. As I was reading your discussion of cloud computing I was thinking “hmm, I don’t think he does much Web programming” :). AJAX/Web 2.0/REST/XML-RPC/JSON-RPC/Comet etc really change the game with data analysis — or at least I expect they will, as I haven’t seen too many people making use of them. Still, it’s clearly one way forward into cloud computing: you can have your “big data” and “big compute” elsewhere, and interact with them over a fairly thin pipe. I still think HPCs are going to be the only way to go for certain really big jobs (next-gen sequence assembly is one place where the cloud is unlikely to work out well, given the memory requirements) but a lot of the day-to-day server-based stuff can be moved out of the lab and into the cloud, IMO.

    –titus


  3. We’re in agreement here, even though it’s true I don’t do much Web programming. My point to NHGRI is that we’re already struggling with software development in this field on multiple fronts. If cloud computing can be made useful by a good software layer – and I agree that should be possible – then it sounds like we’d have to open a new front in a war we’re already bogged down in. Who’s going to do this software development, and with whose $$, and how should NHGRI orchestrate it?


  4. Pingback: I can haz outreach? Nobody speaks for the end users. | The OpenHelix Blog

  5. Fantastic ideas you’ve proposed – especially “A program that funds developers in much the same way that HHMI or the NIH Pioneer Awards fund people not projects” and “a small number of centers of software excellence… where the task is not to invent new analysis methodologies, but rather to identify newly published methods in the literature, and integrate and/or implement them into robust and usable software packages”.

    So much computational research is published but lost forever because the end result is not usable by biologists. As an example, a PubMed search of “neuron image algorithm” yields 862 papers… it is far more successful to request funding to collaborate with a computer scientist to develop a new algorithm for your project than to request funding to *read* existing papers and try to implement and test a few algorithms for your project. As well, although it might be the same amount of effort, publishing the former is feasible whereas the latter is not. Thus, the end-user community needs expert help in two ways: (1) curation of which algorithms are practically useful and for what purposes, and (2) implementation in a usable software package so algorithms can be quickly tested and compared.

    Computer science departments only support PhD students who develop something novel, irrespective of its utility. NIH and NSF rarely support proposals to make existing algorithms usable, only to develop novel algorithms (with rare exceptions, like http://grants.nih.gov/grants/guide/pa-files/par-08-010.html). Designated programs like you propose, Sean, could provide a huge payoff for the scientific community, with minimal investment of resources. That algorithms ‘languish’ in the literature is really a tragedy of wasted research funds and a significant leak in the pipeline.


  6. Full disclosure up front: I recently joined NCSA, the largest NSF-funded HPC facility in the US, and was previously running a “University-sized” HPC center in Switzerland. Therefore I have some biases in favor of relatively large scale computer facilities, although I am also fully aware of the practical and cultural reasons why such facilities are not favored by computational biologists. I was present at the NHGRI-sponsored meeting, and congratulate Sean on a very balanced and perceptive summary of the discussions. I would like to raise a few points that did not make it to the table during the meeting, though:

    1) I fully agree with Sean that “We must make sober plans for sustainable exponential growth (this is not an oxymoron)” in the face of the “data tsunami”. But we should not underestimate the technical challenges in managing such large data volumes. Large HPC facilities have this experience (e.g. from managing data flows from astronomy or environmental monitoring projects) and could help in designing data centers for genomics that would hold up to the tsunami. These are not off-the-shelf solutions from vendors.

    2) The “spur better software development” section makes a number of excellent suggestions. The need to validate code (i.e. to ensure that it produces the correct results under widely varying circumstances) may be more specifically emphasized. We should also take into account the need to address the very rapidly changing architecture of the machines running the codes, in particular the ubiquitous deployment of many-core processors and highly parallel computer architectures, which are currently the only drivers of increased performance. This will require re-designing the code underlying many “workhorse” applications in bioinformatics as well as acquiring the necessary skill sets to write new code. Again, large HPC facilities have a lot of experience in these areas. They would be ideal partners in software development projects and could also contribute to training the programmers working on genomics applications.

    3) It is clear that clouds are not ready for prime time as workhorses for computational biology, and that deploying them effectively is still a research area. However, clouds are not necessarily synonymous with ho-hum performance. A combination of user-friendly Web services (as suggested by Per and Titus) serving as a front end, and a cloud designed from the start to deliver the type of performance (especially on the I/O side) required for biocomputing applications, could go a long way to “moving the compute to the data” and to providing HPC performance to the community without the usual burdens associated with monolithic HPC facilities. In other words, virtualizing HPC resources (existing or new) may be a way forward, especially if the focus is kept on democratization.


  7. Pingback: Bioinformatics and software development – yet again

  8. Dear Sean, it has been several years since we last spoke and corresponded. The production-scale sandbox for biology informatics is in place at the Data Intensive Discovery Initiative at the University at Buffalo, SUNY, through the generous funding of the NSF, NIH and State of New York ($10,000,000): HPC Cluster (@70 Tflops), GPU Cluster (@170 Tflops) and two massively parallel FPGA-enabled Data Intensive Supercomputers (Netezza & XtremeData). Vipin Chaudhary is leading the effort to bring the query to the data. Kind Regards, Todd

