The next five years of computational genomics at NHGRI

The NIH National Human Genome Research Institute is going through a process of making plans for the next five years of research in genomics, including computational genomics. In the past couple of months I've been at two planning meetings - the Cloud Computing workshop (31 March - 1 April) and most recently the NHGRI Informatics and Analysis Planning Meeting (21-22 April). Goncalo Abecasis and I got the job of trying to summarize the consensus of the Informatics and Analysis meeting. I'm not sure how good I am at identifying consensus, but I just sent off four pages of notes to Vivien Bonazzi, one of NHGRI's informatics program officers, describing some of my personal views of the future, strongly colored by the discussions at the planning meetings. I thought I'd share the same comments here. Transparency in government and all that. Just imagine all the potential conflicts of interest here; good thing I'm paid by a dead billionaire, not so much by federal tax dollars.

Anyone with an interest in how the sausage is made at NIH might want to peek under the hood, below.

Plan explicitly for sustainable exponential growth. We keep using metaphors of data "tsunamis" or "explosions", but these metaphors are misleading. Big data in biology is not an unexpected disastrous event that we have to clean up after. The volume of data will continue to increase exponentially for the foreseeable future. We must make sober plans for sustainable exponential growth (this is not an oxymoron).

Use convening power to connect to outside expertise. The wider world of computing research is now well aware that biology is a major growth area. It was great to see organizations like Amazon, Google, and Microsoft represented at this meeting, as well as several of the DOE national laboratories and other academic HPC experts. NHGRI should aggressively expand its efforts to build connections not only cross-institute, but cross-agency (especially DOE, but also DoD, Agriculture, and other agencies that do high performance computing in biology), and into the parts of the commercial sector (Microsoft, Google, Amazon) where there is substantial expertise and interest in large scale data analysis. NHGRI's "convening power" is a way to effect change while making the most of limited funding resources.

Focus on democratization. A large challenge facing us is that large-scale data generation methods -- sequencing, mass spec, imaging, and more -- are increasingly available to individual investigators, not just large NHGRI-funded genome centers. To enable individual investigators to make effective use of large datasets, we must create an effective infrastructure of data, hardware, and software. NHGRI has extensive experience in big data, and can lead and catalyze across the NIH.

Datasets should be "tiered" into representations appropriate to different uses and data volumes. Currently we tend to default to archiving raw data - which not only maximizes storage/communication challenges, but also impedes downstream analyses that require processed datasets (genome assemblies, sequence alignments, or short reads mapped to a reference genome, rather than raw reads). We should push to develop different levels of standardized representations of sequence data, including:

  1. Data structure(s) for lossless representation of a large number of genome sequences from a population (such as thousands of human genomes) by differences amongst them ("ancestral variation graphs", AVGs, were mentioned at the meeting).
  2. Data structure(s) for lossless representation of a large number of short reads from a ChIP-seq or RNA-seq type experiment by position/difference against a stable reference genome.
  3. Data structures for histograms of short reads mapped to a reference genome coordinate system for a particular ChIP-seq or RNA-seq experiment. Many analyses of ChIP-seq and RNA-seq data don't need the actual reads, only the histogram.

A principal challenge in creating these tiers is the appropriate representation of uncertainty and data quality as data move from a rawer tier to a more processed one.
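To make the tiering idea concrete, here is a minimal sketch of what tiers 2 and 3 might look like. It's a toy illustration in Python, not a proposed standard; the names (MappedRead, coverage_histogram) and the tiny hard-coded reference are invented for the example. The point is simply that a read stored as a mapping position plus differences against a reference is lossless yet compact, while a coverage histogram is lossy but smaller still, and is all that many ChIP-seq and RNA-seq analyses actually need.

```python
# Toy illustration of tiers 2 and 3 (hypothetical names; not a proposed standard).
from dataclasses import dataclass, field
from typing import List, Tuple

REFERENCE = "ACGTACGTACGTACGTACGT"   # stand-in for a stable reference sequence

@dataclass
class MappedRead:
    """Tier 2: lossless record of a short read, stored relative to the reference."""
    start: int                                   # 0-based mapping position
    length: int                                  # read length
    diffs: List[Tuple[int, str]] = field(default_factory=list)  # (offset, observed base)
    quals: List[int] = field(default_factory=list)              # per-base quality scores

    def sequence(self) -> str:
        """Reconstruct the original read from the reference plus its differences."""
        bases = list(REFERENCE[self.start:self.start + self.length])
        for offset, base in self.diffs:
            bases[offset] = base
        return "".join(bases)

def coverage_histogram(reads: List[MappedRead], ref_len: int) -> List[int]:
    """Tier 3: per-position read depth; sufficient for many ChIP-seq/RNA-seq analyses."""
    depth = [0] * ref_len
    for read in reads:
        for pos in range(read.start, min(read.start + read.length, ref_len)):
            depth[pos] += 1
    return depth

reads = [
    MappedRead(start=0, length=8, diffs=[(3, "A")], quals=[30] * 8),
    MappedRead(start=4, length=8, quals=[30] * 8),
]
print(reads[0].sequence())                        # the read is recovered losslessly
print(coverage_histogram(reads, len(REFERENCE)))  # lossy summary, much smaller
```

The same diff-against-reference idea, generalized from reads to whole genomes (and from a linear reference to a graph), is the spirit of tier 1.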

Plan for more kinds of data than genome sequence. Mass spectrometry and imaging also generate large datasets. NHGRI is naturally most focused on sequence data, but the same lessons we learn in handling sequence data need to be applied in other areas.

Reduce waste on subscale computing infrastructure. NIH-funded investigators with large computing needs are typically struggling. Many are building inefficient computing clusters in their individual labs. This is the wrong scale of computing, and it is wasting NIH money. Several competing forces are at work:

  1. Racked clusters are the current state of the art for storage and computing infrastructure. They require dedicated space, power, and cooling, and highly skilled people to design, build, and manage them. Few individual investigators have the critical mass to justify the expense or to do this well; well-run clusters operating at reasonably high load are, at minimum, department- or institute-level resources.
  2. At the same time, overly centralized (regional or national) supercomputing resources don't work well in this field. High performance computing in computational biology has a fundamentally different workflow than HPC in other fields. We face problems of integrating numerous unreliable, complex, rapidly changing datasets. Analyses often depend on custom programs that are used for a day and thrown away. We need to be intimately, interactively close to our data analyses. The latency of pushing individual commands and data to and from a remote computing facility is too great a burden; time and again, actual experience shows that computational biologists simply do not use remote computing facilities, preferring local ones. HPC experts in other fields tend to assume that this reflects a lack of education in HPC, but many computational genomicists have extensive experience in HPC. I assert that it is in fact an inherent structural issue: computing in biology simply has a different workflow in how it manipulates datasets. For example, a strikingly successful design decision in our HHMI Janelia Farm HPC resource was to make the HPC filesystem the same as the desktop filesystem, minimizing internal data transfer operations.
  3. Clusters have short lives: typically about three years. They are better accounted for as recurring consumables than as a one-time capital equipment expense. There are many stories of institutions being surprised when an expensive cluster has to be thrown away and the same large sum spent on a new one.
  4. NIH funding mechanisms are good at funding individual investigators, one-time capital equipment expenses, fee-for-service charges (there are line items on grant application forms that seem to have originated in the days of mainframe computing!), or large (regional or national) technology resources. Institutions have generally struggled to find appropriate ways to fund equipment, facilities, and personnel for mid-scale technology cores at the department or institute level, and computing clusters are probably the most dysfunctional example.

So the problem here is that NIH funding mechanisms are not well matched to the state of the art in computing infrastructure. The result is a lot of subscale, inefficient computing in individual labs. Many individual investigators would be better off with a shared, well-run departmental resource. This would also help with standardization and reliability in other areas (such as implementing secure access to restricted human datasets).

NHGRI should identify creative ways to sustainably fund department-scale clustered computing resources.

Cloud computing deserves development, but is not yet a substitute for local infrastructure. Cloud computing will have an increasing impact. It offers the prospect of offloading the difficult infrastructural critical mass issues of clustered computing to very large facilities -- including commercial clouds such as Amazon EC2 and Microsoft Azure, but also academic clouds, perhaps even clouds custom-built for genomics applications.

Cloud computing also offers the prospect of "moving the compute to the data" -- allowing arbitrary computations on centrally hosted large datasets -- which is another technique we can use to alleviate data volume challenges.
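As a cartoon of what "moving the compute to the data" means in practice, here is a hedged sketch: instead of downloading a huge hosted dataset and summarizing it locally, a small piece of code travels to where the data live and only the small summary comes back. Everything here (the run_at_data function, the in-memory stand-in for a hosted dataset) is invented for illustration; it is not any provider's actual API.

```python
# Sketch of compute-to-data: ship a small summarizing function to the hosted
# dataset; only the summary crosses the network. All names are hypothetical.
from typing import Callable, Dict, Iterable, List

# Stand-in for a large, centrally hosted collection of per-position records.
HOSTED_DATASET: List[dict] = [
    {"chrom": "chr1", "pos": 100, "depth": 12},
    {"chrom": "chr1", "pos": 101, "depth": 15},
    {"chrom": "chr2", "pos": 500, "depth": 3},
]

def run_at_data(summarize: Callable[[Iterable[dict]], Dict]) -> Dict:
    """Pretend remote execution: the function runs next to the data,
    and only its (small) result is returned over the network."""
    return summarize(HOSTED_DATASET)

def mean_depth_per_chrom(records: Iterable[dict]) -> Dict[str, float]:
    """The 'compute' we move: a few lines that reduce the dataset to a summary."""
    totals: Dict[str, int] = {}
    counts: Dict[str, int] = {}
    for r in records:
        totals[r["chrom"]] = totals.get(r["chrom"], 0) + r["depth"]
        counts[r["chrom"]] = counts.get(r["chrom"], 0) + 1
    return {c: totals[c] / counts[c] for c in totals}

print(run_at_data(mean_depth_per_chrom))  # e.g. {'chr1': 13.5, 'chr2': 3.0}
```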

At the same time, remote cloud resources will initially be difficult to use, for the same reasons that regional or national supercomputing centers have had little impact on computational genomics: our field puts a premium on high interactivity with the datasets and low latency to a command-line prompt.

The success of cloud computing in computational genomics will depend on the development of ways to hide the fact that the cloud is a remote resource, and have the resource behave as if it were on the desktop. Improved networks and network-based computing models will increasingly enable this.

This necessarily means that making cloud resources useful to individual investigators will require a new layer of user-friendly software; but our field is already critically challenged when it comes to software development.

We should invest in pilot projects in cloud computing to spur software development and eventual adoption of cloud-based models. We should not, however, view cloud computing as a near-term solution for biological analysis; we should view these pilot projects as technology research, while we simultaneously enable stable current technology (racked clusters at departmental level) to be deployed for current biological analysis needs.

NHGRI should also take an active role in liaison, communication, and collaboration with large cloud providers. The main issue that needs to be addressed (making cloud resources appear to be on the desktop) will be solved more cost-effectively by general solutions from large-scale cloud providers, with NHGRI keeping a hand in the game to make sure those solutions also meet our needs, than by NHGRI funding development of a lot of expensive, specialized cloud software on its own.

Spur better software development. Traditional academia and funding mechanisms do not reward the development of robust, well-documented research software; at the same time, the history of commercial software viability in a narrow, rapidly-moving research area like computational genomics is not at all encouraging. It is a fundamental ethical principle that research tools published in the scientific literature must be freely and openly available for reproduction and extension; this necessarily hinders commercialization of such tools.

NHGRI should identify a number of overlapping creative solutions to the software development problem. Some ideas include:

  1. A program that funds developers in much the same way that HHMI or the NIH Pioneer Awards fund people, not projects. NHGRI could allocate stable long-term funding to a small but influential number of individual developers. The history of the field is that the best software is often an unplanned labor of love from a single investigator; the history of software development shows that the disparity between the best developers and average ones is enormous, so the business literature recommends models that enable highly skilled developers to focus on what they do best; and the best developers are often quirky people who don't write grants well.
  2. A small number of "centers of software excellence", similar to or building upon the National Centers for Biomedical Computation, where the task is not to invent new analysis methodologies, but rather to identify newly published methods in the literature, and integrate and/or implement them into robust and usable software packages -- essentially a nonprofit model of what commercial companies would be doing, if a large stable market existed in our space. A lot of computational biology research right now "talks to itself", sinking and disappearing into specialized journals, without ever being integrated into widely useful tools.
  3. Use SBIRs to catalyze more commercial development. Although the past history of commercial software development in bioinformatics is discouraging, some new trends may change the game somewhat. We may never be able to effectively commercialize the development of cutting-edge analysis methods, where the methods change rapidly and must be openly published. However, there are areas where closed, stable solutions would be fine. The advent of next-generation sequencing is creating a market for "black box" computation where the individual lab just wants certain jobs (like sequence assembly and mapping) done well, rather than wanting their own hands on the analysis. The advent of cloud computing is creating a need for robust "middleware" to connect an individual desktop to analysis methods in the cloud.

Make better integrated informatics and analysis plans in NHGRI big science projects. NHGRI planning of big science projects is generally a top-down, committee-driven process that excels at the big-picture goal but is less well suited to arriving at a fully detailed and internally consistent experimental design. This is becoming a weakness now that NHGRI is moving beyond data sets of simple structure and enduring utility (such as the Human Genome Project), and into large science projects that ask more focused, timely questions. Without more detailed planning up front, informatics and analysis become reactive and defensive, rather than proactive and efficient. The result tends to be a default into "store everything, who knows what we'll need to do!", which of course exacerbates all our data problems. Three suggestions:

  1. A large project should have a single "lead project scientist" who takes responsibility for overall direction and planning. This person takes input from the committee-driven planning process, but is responsible for synthesizing a well-defined plan.
  2. That plan should be written down, at a level of detail comparable to an R01 application. Forcing a consensus written plan will enable better cost/benefit analysis and more advance planning for informatics and analysis.
  3. That plan should be peer reviewed by outside experts, including informatics and analysis experts, before NHGRI approves funding.