Computational challenges in the future of large-scale sequencing


I’m preparing a plenary talk on Computational Challenges in the Future of Large-Scale Sequencing that I’ll give at a workshop on The Future of the Large-Scale Sequencing Program, which the NIH National Human Genome Research Institute (NHGRI) is convening in two weeks (Bethesda, 23-24 March 2009). This is part of a long-range planning process for NHGRI. This will be the one talk at the workshop that focuses on computational biology’s future needs, so I’m aiming to present a community consensus view as guidance to NHGRI, not just my own view.

What do you think NHGRI should hear?

The main issues will be:

  • Data quality: Will we have adequate means to capture and encode quality measures on large scale datasets?
  • Data storage: Will labs have the infrastructure to store the data that they will generate in the next ten years? Do we need new methods and best practices in deciding what data is kept and what is thrown away?
  • Data communication: Will labs have the necessary network infrastructure to transfer data between labs, or between labs and public databases? Will we need new standards for datatypes, or even for data compression? Do we have the right infrastructure for public databases (at NCBI and elsewhere)?
  • Data analysis: Will labs have access to the computing hardware they need? To the analysis software they need? To the human expertise they need?
  • Data integration: Especially as NHGRI moves into large-scale human clinical sequencing, we will need to integrate sequence and medical phenotype data. What infrastructure will we need?

One thing I’ll be trying to do is to project, at a high level, the exponential growth curves for various kinds of large-scale data (primarily sequence data, but also image and phenotype data associated with sequence-based assays, and curation and literature in the public sequence databases), versus exponential growth curves for storage, network, and processors, to identify the most important choke points in the ability of computational infrastructure to handle the job.
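
To make that concrete, here is a minimal sketch of the kind of back-of-envelope projection I have in mind. The doubling times are illustrative placeholders (sequencing output doubling every ~5 months versus disk and CPU on roughly Moore-like 18-24 month schedules), not measured values; the point is only to show how quickly a gap opens between data generation and infrastructure.

```python
# Back-of-envelope projection: sequencing output vs. storage/compute capacity.
# All doubling times are illustrative placeholders, not measured values.

SEQ_DOUBLING_MONTHS = 5      # assumed: per-machine sequencing output
DISK_DOUBLING_MONTHS = 18    # assumed: storage capacity per dollar
CPU_DOUBLING_MONTHS = 24     # assumed: compute per dollar (Moore-like)

def fold_growth(months, doubling_months):
    """Fold increase after `months` for a quantity with the given doubling time."""
    return 2 ** (months / doubling_months)

for years in (1, 5, 10):
    m = 12 * years
    seq = fold_growth(m, SEQ_DOUBLING_MONTHS)
    disk = fold_growth(m, DISK_DOUBLING_MONTHS)
    cpu = fold_growth(m, CPU_DOUBLING_MONTHS)
    print(f"{years:2d} yr: sequence x{seq:12,.0f}  disk x{disk:6.1f}  cpu x{cpu:6.1f}  "
          f"sequence/disk gap x{seq / disk:10,.0f}")
```

Even with generous assumptions for storage and compute, the curves diverge within a few years; the talk will try to put real numbers on where that divergence bites first.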

From that base, I’d like to identify opportunities for high impact investment by NHGRI in the general computational biology infrastructure. Should we invest in research in data storage and transmission algorithms, or is that all pretty obvious computer science that can just be applied? Should we invest in development of top-down data standards (including quality measures), or in ways to facilitate evolution and propagation of grassroots community standards? Should NHGRI take more of a role in investing in public software and algorithm development, or are there ways to facilitate more effective private investment? Should NHGRI fund better computational infrastructure (clusters and networks), and if so, at what level — individual labs, departments, institutions, or NSF-like national computing centers?

The angle I’m likely to focus the most on is the democratization of large-scale sequencing. I believe that in the next ten years, it will (must!) become commonplace for the average R01-funded individual lab to use next generation sequencing technology and collect large-scale sequence information — assembling a new species’ genome for comparative analysis, mapping interesting traits by complete sequencing of many individuals in a population, mapping new mutants simply by complete genome sequencing. And not just genomic DNA sequence, but also all the applications of sequencing technology as a digital replacement for microarrays and gels: ChIP-Seq, RNA-Seq, and other such sequence tagging strategies. I don’t think the infrastructure is ready for this at any level: we don’t have best practices established for data storage or transmission, nor for public data archives, nor for journal publication; individual labs don’t have sufficiently powerful hardware, robust software, nor knowledgeable humans to make the best use of the data. We’re at a crossroads, where we are already pushing the technologies out of the genome centers and into individual labs, but that means that biology is going to have to establish the same sort of high-quality computational infrastructure that other fields have had to develop, such as high-energy physics.

Do you agree? Am I missing key issues? I’d like to hear feedback that I can incorporate into my talk, either in email or in comments here.

I would particularly like to have specific examples or specific data to back up any particular point. As a disciple of Tufte, I’d rather assemble substance, not bullet points.

13 thoughts on “Computational challenges in the future of large-scale sequencing”

  1. I certainly agree that a key issue to focus on is the democratization. In particular, the computational resources and training part of the democratization will be critical. But I do not think high-energy physics is necessarily the right comparison. I think more appropriate comparisons are going to come from more broad-reaching approaches like the use of BLAST at NCBI or iPhone apps or other such things. We need to envision a world where pretty much all biologists would be able to make use of next-gen or next-next-gen sequencing data, but where not all of them will have to become full-fledged informatics people.


  2. A couple of things spring to my mind that you may or may not want to talk about. One is that “we” are still terrible at ensuring reference genomes and assemblies go into the public sequence repositories with proper and consistent versioning. This is making it very difficult for projects that annotate EMBL/GenBank sequences, such as Rfam and UniProt, to provide mappings to everyone’s favourite genome. New policies from e.g. ENSEMBL are coming into place to ensure this, but they are still not happening fast enough and are not being enacted retrospectively.
    Here is a relatively trivial example: if you go to ENSEMBL, you see that they are using human genome version NCBI36. Now if you go to the EMBL genomes page, click on Eukaryotes, and scroll down to human, you have no way of knowing which entry corresponds to NCBI36. There are worse examples which are probably too painful to go into.

    My other related issue, which I imagine will come up whether you mention it or not, is the stale annotation of sequences in EMBL & GenBank. This has been discussed elsewhere (e.g. Proposal to ‘Wikify’ GenBank Meets Stiff Resistance). However, I think the naysayers are dead wrong. It is incredibly painful to get annotation corrected in EMBL. I’ve found “protein” annotations that partially or fully overlap classical ncRNAs (e.g. RNase P). I then have to contact the original authors and tell them that their longest-ORF approach has slurped in a ncRNA sequence. They then have to get this sorted out via some painful process run by EMBL or GenBank. I could do this sort of correction on a very large scale, since I imagine many proteins (and possibly protein families) are in fact due to sloppy annotation. However, the energy barrier for doing this is far too high at the moment.

    The democratization of large-scale sequencing is indeed very cool. Many of the small university groups I used to work with now have these incredible machines. But if re-annotation and/or parallel annotation of sequences is denied to them, then I can imagine valuable results being lost. For example, RNA-seq data is proving to be a beautiful and valuable resource that I could only have dreamed of previously. But where is all that data going to go? How is it going to be correctly and consistently tied to reference genomes?

    I guess I’ve ended up echoing many of your concerns. I also doubt the big centres are going to be able to accommodate this data. They seem to be monolithically slow at responding to new data; perhaps this is the way things should be: they do a few things and they do them well. Maybe some sort of UCSC CustomTracks and/or DAS will save us all instead.


  3. I think the democratization needs to be addressed. If everyone has easy access to a machine that can produce gigabases a day, what kind of software/hardware/training should each individual genetics lab be expected to provide themselves? Will each department have a compbio “buddy,” or will the expertise be expected within the lab? Maybe the software will become good enough to make it as easy (read: GUI-enabled) as the various (mostly closed-source) platforms that non-computational labs already use? I think there should also be encouragement (read: funding) to support labs/companies building easy-to-use software for this scale of data too.

    This flows into the best-practices point. I also think best practices are important to establish. There are already dozens of software packages to process the current crop of data, but how do we integrate from next-gen to next-next-gen? Will it be “let a thousand flowers bloom” and the best will be obvious? These kinds of areas are where leadership is needed. There are always going to be individual labs pushing out the next and best algorithm and innovating on ways to process the data; the incentives to do this are already part of the funding structure and scientific pursuit. But the herding of standards and best practices for data, methodology, dissemination, and how the information is included when publishing is harder to organize organically, and it seems like this is an important part of translating the cutting edge into something appropriately useful to non-compbio research labs and, eventually, to clinical applications.


  4. I have a lot of background doing high performance bioinformatics computing. My comments are based on personal experience.

    As you allude, the high-energy physics community designed their data analysis around a very simple model:

    1) building a hierarchical data distribution system which propagates large amounts of data to many sites efficiently;
    2) building a very high performance network that couples all the sites;
    3) rewriting analysis codes to work on small units with well-defined inputs and outputs;
    4) identifying, implementing, and installing systems that manage large-scale data transfer and remote job submission across the infrastructure.

    With the above, high-energy physics gets a huge amount of work done. Typically jobs scale to the number of CPUs available, not to the interconnect performance or to the disk I/O of the file server. In my experience working in bioinformatics, people rarely design their pipelines around the idea that the computations can be picked up and moved elsewhere. A good software release engineering process applied to bioinformatics data analysis would be very valuable here. Whenever I’ve spent the extra time to write my pipeline in a portable way, it’s paid off, often in ways I never expected.
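
    Something like the following sketch is what I mean by a well-defined unit: every command, input, and output declared up front, so the step can be picked up and re-run anywhere. The tool name, arguments, and file names are hypothetical.

    ```python
    # Sketch of a relocatable pipeline step: everything the step needs is declared
    # up front, so the same unit can run on a laptop, a local cluster, or a grid.
    # The tool name, arguments, and file names below are hypothetical.
    import hashlib
    import json
    import subprocess
    from pathlib import Path

    step = {
        "command": ["hypothetical-aligner", "--threads", "4"],
        "inputs": ["reads.fastq", "reference.fa"],
        "outputs": ["alignments.out"],
    }

    def run_step(step, workdir):
        """Run one step inside its own directory and checksum the declared outputs."""
        workdir = Path(workdir)
        cmd = step["command"] + step["inputs"]
        subprocess.run(cmd, cwd=workdir, check=True)
        manifest = {
            name: hashlib.sha1((workdir / name).read_bytes()).hexdigest()
            for name in step["outputs"]
        }
        (workdir / "manifest.json").write_text(json.dumps(manifest, indent=2))
        return manifest
    ```

    The checksummed manifest is what makes remote execution auditable: a scheduler can verify that the declared outputs were actually produced before shipping them back.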

    Regarding software release engineering: it still boggles me how many major biocomputing codes are nearly impossible to work with. Recently I wanted to work with the NCBI BLAST source; after downloading a huge software package with a bunch of unrelated features, I had to dig through pages of READMEs, then do a compile that took hours, then spend a bunch of time debugging the compile. I hit these problems also when trying to reproduce published work: typically the source code, when I can get it, no longer compiles, doesn’t have tests, doesn’t have quality control, and doesn’t have sample data and output. NIH *should not fund people who refuse to release software with their publication that includes tests and enough data to reproduce the figures in the paper*.

    So my suggestion would be to focus on *shared infrastructure for biocomputing* and *software release engineering*.
    A big issue in my experience is that most people who build sequence analysis pipelines underutilize their CPU capability while hammering their disk I/O (on the file server). This is typically because scientists don’t know the basic costs of the convenience of shared data. So education on shared infrastructure design and how to maximize its utility is also important.


  5. I wish I could see your version of this talk because you are so eloquent and a brilliant forward thinker. I’ve seen “this” talk a dozen times now, and it’s always the same and it always feels wrong.

    DATA

    Gina Costa from Applied Biosystems has the best figure in her talks on how output (GB/machine/day) has grown from classical sequencing, through next-gen, through advances in her lab on the SOLiD. I’d ask for it for your talk.

    On the same graph she plots Moore’s law, and shows that GB sequence output/machine/day is growing far faster than an 18-month doubling.

    Can we really hope to design better compression/transmission/storage faster than we can design higher-throughput sequencing? I think it’s a waste of time. It’s cheaper to re-run the sequencing than to store images from any next-gen platform for even a year. With current oversampling and increasing quality, you hardly need to keep the primary reads, much less transmit them.

    Images (Illumina) = 2 TB / slide; SOLiD is more like 5 TB / slide (need to check)
    Reads (FASTQ) = 16 GB / slide (plain-text format is probably a bad idea)
    Assembly contigs (FASTA) = 200 MB / slide
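
    Taking those figures at face value, the reductions work out roughly like this (just arithmetic on the numbers above, nothing more):

    ```python
    # Rough per-slide reduction factors, using the figures quoted above.
    images_bytes  = 2e12    # ~2 TB of Illumina images per slide
    reads_bytes   = 16e9    # ~16 GB of FASTQ reads per slide
    contigs_bytes = 200e6   # ~200 MB of assembled contigs per slide

    print(f"images -> reads  : {images_bytes / reads_bytes:7,.0f}x smaller")   # ~125x
    print(f"reads  -> contigs: {reads_bytes / contigs_bytes:7,.0f}x smaller")  # ~80x
    print(f"images -> contigs: {images_bytes / contigs_bytes:7,.0f}x smaller") # ~10,000x
    ```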

    But those two meaningless objects, reads and images, bog everyone down. It’s a big political thing, like ZTR or SRF or SRA, and it helps no one. The submitters make it only to submit it (not as an operational file format), and the archive holds it for almost no one to access.

    A gene expression profile or SNP profile or assembly is pretty small (relative to images/reads) one at a time, and these are the pieces of data biologists are really interested in, right? Why not focus on data structures to maintain and cross-correlate all of these higher-level results in a meaningful, auto-generated, probabilistic way?

    ANALYSIS

    This is the big problem. I don’t think it is ever going to be the case that biology-only end users can make meaningful use of sequencing data. I think education is more the solution here. Even biology-only PhD students need to take a generous amount of math and compbio.

    But I think there is a common sneaky economics problem here: you aren’t going to get the big benefit from teaching current students math/compbio. Rather, I think currently a good number of people who hate math preferentially sort into biology because they expect it to be math-lite. So as curricula change, you will get students with higher math/CS aptitude sorting into biology PhD programs, and THAT will give you the bulk of the benefit. I could pull papers on similar market-sorting examples if you wanted.

    The biggest problem, I think, is cross-optimization across the wet-lab/sequencing/computational domains. And you can’t do that with one person from each discipline; you need interdisciplinary thinkers. I think IDEO has the most organizational research on this. I can ask the business school innovation professors if you really want to know some management journal articles on that from a more empirical/theoretical perspective than the IDEO popular-press stuff.


  6. Just a note to say that I’m soaking all this in, along with the many emails I’ve received in the past 24 hours. Just because I’m not replying doesn’t mean I’m not listening; I agree with all that’s getting said, and it’s all getting digested into my notes for the talk.


  7. I’m coming to next-gen somewhat backward, so I might have a useful perspective; after my PhD in sequencing tech development in the ’90s, I went to industry and wound up running a pharmacogenomics (PGx) business prior to coming to the University of Texas to start a next-gen core. Our charter is explicitly to support analysis and interpretation of the data, so I have been immersed in these issues. Edward Marcotte suggested I pipe in on your forum.

    UT’s researchers are predominantly small single-investigator labs, often specialized on non-model organisms. This community is extremely excited about next-gen (hence my job) but also very under-prepared, as has been acknowledged. UT hosts one of the large NSF-funded supercomputing centers (the largest academic supercomputing cluster in the world at the moment), and my facility works closely with them to use this capability, so unlike most academics we have virtually unlimited compute power and storage space.

    So far we’ve done: whole genome de novo sequencing (bacteria), RNA-seq, smallRNA sequencing, and ChIP-Seq. For all but ChIP-Seq, I shoulder the data analysis mantle at the moment.

    For the field overall and what NHGRI must hear, I strongly agree with Paul’s comments regarding annotation. I supported a microRNA profiling product, and the revisions in miRBase nearly killed us (and our customers) many times. Now that I’m assisting with, for instance, RNA-seq annotations, the circular nature of annotation seems absurd: we base our annotations on other existing annotations, which are fraught with errors or are based themselves on older annotations that may have since changed. If I were basing a significant conclusion on annotation, I would want to re-annotate everything dynamically to ensure nothing’s changed. This is a government-level structural problem that could get out of control quickly.

    The next biggest problem I see with next-gen computational biology in this environment is having the users (and operators) simply understand the components: a) databases (just as spreadsheets became necessary for microarrays, databases will be necessary for virtually every next-gen dataset), b) computational pipeline(s)/algorithms/methods, and c) visualization. I haven’t yet found a community in which I can interact about all three of these areas at the same time, and I can fall out of my depth in any one if I try to go too far.

    Of these three areas, I see less value in NHGRI investment in a) and b) than in c). Lots of databases and schemas exist and will work, and there are already too many algorithms. Incremental improvement should be encouraged in both, but the lack of standardization (both I/O standards and good software engineering concepts, as David points out) makes it extremely difficult to plug-and-play different solutions. I attempted to benchmark a half dozen short-read aligners, for instance; most worked “out of the box,” but then it took 1-3 days to re-parse output for comparison and/or visualization, so I stopped after testing three. We’re still testing the impact the aligner has on sensitivity & specificity of RNA-seq (using the MAQC reference samples), while also trying different ways to make a suitable “reference”.

    Visualization is what seems to be most lacking. I think there may be some commercial solutions coming, but the classic browsers (UCSC, Ensembl, Apollo) are very limited since they are just *browsers*. Stand-alone tools like Circos look powerful, but I haven’t found time to write the pipes, and am daunted knowing that it won’t solve all my problems, so I’ll have to get ANOTHER viz solution, write and validate ANOTHER set of pipes, etc., etc.

    Other observations I hope might be helpful to you (I’m trying to see how to leverage these):
    1) Via caBIG, NCI appears to be systematically solving a lot of these issues: ontologies/grammars, interface standards, certifications, etc. If we’re all heading toward clinical applications, they may have these issues solved well ahead of the next-gen community. But as with any heavy-infrastructure investment like caBIG, it will come with all the weight of that standardization.
    2) Many of these generic issues are old hat to pharma. Almost all our business in my PGx service was with big pharma, which was really starting to apply molecular technologies in Phase I or II studies. This meant we were working directly with the clinical data management teams to upload results from multi-site/multi-arm trials. The interface forms and formats were fairly standardized across the industry because they all outsourced their trial work to Clinical Research Organizations (CROs), so to be competitive the industry had to create some suitable standards. And each trial had to coordinate hundreds of different data types, hundreds to thousands of samples, and make sense of it all while protecting patient/subject confidentiality.

    Education and training are critical no matter what. But as I’ve struggled with how to push knowledge of the tools and methods out to a diverse user base, I inevitably have to pull scope back to simply teaching good Linux one-liners and basic Perl, because anything more specific than this is likely to change in just a few months. Best practices would absolutely help here.
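
    For concreteness, this is roughly the level I mean: a ten-line script (Python here, but Perl or awk would do just as well) that counts reads and summarizes read lengths in an uncompressed FASTQ file supplied on the command line.

    ```python
    # Basic scripting example: count reads and summarize read lengths in an
    # uncompressed FASTQ file (four lines per record).
    import sys

    n, total, shortest, longest = 0, 0, None, 0
    with open(sys.argv[1]) as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:                      # the sequence line of each record
                length = len(line.strip())
                n += 1
                total += length
                shortest = length if shortest is None else min(shortest, length)
                longest = max(longest, length)

    print(f"{n} reads, mean length {total / n:.1f}, range {shortest}-{longest}")
    ```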


  8. Regarding data storage, transmission & analysis: the compbio infrastructure is five years behind the average teenager at this point, in that we are making minimal (if any) use of scalable parallel compute technologies like peer-to-peer data exchange or cloud computing, despite the patent fact that these are now mainstream commodity technologies.

    In my view (and I say this as a self-admitted iron fetishist with my own 50-CPU cluster) it is, or will soon become, ridiculous to ask federal agencies to keep forking out money for isolated server racks here and there, with all associated costs (cooling/networking/backup/UPS/sysadmins/colocation/rental/overhead/upgrades), as the economies of scale of cloud computing continue to grow.

    Much of current compbio algorithms research is also going to look very silly and pointless when you get to the point of being able to rent 10,000 CPUs for an hour.

    On the other hand, the software (and even algorithmic/theory) infrastructure for doing bioinformatics on clouds is just not there yet.

    Like I said above, there has been a steady growth in adoption of cloud computing in the commercial consumer sector (see all the new tech startups using Amazon EC2 and S3, such as getdropbox.com for example, and of course all the Google products).

    The same goes for peer-to-peer. Most high-volume software distributors (e.g. games companies) distribute updates by P2P nowadays. Why are we not using this more in bioinformatics? It may seem less urgent than cloud computing, but it is potentially related (shifting data into & out of the cloud [mostly into] is a limiting step).

    Therefore, this is quite clearly and demonstrably an area where private sector investment is doing a terrible job (maybe because nextgen sequencing manufacturers have reduced incentive to provide core infrastructure tech that could benefit competitors?) and one where Bethesda could have a major impact.


  9. BTW, a few technologies that are important for the cloud include: map-reduce infrastructure & map-reduce versions of bioinformatics algorithms, peer-to-peer standards, open cloud/virtualization standards (so ISPs can compete with Amazon & Google), and most of the usual order-from-chaos bioinformatics technologies (ontologies, distributed databases, standard formats, etc.).
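
    To illustrate, the simplest map-reduce version of a bioinformatics algorithm is something like k-mer counting split into map and reduce steps. Plain Python stands in for a real framework such as Hadoop streaming, and the sequences and k below are placeholders.

    ```python
    # K-mer counting in map/reduce style. Plain Python stands in for a real
    # MapReduce framework; each sequence could be mapped on a different node.
    from collections import Counter
    from itertools import chain

    def map_kmers(seq, k=8):
        """Map step: emit (kmer, 1) pairs for one sequence."""
        return [(seq[i:i + k], 1) for i in range(len(seq) - k + 1)]

    def reduce_counts(pairs):
        """Reduce step: sum the counts for each k-mer key."""
        counts = Counter()
        for kmer, n in pairs:
            counts[kmer] += n
        return counts

    sequences = ["ACGTACGTGGCA", "TTACGTACGTAA"]          # placeholder reads
    counts = reduce_counts(chain.from_iterable(map_kmers(s) for s in sequences))
    print(counts.most_common(3))
    ```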

    Incidentally, I would also like to see more work on probabilistic model-based DNA compression — surely HMMs + arithmetic coding can do better than bzip…
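
    As a toy illustration of the idea: score a sequence under an order-2 context model and compare the bits/base to the naive 2 bits/base baseline. The counts here are fit on the same sequence (so this is an optimistic bound), the sequence is a placeholder, and a real compressor would drive an arithmetic coder with adaptively estimated probabilities.

    ```python
    # Toy estimate of bits/base under an order-2 context (Markov) model, compared
    # to the naive 2 bits/base. Counts are fit on the same sequence, so this is an
    # optimistic bound; a real compressor would couple such a model to an
    # arithmetic coder with adaptive probability estimates.
    import math
    from collections import Counter, defaultdict

    def order_k_bits_per_base(seq, k=2):
        context_counts = defaultdict(Counter)
        for i in range(k, len(seq)):
            context_counts[seq[i - k:i]][seq[i]] += 1
        bits = 0.0
        for i in range(k, len(seq)):
            ctx, base = seq[i - k:i], seq[i]
            p = context_counts[ctx][base] / sum(context_counts[ctx].values())
            bits -= math.log2(p)
        return bits / (len(seq) - k)

    seq = "ACGT" * 50 + "AAAAACCCCC" * 10    # placeholder, deliberately repetitive
    print(f"order-2 model: {order_k_bits_per_base(seq):.2f} bits/base (naive: 2.00)")
    ```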


  10. Sean, I think all of your points are spot on! There are currently huge amounts of data being generated by the current generation of sequencing technologies, and there are further techniques under development that are going to increase the rate of increase in sequence data even further. The Trace servers provided by the EBI and NCBI are the main repositories for the primary sequence data. These already represent some of the largest databases in the world and are pushing the frontiers of technology. The question that comes to mind is: can they cope with storing all of this data now, let alone with future demand? Ian is right about the lack of peer-to-peer and cloud computing in bioinformatics. However, I would say that the latter is becoming more commonplace. I have heard via word of mouth (and that is all) that Amazon is very keen to enter the bioscience field and is prepared to house some public datasets free of charge.

    There are already sequence datasets that are available but are not making it into ‘standard’ public repositories. One example comes from the field of metagenomics, where the current sequencing technologies are being used to their full potential to generate vast quantities of data. CAMERA (http://camera.calit2.net/) is currently acting as a repository for much of the metagenomics data. However, the sequence data are not in sequence repositories such as EMBL or UniProt. CAMERA currently has 28 million ORF/protein predictions, a number that dwarfs the size of UniProt. However, as this data is not in mainstream sequence repositories, the sequences are not being routinely analysed as databases such as Pfam update.

    Regarding moving the data around, I have heard interesting/good things about Jim Kent’s BigWig data format. As far as I know, with the BigWig format there is an indexed data file that can be exposed simply via HTTP (which even small labs should be able to achieve with little informatics knowledge). It then takes advantage of a little-known feature of HTTP that allows one to fetch a slice of the data. I am not sure how well the current DAS model will scale. At a recent meeting, the DAS community seemed to be in general agreement that bottlenecks are already beginning to appear regarding data exchange. The solution will require something along the lines of the BigWig file format and a sensible way of defining different levels of data granularity.
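
    The HTTP feature in question is the byte-range request, which any ordinary web server can support. A minimal sketch (placeholder URL and byte offsets) of fetching just a slice of a remote indexed file:

    ```python
    # Fetch only a byte slice of a remote indexed file via an HTTP Range request,
    # the mechanism that BigWig-style indexed formats rely on.
    # The URL and byte offsets below are placeholders.
    import urllib.request

    url = "http://example.org/data/track.bigwig"
    start, end = 1024, 4095                      # inclusive byte range to fetch

    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        chunk = resp.read()
        # A range-aware server answers 206 Partial Content with just the slice.
        print(resp.status, len(chunk), "bytes received")
    ```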

    One thing that I believe comes into the analysis of data, and that has not really been touched upon so far, is the visualisation of the data. The typical model for accessing a database is to have some form of web interface. We (informaticians) need to get smarter about how we think of data presentation and guide users through the data, and we may have to come to terms with the fact that a web browser may not necessarily be the best tool for the job.


  11. Hi Sean,
    I enjoyed reading your invitation and all of the comments above. Hopefully, I will add some value to the discussion.

    One general comment that was not on your list relates to the mindset of scientists and funding agencies. What I have seen from my vantage point is that scientists (particularly those reviewing funding proposals) haven’t figured out that as sequencing costs drop, a much greater proportion of funding $ needs to be spent on IT, informatics, and, for that matter, thinking. The community is likely to become captive to this capacity: determined to spew out tons of data that exceeds not only the technical capacities of the repositories, networks, servers, etc., but also the intellectual capacity of the collective scientific community. I think the large-scale sequencing programs should be rethought to focus a much larger portion of their $ on thinking, analysis tools, data management, etc., and a shrinking proportion on actual generation of sequence data.

    Data quality: This has always been an issue. The public repositories may police the syntactic elements of data submissions, but they don’t police data quality. For sequence data this leads to a fairly high level of unusable sequence data in the repositories. What is worse, for “analyzed” data, such as genomic annotations in GenBank, low-quality annotations get propagated for years. Others above have commented on the difficulty of correcting the errors present in the original submissions. I see the problem somewhat differently: there is an abundance of islands of higher-quality annotation that don’t share their annotation back to the public repositories, thus reducing its impact. We are in the midst of a successful collaboration with NCBI to update the annotation of our data in GenBank, and to improve the quality of the annotation for non-JCVI genomes in RefSeq. Ensuring that the public, archival repositories have the highest quality annotation seems like an important goal.

    Data Types: Just wanted to respond to Rob Finn’s comments regarding metagenomic ORFs/proteins. NCBI is working on a repository for annotated protein fragments; I’m not sure when it will go live. UniProt certainly picks up published proteins from the Global Ocean Sampling expedition (those that appear in CAMERA), though I’m not sure what they do with them. Since NIH/NCBI are naturally focused on human-centric data, I’m confident that the NCBI folks will address this gap in the near term, because the HMP will be generating large volumes of similar data.

    Data Storage/Communication/Analysis: I’m not an IT technology expert, so I will pass on the ten-year predictions. However, assuming that for the near term the dramatic growth spurt in sequencing technologies will swamp the rate of improvement in IT technologies, I’m tempted to extrapolate from current practice. We currently don’t use cloud computing because of the $/bandwidth required to move data into the cloud, and the lack of cloud-enabled tools. There are also HUGE gains possible through better utilization of idle capacity, and better utilization of the capabilities of existing CPUs (e.g., utilizing the SSE instruction set). Then again, we are not a typical shop, since we already have the infrastructure/staff to support a sizable computational grid. The thing that boggles my mind is that the funding agencies continue to pay to install horribly underutilized compute facilities at their fundees’ sites, rather than putting in place *usable* large-scale compute services. TeraGrid has not (yet) achieved this goal. By usable, I mean usable without software engineers. As far as I’ve been able to tell, the only public grid resource that can support a BLAST of a 454-Titanium run against NCBI NR is the CAMERA compute cluster. I think it would be wise to make investments to make large-scale computational workflows available from researchers’ desktops through funder-supported compute clouds, assuming that the communications bandwidth issues can be overcome through existing (e.g., P2P) or new technologies.


  12. Pingback: The Rfam decimal release is out! « Xfam Blog
