I've been thinking about how we do "bioinformatics" in experimental biology, and I had the opportunity to talk about it recently. The following is the transcript of a keynote address I gave at Janelia's meeting on High Throughput Sequencing for Neuroscience last weekend.
It's a privilege to be speaking to you tonight to open this meeting on "High Throughput Sequencing for Neuroscience". I don't mean that lightly. It really is an unusual opportunity, and I'll explain why.
When the organizers invited me to give this keynote address, at first I declined. I'm not the best one to be speaking about the science my group is doing in genomics and neuroscience. That work is being driven by two independent scientists in my group, Lee Henry and Fred Davis. Lee will talk tomorrow, and Fred has a poster you should look for.
Lee, Fred, and I are part of a team at Janelia called "NeuroSeq", which has various aims loosely around the application of high throughput sequencing to problems in neuroscience. You're going to hear more about Janelia's NeuroSeq projects not just from Lee, but also from two other NeuroSeq team members, Sacha Nelson and Ken Sugino, and also from some of Lee and Fred's collaborators including Joe Ecker and Alisa Mo.
Here at Janelia, part of our philosophy is that our PIs work hands-on, in small groups, in a sort of old school way modeled on places like the MRC Laboratory of Molecular Biology in Cambridge, England. I was a postdoc at the LMB, and one thing I learned there from watching people like John Sulston, John White, and Jonathan Hodgkin work is that I like to hear people talk about science that they're actually doing with their own hands. The science that I'm doing with my own hands these days happens to be on a computational biology problem that's not directly relevant to the topic of this meeting, so I don't want to talk in detail about that, and I want to leave Lee and Fred to tell you about their own work. So while it was an honor to be asked to talk tonight, at first I wondered whether I had anything to say to you.
Then I saw the title of Oliver Hobert's talk, the first talk on our program tomorrow. If you haven't seen it yet, Oliver's talk is titled "Do we need deep sequencing to define and study neuronal diversity?" I'm guessing his answer to that question is no: no, we don't need all this newfangled hype and glitz to do insightful science.
In fact, Oliver's been trash talking computational biology and genomics for some time now, and he's been getting under my skin, in part because his work has been so damned successful. I realized that tonight is the first time I've ever had the opportunity to speak before Oliver. I decided I should make the most of it.
The answer to Oliver's question is yes: yes, high throughput sequencing is going to be very important for neuroscience. It already is. We don't need it, but it's crazy not to use it.
But even though I think Oliver and I would give opposite answers to his question, I suspect that actually, he and I entirely agree on the deeper, substantive issues that questions like his raise. It's those issues that I want to discuss with you tonight.
low input, high throughput, no output
Sydney Brenner has been going around criticizing a particular style of doing genomics research. Sydney has been calling it "low input, high throughput, no output" science.
Sydney's absolutely right that some people are overenamored with using genomics for genomics' sake. I saw a thread this week on the Web on Reddit's bioinformatics question and answers board, where a PhD student asked:
*"[My] boss seems to think we can just send some of our material (tissue samples in this case) for RNA-seq and a Nature paper with our names on it will automatically be sent back. He has yet to determine a hypothesis, what controls to use, what analyses to perform and so on. He seems to think we can just save the samples and see if anything is upregulated or down regulated."*
I think that it's a healthy disdain for "low input, high throughput, no output" science that's behind Oliver's question.
sequencing is just a tool
OK, but look. Sequencing technology is just a tool. Back in the antediluvian age, we had these things called gels. An agarose or acrylamide gel has the wonderful property of separating macromolecules by their size. Once you start thinking about what you can do with size separation, you could come up with all sorts of simple and powerful ideas for different assays. People used size separation on gels to develop an enormous range of different creative applications. We had Southerns, and Northerns, and Westerns; we had gel-shifts for detecting specific DNA-protein binding, and so on. Microarray technology in genomics developed from hybridization slot blots, which developed from Northerns. DNA sequencing itself, by the Sanger or the Maxam-Gilbert methods, was originally enabled by size separation on gels.
Even though this technology for size selection was so endlessly useful, with all these myriad clever variations, I don't think we had meetings called "Gels for neuroscience". I don't think we had talks titled "Do we need gels?" It's always been obvious that gels are just a tool. If someone titled a talk about "do we need to run gels", we'd be confused; the answer would be, well, no, I suppose we don't need to, but why would you want to tie one hand behind your back?
Yet here we are at a meeting entitled "High throughput sequencing for neuroscience". And I'm not making fun of it. There's a lot of great people here, and we're going to hear a bunch of great talks and see a bunch of great posters. Obviously this is a hot and important topic or you wouldn't have come. My point is that there's something special about high throughput sequencing that we're worried about, something unlike how we think about gels.
Why isn't high throughput sequencing just another technique? Why don't we talk about high throughput sequencing like we talk about gels, or PCR, or microscopes?
I think there's three main reasons. One is that we have conflated high-throughput sequencing with "big science". The second is that we've conflated high-throughput sequencing with "discovery-based science", as opposed to hypothesis driven science. The third is that sequencing generates "big data", and we think we need to outsource our data analysis to the specialized skills of bioinformaticians. All three ideas are wrong. They are just historical and cultural artifacts in the biology community.
sequencing is not big science
Let me start with "big science".
Clearly the human genome project was a landmark in science and in DNA sequencing technology. The genome project has colored the way we think about high throughput sequencing. But there are several things about the genome project that made it unique.
The human genome project was not an experiment. The goal of the human genome project was essentially to produce a map: a long-lived piece of stable infrastructure that many people in the community would use over and over again.
As a map, it was not driven by a question. Questions take you wherever they lead you, often in unexpected directions. In contrast, from its inception, the human genome project had a well-defined, closed-ended goal: to obtain the complete 3.2 billion base pair sequence of the human genome. We could make reasonable plans for completing the genome, and for parceling out the work; it was just a matter of production capacities. You don't project completion dates and make Gantt charts when you're asking a biological question.
We did the genome project at the earliest moment that the community-wide cost/benefit calculation made sense: when the cost of doing it once and doing it well looked like it would be less than the combined cost of all the labs that were sequencing small clones in piecemeal fashion. This meant that it was just barely technologically feasible, it was expensive, and it required a large, organized, and collaborative effort, both in the public project organized by the National Institutes of Health, the Wellcome Trust, and the Department of Energy, and in the private project at Celera Genomics.
We didn't really need to do any positive or negative controls in the human genome sequence. Instead, we had to do quality control and quality assurance testing. The genome was an engineering feat, not so much a scientific one.
The human genome project is not the only biology "big science" project worth doing, especially when we're talking about map-like resources that we want to generate once and well. There are other things we can think of where it's worth the time, money, and labor generating a big map once, if you can identify a closed-ended resource that lots of people will want to use. The Allen Brain Atlas is one example. Connectomes are another.
But I cannot think of a single example of a big science experiment in biology, as opposed to generating a map-like, infrastructural resource. The closest thing to it might be a large clinical trial.
When we're using high-throughput sequencing, especially in the sorts of science that we're going to hear at this meeting, we're not doing a genome project, but I think we tend to model our work in terms of what made sense for a genome project. Should we release the raw data early, before we've analyzed it? Should we build a web site to distribute the data? Should we have visualization tools like the genome browser? Should we gather a lot of data systematically, then have a stable of captive bioinformaticians stare at it for a while to see if anything in it looks interesting?
sequencing is an assay
For a lot of the science we're doing, we should think about high throughput sequencing as just another assay we use. Now that the cost of sequencing has dropped so much, we are using sequencing more like we use gels.
Maybe everyone in this room already knows this. The talks we're going to hear at this meeting are all about using sequencing technology in a myriad of different creative ways to address hypothesis-driven questions in neuroscience. Still, let me elaborate the point a bit more.
It's true that there is an important component of "discovery-based" stuff that we can and should do with genomics. It's important to survey new territories systematically and know the full scope of what we're dealing with. It's important to do complete genome sequences, or cell-type-specific transcriptome sequences, or metagenomic surveys, to get our arms around the diversity of nature that we need to think about.
But now there's a rapidly increasing number of ways that we can use sequencing as an assay in an experiment. An RNA-seq experiment looking at differential gene expression is really "just" a high resolution, highly parallel Northern blot. A ChIP-seq experiment is a high resolution, highly parallel gel shift assay. A genome sequence of a new mutant is a substitute for doing genetic mapping crosses.
It's amazing to watch as essentially every molecular biology assay that we used to do on a small scale is getting converted to a sequence-based readout, and thus scaled up to more systematic coverage at higher resolution. Obviously you can sequence any naturally occurring DNA or RNA sample, to get genomes and transcriptomes. Once you can do that on bulk tissue or cultured cells, you start developing tricks to get spatiotemporal resolution in your samples, such as technologies for getting cell-type-specific transcriptomes like TRAP from Myriam Heiman and Nat Heintz or INTACT from both Lee Henry and Steve Henikoff's lab, or subcellularly localized RNAs, like we'll hear from Erin Schuman. Once you start thinking of ways to pull down DNA or RNA associated with other stuff you can pull down, you get all the myriad variations on ChIP-seq and CLIP and RIP-seq. Once you start thinking of ways to convert all the old footprinting and protection assays to sequencing-based readouts, you get chromatin accessibility mapping, and you get genome-wide RNA secondary structure prediction from chemical and enzymatic protection assays. When you start thinking of ways to detect base modifications, you get methyl-seq for detecting CpG DNA methylation, and ways of mapping site-specific RNA modifications like 2'-O-methylation and pseudouridylation. When you start thinking of ways to fragment nucleic acids and religate them in situ, you get proximity assays like chromatin capture and ChIA-PET. Lior Pachter keeps a list of different *Seq technologies that's now over 100 long, I think.
You're not even limited to studying naturally occurring nucleic acids. We're starting to see synthetic biology tricks for generating DNA and RNA barcodes, as a way to combinatorially label complex biological mixtures. If there's a way to tag something with a uniquely generated DNA sequence, like a protein complex or a cell or a synapse, you can use sequencing as a systematic readout of that something. For example, maybe it's possible to use recombination between synthetic DNA barcoded neurons to systematically detect proximity between things, or connectivity between cells, as in a connectome, and Tony Zador and Josh Dubnau might talk about some of these ideas at this meeting. Maybe it's even possible to make synthetic Rube Goldberg devices in cells that transduce arbitrary signals, like action potentials, into DNA sequence, like little ticker tape machines, as proposed by George Church.
The biological questions that we're asking with these technologies are just as manifold. Sequencing has moved a long way from the model of a genome center generating a single stable resource. Sequencing is almost a routine assay. Not really even an assay in itself; high-throughput sequencing is a readout modality, at the back end of scores of quite different biological assays, much like gels are.
biologists need to learn to do their own data analysis
I say "almost" a routine assay modality, because there is one major hurdle that we still need to overcome. I think it's the main reason that we don't think of high throughput sequencing like we think of other methods like gels or microscopes. We are not confident in our ability to analyze our own data. Biology is struggling to deal with the volume and complexity of data that sequencing generates. So far our solution has been to outsource our analysis to bioinformaticians.
If we were talking about a well-defined resource like a genome sequence, where the problem is an engineering problem, I'm fine with outsourcing or building skilled teams of bioinformaticians. But if you're a biologist pursuing a hypothesis-driven biological problem, and you're using a sequencing-based assay to ask part of your question, generically expecting a bioinformatician in your sequencing core to analyze your data is like handing all your gels over to some guy in the basement who uses a ruler and a lightbox really well.
Data analysis is not generic. To analyze data from a biological assay, you have to understand the question you're asking, you have to understand the assay itself, and you have to have enough intuition to anticipate problems, recognize interesting anomalies, and design appropriate controls. If we were talking about gels, this would be obvious. You don't analyze Northerns the same way you analyze Westerns, and you wouldn't hand both your Westerns and your Northerns over to the generic gel-analyzing person with her ruler in the basement. But somehow this is what many people seem to want to do with bioinformaticians and sequence data.
It is true that sequencing generates a lot of data, and it is currently true that the skills needed to do sequencing data analysis are specialized and in short supply. What I want to tell you, though, is that those data analysis skills are easily acquired by biologists, that they must be acquired by biologists, and that they will be. We need to rethink how we're doing bioinformatics.
scripting is a fundamental lab skill, like pipetting
The most important thing I want you to take away from this talk tonight is that writing scripts in Perl or Python is both essential and easy, like learning to pipette. Writing a script is not software programming. To write scripts, you do not need to take courses in computer science or computer engineering. Any biologist can write a Perl script. A Perl or Python script is not much different from writing a protocol for yourself. The way you get started is that someone gives you one that works, and you learn how to tweak it to do more of what you need. After a while, you'll find you're writing your own scripts from scratch. If you aren't already competent in Perl and you're dealing with sequence data, you really need to take a couple of hours and write your first Perl script.
The thing is, large-scale biological data are almost as complicated as the organism itself. Asking a question of a large, complex dataset is just like doing an experiment. You can't see the whole dataset at once; you have to ask an insightful question of it, and get back one narrow view at a time. Asking insightful questions of data takes just as much time, and just as much intuition as asking insightful questions of the living system. You need to think about what you're asking, and you need to think about what you're going to do for positive and negative controls. Your script is the experimental protocol. The amount of thought you will find yourself putting into many different experiments and controls vastly outweighs the time it will take you to learn to write Perl.
I'll give you a specific example. I'm going to tell you just one thing worth getting started with, and one key algorithm to do it well. If you learned to implement it in Perl -- and you could do this in an afternoon, with a few lines of Perl code -- I think you would find yourself endowed with a superpower, like Wonder Woman with her golden lasso of truth, and it's a superpower that a biologist can use with surprising effectiveness on large data sets.
The idea is simple: random sampling. If someone gives you a dataset of a zillion things, whether they're Illumina reads or lines of tabular data on someone's unreadably large summary spreadsheet, and whether it's your results or data from someone's paper that you're reviewing, pretty much the first thing you should do is to take a small random sample of those zillion things and actually *look* at them. If you look at a random sample of 10 things and 9 of them are an artifact, then you've just discovered an artifact that accounts for 90% of the entire dataset. The reason that this is a particularly powerful thing for a biologist to know how to do, as opposed to a computer scientist or a bioinformatician, is that a biologist has the intuition to look in detail at a handful of examples and know if they make sense or not.
You don't want to just take the first ten lines of a big data file, because the file's likely in some sort of order and you'll probably get a biased sample, like maybe the left telomere of the first chromosome. So you need to know how to randomly sample the file. There's lots of ways to do this crudely without thinking about it. But if you want to start learning the craftsmanship of what a simple correct algorithm can do, here's your homework assignment: the algorithm for taking a uniform random sample of k things from a big dataset of N things, in a single pass across the dataset without having to keep more than k things in memory at a time, is called "reservoir sampling". You can look it up in Wikipedia, and you can (I promise you) learn to code it in an hour or so. And once you learn to do it, you're going to find yourself doing it routinely on every big dataset you have to look at. You'll find yourself surprising people with your intuition for your data, because people don't think we can just look at big data sets, they think we need fancy statistics and visualization, and I think they often miss artifacts that would be obvious to a biologist looking at individual data elements.
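For readers who want a head start on that homework, here is one way reservoir sampling might look. It is a minimal sketch in Python rather than Perl, following the textbook version usually called Algorithm R; the function name `reservoir_sample` and its parameters are made up for illustration and were not part of the talk.

```python
import random

def reservoir_sample(items, k, rng=None):
    """Return a uniform random sample of up to k items from any iterable.

    Makes a single pass and never holds more than k items in memory.
    (Hypothetical helper name; a sketch of the textbook Algorithm R.)
    """
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, item in enumerate(items):
        if i < k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Pick a slot from 0..i inclusive; the new item is kept
            # with probability k / (i + 1), replacing a random old one.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Because an open file is an iterable of lines, passing a file handle and `k=10` would give ten random lines from a file of any size, which you can then read by eye. The sample comes back in reservoir order, not file order.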
Once you've started using scripts to look at your data in different ways, you're doing experiments on your data, and you'll naturally start thinking about control experiments, and you will realize something very powerful: you can design control experiments that double-check whether all those textbook statistical tests you've been doing actually make any sense. Computers are way faster now than when most statistical tests were developed. You can write a protocol - another script - to generate synthetic datasets according to what you think the negative and positive controls should be, then analyze your synthetic control data. Controls force you to make your hypotheses explicit. If you want to know how many false positives you expect in your computational query, you simply generate negative control data and count.
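As an illustration of "generate negative control data and count", here is a small sketch in Python. Everything specific in it is an assumption made up for the example: the toy analysis (`call_hits`, which flags a gene when two group means differ by more than a threshold), the Gaussian null model, and the numbers of genes and replicates. The point is only the shape of the experiment: build data where you know nothing is going on, run the same analysis on it, and count what comes out.

```python
import random
import statistics

def call_hits(table, threshold):
    """Toy analysis (hypothetical): flag rows whose two group means
    differ by more than `threshold`."""
    hits = []
    for name, group_a, group_b in table:
        diff = statistics.mean(group_a) - statistics.mean(group_b)
        if abs(diff) > threshold:
            hits.append(name)
    return hits

def negative_control(n_genes, n_reps, rng, mean=10.0, sd=1.0):
    """Synthetic data with no real effect: both groups of every gene
    are drawn from the same Gaussian (an assumed null model)."""
    table = []
    for g in range(n_genes):
        group_a = [rng.gauss(mean, sd) for _ in range(n_reps)]
        group_b = [rng.gauss(mean, sd) for _ in range(n_reps)]
        table.append((f"gene{g}", group_a, group_b))
    return table

rng = random.Random(42)  # fixed seed so the control is repeatable
control = negative_control(n_genes=10000, n_reps=3, rng=rng)
false_positives = call_hits(control, threshold=1.5)
print(f"{len(false_positives)} of {len(control)} null genes called as hits")
```

Every hit in this output is a false positive by construction, so the count is a direct estimate of how often the analysis cries wolf under this null model. A positive control would work the same way, with a known effect added to some of the genes.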
Because there's so many different experiments that you want to do on your data, you cannot expect to have a simple web page or big red button to push, any more than we expect there to be a kit we can buy for every wet lab experiment we do. Expecting someone else to build you a software system that's "user-friendly" so we can avoid writing scripts or using the command line to analyze our data is like expecting someone to build you a robot for every molecular biology experiment you do, so you can avoid having to pick up a pipette.
Mind you, there's nothing wrong with kits. Once there's something that a lot of people are doing over and over again, it's nice to package it up into a well-tested kit. Bioinformaticists should be more in the business of making the computational equivalent of kits -- robust, well tested pipelines and scripts that biologists can mix and match in their experiments. But biologists should be fully in charge of how those kits are used in experiments and controls, whether we're talking about wet or computational experiments.
I can push this analogy to pipetting and kits a step further. If biologists did their own computational data analysis experiments using scripts like pipettes, and bioinformaticists wrote robust pipelines and components for those analyses that biologists used like we use kits, then computational biology research groups like mine are in the business of enzyme engineering. The stuff that I do, and my computational group does, is high-performance computing, algorithm design, probabilistic inference, and software engineering. We aim to identify precise and general analysis questions where there's a very well defined data input, and a well-defined analysis output, like a substrate-product relationship, and we try to write code that achieves that transformation with high fidelity and great kinetics. Bioinformaticists then use our software in their pipelines and their "kits", and biologists end up using our work indirectly, via these other bioinformatics layers. This part of computational biology, the algorithms and mathematical inference part, is another important part of our data ecosystem that biology is somewhat struggling with, but for different reasons, and that's a topic for another time.
I think I can sum up my main point like this: if you find yourself thinking about hiring a "bioinformatics postdoc" to analyze sequence data in your group, please rethink. You wouldn't hire a postdoc to measure all your gel bands with rulers, I hope. If you're building a big genome-like resource, or if you're doing computational methods development building tools that will be used by many biologists in the community, maybe you do need dedicated bioinformatics people. But in almost all cases, if you're a neuroscientist trying to ask neuroscience questions, I think you are going to be far better off finding people who are motivated by your particular biological question, and expecting your people to learn how to analyze their own data, with well-developed biological intuition for the details of the questions they're asking, the assays they're doing, and what might go wrong in their experiments. Writing scripts and using the command line might look daunting because it's not yet a familiar way to do experiments in biology, but it's way easier to learn than you might think, and it's going to have to become as routine as pipetting, so we need to get started.