Biological Data Analysis

I’m starting to plan a new Harvard course that’ll be called Biological Data Analysis. Biology is going through a culture change. It’s suddenly become a data-rich, computational analysis-heavy science. Are we going to outsource data analysis to bioinformaticians and data science specialists, or are biologists going to analyze their own data? There are always advantages to specialization, and we need bioinformatics and data science. But I also feel that we are dangerously weak in training biologists to think about their own data. The usual response I get when I talk about it is something like “you can’t expect wet lab biologists to learn how to program”.

What I want to teach in Biological Data Analysis is that writing scripts and using the command line for data analysis is not software engineering, it’s just a simple and essential thing that a wet lab biologist can do, and needs to do. I’m going to teach from the point of view that biologists already have a special advantage in large-scale data analysis: we are trained to expect that we will be screwed by our experiments. We should be treating data analysis the same way. Like doing an experiment on a complicated organism, any given data analysis only gives you a narrow glimpse into a large data set. God only knows what else is going on in the data that you’re not seeing. Like doing experiments, you need to design positive and negative controls to protect yourself from the hundred different ways that nature (and computers) are going to mess with you. Writing scripts that generate positive and negative controls for a data analysis is a powerful and biologically motivated thing to know how to do.
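To make the idea of computational controls concrete, here is a minimal sketch of the kind of script I have in mind. Everything in it is an assumption made up for illustration: the Gaussian noise model, the group size, the threshold, and the toy `calls_effect` function standing in for whatever analysis you actually want to check. The negative control is simulated data with no effect in it; the positive control is simulated data with a known effect spiked in.

```python
import random

def simulate_dataset(rng, n=20, effect=0.0, noise_sd=1.0):
    """Simulate two groups of n measurements (hypothetical Gaussian noise model).

    `effect` is the true shift added to group B: 0.0 gives a negative
    control, a nonzero value gives a positive control.
    """
    group_a = [rng.gauss(0.0, noise_sd) for _ in range(n)]
    group_b = [rng.gauss(effect, noise_sd) for _ in range(n)]
    return group_a, group_b

def calls_effect(group_a, group_b, threshold=0.75):
    """Toy stand-in for the analysis under test.

    It calls an effect when the difference in group means exceeds a threshold.
    """
    shift = sum(group_b) / len(group_b) - sum(group_a) / len(group_a)
    return shift > threshold

def call_rate(effect, n_trials=2000, seed=1):
    """Fraction of simulated data sets in which the analysis calls an effect."""
    rng = random.Random(seed)
    calls = sum(
        calls_effect(*simulate_dataset(rng, effect=effect))
        for _ in range(n_trials)
    )
    return calls / n_trials

# Negative control: no effect in the data, so calls here are false positives.
false_positive_rate = call_rate(effect=0.0)
# Positive control: a known effect is spiked in, so calls here are detections.
detection_rate = call_rate(effect=1.5)

print(f"negative control call rate: {false_positive_rate:.3f}")
print(f"positive control call rate: {detection_rate:.3f}")
```

If the negative control call rate is high, the analysis is seeing things that aren't there. If the positive control call rate is low, it would miss a real effect of that size. Either way, you've learned something about the analysis before trusting it on real data.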

Once you’re generating negative control data (“here’s what the data would look like if there were no effect to be found”), you’re actually doing statistics, but in an intuitive and motivated way that any biologist can understand. Instead of learning a bunch of incantations and lore about t-tests, you’re forced to think directly about what your null hypothesis is, because you have to make a negative control data set according to that null hypothesis. The “p-value” is directly the probability that you observe a signal at least as strong as your real one in your negative control. This style of simulation-driven analysis is enabled by modern computational power plus the ability to write simple scripts and use the command line. You don’t have to learn statistics per se. You have to learn how to do computational control experiments. I expect that if a biologist learns the simulation-driven style of analysis first, then they’re motivated to go on to learn more serious statistical analysis as they need it… and they’re armed with a powerful way to check analytic results against intuitive simulations.
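As one possible illustration of a p-value from a negative control, here is a sketch of a label-shuffling (permutation) test. This is one way to build a negative control, not the only one. The measurements are made-up numbers, and the null hypothesis assumed here is that the "treated" and "control" labels are interchangeable. The add-one correction at the end is a common convention that keeps the estimate from ever being exactly zero.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(treated, control, n_shuffles=10000, seed=0):
    """Estimate a one-sided p-value by shuffling group labels.

    Each shuffle is one negative control data set: the same numbers,
    with any real link between label and measurement destroyed.
    """
    rng = random.Random(seed)
    observed = mean(treated) - mean(control)
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        shuffled_diff = mean(pooled[:n_treated]) - mean(pooled[n_treated:])
        # Count negative controls showing a signal at least as strong as the real one.
        if shuffled_diff >= observed:
            hits += 1
    # Add-one correction: the estimate is never exactly zero.
    return (hits + 1) / (n_shuffles + 1)

# Made-up example measurements, for illustration only.
treated = [5.1, 6.3, 5.8, 6.9, 6.1, 5.7]
control = [4.2, 4.9, 5.0, 4.4, 5.3, 4.6]

p = permutation_p_value(treated, control)
print(f"estimated p-value: {p:.4f}")
```

Nothing here requires knowing which named test applies. The null hypothesis is spelled out in the code, in the line that shuffles the labels.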

If this makes any sense to you, and if you happen to be a Harvard PhD student graduating this year, and you’re thinking it might be nice to take a year and do some teaching… boy, do I have a deal for you. Harvard has a thing called the College Fellows Program. This is a one-year position (renewable for one more) that focuses on teaching and course development. We’ve just posted an ad looking for a College Fellow to help me develop and teach Biological Data Analysis. Application deadline is April 15. Feel free to contact me directly with questions!


Open software/web engineer positions

The lab’s growing – the new space opened in November – and we’re starting to get settled in. Now we’re looking for a scientific software engineer and a web portal engineer to join the group.

The scientific software engineer will play key roles in the HMMER and Infernal projects, including optimization, parallelization, automated testing, and coordination with other open source developers in academia and industry. I’m especially looking for someone with experience in parallelization, including SIMD vector parallelization, POSIX threads, and MPI.

The web portal engineer will be a jack (or jill!) of many trades, taking charge of many of the lab’s various outward-facing portals for distributing software, papers, and documentation from our group, including our web site and our GitHub repositories.

We have ads out for these positions with some additional information, such as these at Nature for the scientific software engineer and the web portal engineer. To apply, you can go through one of the ads or straight to the HHMI application system: here, for the scientific software engineer, or here, for the web portal engineer. The ads run through the end of February, I believe, but we will keep looking until these positions are filled.


New positions opening in the group

Construction on the new laboratory here at Harvard is starting to look like it might actually get done. They’re projecting our move-in date to be the first week of November. I’ve been meeting with prospective rotation students and postdoc candidates, even though we still have no place to put anyone or anything. It feels like we have a bunch of planes in the air stacked up in holding patterns, waiting for the bulldozers to finish the runway.

I plan to search for three key staff positions this fall. The first open position is our all-important administrative lab coordinator. I’m already sorely missing our lab coordinators over the years at Janelia, Margaret Jefferies, Patrice Neville, and Sarah Moorehead, and we don’t even have an open lab here yet. An ad is now out in all the fashionable color supplements (here’s the one at Nature Jobs). Help us spread the word, if you know of someone who might be interested in the position!

The other two positions will be two scientific software developers, working on our main codebases in HMMER, Infernal, and Easel. We’re going to make a push on the high performance computing end of things, including parallelization (SIMD vectorization, threading, and MPI), so I want to find a software engineer with experience in C programming in parallel HPC scientific applications. We’re also going to make a push in the visualization and web development end of things, so I want to find a web developer with experience in data visualization who’d also be interested in being a general guru of much of our front-facing stuff, including our web presence, issue tracking, GitHub, and software distribution. I don’t have ads out for these positions yet, but will do soon-ish.

Open faculty position in Harvard FAS Systems Biology

Harvard’s FAS Center for Systems Biology has opened a search for a new tenure-track faculty member at the assistant professor level. Sharad Ramanathan and I are the co-chairs for the search committee.

From the ad:

The Center emphasizes quantitative approaches to fundamental problems in biology. It aims to foster interactions across disciplinary boundaries, housing faculty from a spectrum of academic departments in addition to the Bauer Fellows. Exceptional candidates in any area of quantitative biology will be considered, including those taking computational, theoretical, and/or experimental approaches.

Faculty associated with the Center for Systems Biology have access to facilities and opportunities for collaborative research not only through departments but also through the Bauer Core facilities, the Center for Nanoscale Systems, the Broad Institute, and the Center for Brain Science. The successful candidate will hold an academic appointment in a natural science department such as, but not restricted to, Molecular and Cellular Biology, Organismic and Evolutionary Biology, Physics, Applied Mathematics, or Chemistry and Chemical Biology.

The application web page is here.

HMMER mission control: we are go for launch vehicle separation(s)

HMMER web servers were officially launched today at the EMBL European Bioinformatics Institute (EBI) in Cambridge, UK. You can read an EBI press release here. This marks the completion of the pilot HMMER server project at Janelia Farm and its transition to the EBI. All of this has been led by Rob Finn, now the head of sequence family resources at EBI. A huge thank you goes out to Rob and his team at Janelia (Jody Clements, Bill Arndt, and Ben Miller), to HHMI for funding the pilot project, and to the EBI for agreeing to adopt the pilot project and making it all grown up and respectable.

Today we’ve separated the code development home from the servers. Nonetheless, the two sites are pointing back and forth at each other, so you can download the current HMMER release and documentation from EBI, and you can navigate to EBI’s search pages starting from the code development site. We even think that RESTful URLs for the pilot servers will continue to be forwarded and served properly by the new EBI servers. Let us know if you experience any glitches.

Rob’s team at EBI will run the EBI servers, and the Eddy/Rivas lab will continue to be responsible for the code development site. Because of the terrifyingly sophisticated planning processes we employ in the HMMER project, or maybe it’s just a coincidence, the EBI announcement comes just days before our move to Harvard. Everything HMMER-related at Janelia will now wind down quickly over the next few months. A big change for us. If you’re used to using the pilot servers at Janelia, switch to using the project’s permanent URL. We’re about to turn out the lights here at Janelia.