New open positions

We have two new open positions in the group. We are looking for a bioinformatics analyst and a scientific software engineer to join the teams that develop the HMMER and Infernal software packages for biological sequence analysis. The teams are growing, with a key new team member Nick Carter, who joined us from high performance computing research at Intel and a previous computer science faculty position at U Illinois. We’re aiming in particular to bring out the next release of HMMER, the long-fabled HMMER4. These positions offer the short-term opportunity to help us bring these existing projects to the next level, and a longer-term opportunity to participate in a variety of fundamental computational biology algorithms research and software development.

The scientific software engineer will work most closely with Nick and myself on HMMER, and later on Infernal in collaboration with Eric Nawrocki (NIH/NCBI). Our codebases are ANSI C99, we take advantage of SIMD vectorization instructions on multiple platforms, and we’re working hard on parallelization efficiency with multithreading (POSIX threads) and message passing (MPI), so we’re looking for someone with expertise in these technologies. We currently work primarily on Apple OS X and Linux platforms ourselves, but our code has to build and work on any POSIX-compliant platform, so we’re also looking to expand our automated build/test procedures. Our codebases are open source and we work with standard open source tools such as git, autoconf, and GNU make, and development is distributed amongst a small group of people throughout the world, especially including collaborators at NIH NCBI, U Montana, HHMI Janelia Farm, and Cambridge UK; you can see our GitHub repos here. The official advertisement for the position is online here, with instructions on how to apply.

The bioinformatics analyst will work most closely with our bioinformatics secret agent man Tom Jones, Nick, and myself. We’re looking for someone to bridge the gap between the computational engineering and the biological applications, someone who will have one hand in the development team (working on how our command line interfaces work, and how our output formats come out), and another hand on our collaborations within Harvard and with the outside world using and testing our software on real problems. We want someone thinking about benchmarking and testing, with expertise in a scripting language (Python, Perl); we want someone working on user-oriented documentation and tutorials (Jekyll, Markdown); we want someone thinking about ease and elegance of use; we want someone working on how our tools play well with others, including Galaxy and BioConductor, and liaising between us and the various package managers who bundle our software (such as Homebrew, MacPorts, all the various Linuxen). The official ad and instructions to apply are online here.

We are reading applications now, and will accept applications in a rolling fashion for at least another month or so; beyond that depends on how our candidate pool looks. The positions are available immediately. Our funding for them has just activated, supported by the NIH NHGRI under the NIH’s program (PA-14-156) for Extended Development, Hardening and Dissemination of Technologies in Biomedical Computing, Informatics, and Big Data Science; we gratefully acknowledge this new support, as I’ve rejoined the NIH community after ten years away at a monastery.

from zero to python


Students actually showed up, so we really do have to teach the course. MCB112 Biological Data Analysis is now in its first week.

The tricksiest bit in the first couple weeks is bringing people up to speed in writing Python, for people who’ve never written code before. We trust in the power of trial and error. We give working example scripts that are related to what the students are asked to do on a problem set. Developing code by mutation, descent with modification, and selection: coding for biologists.

Soon we’ll start to lift the training wheels, while trying not to leave people in a “now draw the rest of the damn owl” situation.

When you’re learning to code, with every line you type you’re looking something up. Your concentration is getting broken all over the place as you try to express the Simplest Stupid Thing (Why Don’t You Work gaaaah $%^&#@). If you’re also trying to learn something else at the same time that requires hard thinking – an algorithm, a mathematical equation, a biological analysis approach – really just about the last thing you need is to have your concentration broken every ten seconds because you can’t express yourself. The best way to learn to code isn’t to start by writing scientific code. It’s better to code something fun, something that you’re completely absorbed by, something that isn’t too conceptually difficult. You want to have only the code frustrating you, while the goal pulls you in and keeps you engaged.

But I can’t exactly recommend that students learn to code the way that I did. Sure, go get yourself absorbed in an early Internet massive military-industrial simulation game. Automate your country’s economy, re-invent Dijkstra’s shortest path algorithm to distribute your resources, make an interactive display of your map, reverse engineer the client/server communication interface so you can launch automated attacks… no, this is no way to do a PhD. Even if it does mean you end up knowing C and Perl and understanding dynamic programming, GUI development, and networked computing.

So alas, we’ll try to generate entertainment value in more socially acceptable ways, like sand mouse mysteries in the problem sets, or teasing Lior Pachter. We’ll see if it’s enough. If not, maybe I’ll have to see if the old Empire code still compiles.

open position: lab coordinator

We have an open position for a new lab coordinator. Ads are running at Nature and elsewhere. Our LC is our point of contact for everything administrative that makes the lab run smoothly, including responsibility for our financial accounting/purchasing with both HHMI’s and Harvard’s systems, tracking our funding, managing our travel planning, and coordinating the lab’s internal scheduling and communications (a mix of Google Calendar, Google Drive, OneNote, and Slack), among other jack-of-all-trades things. If you know of candidates, pass this info along — and if you’re a candidate yourself, see one of the ads for info on how to apply (to HHMI HR, who do the preliminary resume screening for us).

Biological Data Analysis

I’m starting to plan a new Harvard course that’ll be called Biological Data Analysis. Biology is going through a culture change. It’s suddenly become a data-rich, computational analysis-heavy science. Are we going to outsource data analysis to bioinformaticians and data science specialists, or are biologists going to analyze their own data? There are always advantages to specialization, and we need bioinformatics and data science. But I also feel that we are dangerously weak in training biologists to think about their own data. The usual response I get when I talk about it is something like “you can’t expect wet lab biologists to learn how to program”.

What I want to teach in Biological Data Analysis is that writing scripts and using the command line for data analysis is not software engineering, it’s just a simple and essential thing that a wet lab biologist can do, and needs to do. I’m going to teach from the point of view that biologists already have a special advantage in large-scale data analysis: we are trained to expect that we will be screwed by our experiments. We should be treating data analysis the same way. Like doing an experiment on a complicated organism, any given data analysis only gives you a narrow glimpse into a large data set. God only knows what else is going on in the data that you’re not seeing. Like doing experiments, you need to design positive and negative controls to protect yourself from the hundred different ways that nature (and computers) are going to mess with you. Writing scripts that generate positive and negative controls for a data analysis is a powerful and biologically motivated thing to know how to do.

Once you’re generating negative control data (“here’s what the data would look like if there were no effect to be found”), you’re actually doing statistics, but in an intuitive and motivated way that any biologist can understand. Instead of learning a bunch of incantations and lore about t-tests, you’re forced to think directly about what your null hypothesis is, because you have to make a negative control data set according to that null hypothesis. The “p-value” is directly the probability that you observe a signal in your negative control. This style of simulation-driven analysis is enabled by modern computational power plus the ability to write simple scripts and use the command line. You don’t have to learn statistics per se. You have to learn how to do computational control experiments. I expect that if a biologist learns the simulation-driven style of analysis first, then they’re motivated to go on to learn more serious statistical analysis as they need it… and they’re armed with a powerful way to check analytic results against intuitive simulations.
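To make the idea concrete, here is a minimal sketch of a computational negative control in Python. It is only an illustration, not course material: the function name `empirical_p_value`, the toy measurements, and the choice of null hypothesis (shuffling the treated/control labels, so that any difference between groups is down to chance) are all assumptions made for the example. The “signal” is the difference in group means, and the p-value is estimated as the fraction of negative control data sets that show a signal at least as large as the real one.

```python
import random


def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


def empirical_p_value(treated, control, n_sims=10000, seed=0):
    """Estimate a one-sided p-value by simulating negative controls.

    Assumed null hypothesis (an illustrative choice): the labels
    'treated' and 'control' are meaningless, so shuffling them gives
    data that look like "no effect to be found".
    """
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    observed = mean(treated) - mean(control)

    pooled = list(treated) + list(control)
    n_treated = len(treated)

    hits = 0
    for _ in range(n_sims):
        rng.shuffle(pooled)            # one negative control data set
        simulated = mean(pooled[:n_treated]) - mean(pooled[n_treated:])
        if simulated >= observed:      # control shows a signal this big
            hits += 1

    # Add one to numerator and denominator so the estimate is never
    # exactly zero just because we ran a finite number of simulations.
    return (hits + 1) / (n_sims + 1)


if __name__ == "__main__":
    # Toy measurements, made up for illustration only.
    treated = [10, 11, 12, 13, 14, 15, 16, 17]
    control = [1, 2, 3, 4, 5, 6, 7, 8]
    print(empirical_p_value(treated, control))
```

The point of writing it this way is that the null hypothesis is not hidden inside a named test: it is the one line that shuffles the labels. Change how the negative control is generated and you have changed the question you are asking.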

If this makes any sense to you, and if you happen to be a Harvard PhD student graduating this year, and you’re thinking it might be nice to take a year and do some teaching… boy, do I have a deal for you. Harvard has a thing called the College Fellows Program. This is a one-year position (renewable for one more) that focuses on teaching and course development. We’ve just posted an ad looking for a College Fellow to help me develop and teach Biological Data Analysis. Application deadline is April 15. Feel free to contact me directly with questions!


Open software/web engineer positions

The lab’s growing – the new space opened in November – and we’re starting to get settled in. Now we’re looking for a scientific software engineer and a web portal engineer to join the group.

The scientific software engineer will play key roles in the HMMER and Infernal projects, including optimization, parallelization, automated testing, and coordination with other open source developers in academia and industry. I’m especially looking for someone with experience in parallelization, including SIMD vector parallelization, POSIX threads, and MPI.

The web portal engineer will be a jack (or jill!) of many trades, taking charge of many of the lab’s various outward-facing portals for distributing software, papers, and documentation from our group, including our web site and our GitHub repositories.

We have ads out for these positions with some additional information, such as these at Nature for the scientific software engineer and the web portal engineer. To apply, you can go through one of the ads or straight to the HHMI application system: here, for the scientific software engineer, or here, for the web portal engineer. The ads run through the end of February, I believe, but we will keep looking until these positions are filled.


New positions opening in the group

Construction on the new laboratory here at Harvard is starting to look like it might actually get done. They’re projecting our move-in date to be the first week of November. I’ve been meeting with prospective rotation students and postdoc candidates, even though we still have no place to put anyone or anything. It feels like we have a bunch of planes in the air stacked up in holding patterns, waiting for the bulldozers to finish the runway.

I plan to search for three key staff positions this fall. The first open position is our all-important administrative lab coordinator. I’m already sorely missing our lab coordinators over the years at Janelia, Margaret Jefferies, Patrice Neville, and Sarah Moorehead, and we don’t even have an open lab here yet. An ad is now out in all the fashionable color supplements (here’s the one at Nature Jobs). Help us spread the word, if you know of someone who might be interested in the position!

The other two positions will be two scientific software developers, working on our main codebases in HMMER, Infernal, and Easel. We’re going to make a push on the high performance computing end of things, including parallelization (SIMD vectorization, threading, and MPI), so I want to find a software engineer with experience in C programming in parallel HPC scientific applications. We’re also going to make a push in the visualization and web development end of things, so I want to find a web developer with experience in data visualization who’d also be interested in being a general guru of much of our front-facing stuff, including our web presence, issue tracking, GitHub, and software distribution. I don’t have ads out for these positions yet, but will do soon-ish.