New open positions

We have two new open positions in the group. We are looking for a bioinformatics analyst and a scientific software engineer to join the teams that develop the HMMER and Infernal software packages for biological sequence analysis. The teams are growing: a key new team member, Nick Carter, joined us from high-performance computing research at Intel and a previous computer science faculty position at U Illinois. We’re aiming in particular to bring out the next release of HMMER, the long-fabled HMMER4. These positions offer the short-term opportunity to help us bring these existing projects to the next level, and a longer-term opportunity to participate in a variety of fundamental computational biology algorithm research and software development.

The scientific software engineer will work most closely with Nick and myself on HMMER, and later on Infernal in collaboration with Eric Nawrocki (NIH/NCBI). Our codebases are ANSI C99, we take advantage of SIMD vectorization instructions on multiple platforms, and we’re working hard on parallelization efficiency with multithreading (POSIX threads) and message passing (MPI), so we’re looking for someone with expertise in these technologies. We currently work primarily on Apple OS X and Linux platforms ourselves, but our code has to build and work on any POSIX-compliant platform, so we’re also looking to expand our automated build/test procedures. Our codebases are open source and we work with standard open source tools such as git, autoconf, and GNU make. Development is distributed amongst a small group of people throughout the world, especially including collaborators at NIH NCBI, U Montana, HHMI Janelia Farm, and Cambridge UK; you can see our GitHub repos here. The official advertisement for the position is online here, with instructions on how to apply.

The bioinformatics analyst will work most closely with our bioinformatics secret agent man Tom Jones, Nick, and myself. We’re looking for someone to bridge the gap between the computational engineering and the biological applications, someone who will have one hand in the development team (working on how our command line interfaces work, and how our output formats come out), and another hand on our collaborations within Harvard and with the outside world using and testing our software on real problems. We want someone thinking about benchmarking and testing, with expertise in a scripting language (Python, Perl); we want someone working on user-oriented documentation and tutorials (Jekyll, Markdown); we want someone thinking about ease and elegance of use; we want someone working on how our tools play well with others, including Galaxy and Bioconductor, and liaising between us and the various package managers who bundle our software (such as Homebrew, MacPorts, all the various Linuxen). The official ad and instructions to apply are online here.

We are reading applications now, and will accept applications in a rolling fashion for at least another month or so; beyond that depends on how our candidate pool looks. The positions are available immediately. Our funding for them has just activated, supported by the NIH NHGRI under the NIH’s program (PA-14-156) for Extended Development, Hardening and Dissemination of Technologies in Biomedical Computing, Informatics, and Big Data Science; we gratefully acknowledge this new support, as I’ve rejoined the NIH community after ten years away at a monastery.

Infernal 1.1: RNA alignment and database search, 10,000x faster

One of our lab’s goals is to make it possible to systematically search for homologs of RNAs in genomes, not just by looking for sequence conservation but also by looking for RNA secondary structure conservation. A powerful model framework for RNA structure/sequence comparison, called profile stochastic context-free grammars (profile SCFGs), was introduced in the mid-1990s both by Yasu Sakakibara and by us. But profile SCFG methods are among the most computationally intensive algorithms used in genome sequence analysis, requiring (in their textbook description, anyway) O(N^4) time and O(N^3) memory for an RNA of N residues. Profile SCFG implementations like our Infernal software have required immense computational power to get even the most basic sort of searches done.
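To get a feel for what those textbook bounds mean in practice, here is a small sketch of how cost grows with RNA length. The function name `relative_cost` and the example lengths are made up for illustration; the exponents are the O(N^4) time and O(N^3) memory figures quoted above, and constant factors are ignored.

```python
def relative_cost(n_old: int, n_new: int, exponent: int) -> float:
    """Growth factor in cost when length goes from n_old to n_new,
    assuming cost scales as N**exponent (constant factors ignored)."""
    return (n_new / n_old) ** exponent


# Hypothetical example: doubling an RNA from 100 to 200 residues.
time_factor = relative_cost(100, 200, 4)    # textbook O(N^4) time
memory_factor = relative_cost(100, 200, 3)  # textbook O(N^3) memory

print(f"time grows {time_factor:.0f}x, memory grows {memory_factor:.0f}x")
```

So under these assumptions, doubling the length of the RNA costs 16 times the compute and 8 times the memory, which is why even modestly longer RNAs get expensive quickly.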

We are happy to announce a new landmark in our work on these methods, with a new version of Infernal that is about 100x faster than the previous (1.0) version, and 10,000x faster than when Eric Nawrocki started working on making Infernal fast enough for routine use. Over at infernal.janelia.org, Eric has made available the first release candidate of Infernal 1.1, 1.1rc1, including source code and binaries for Linux and Mac OS X. A typical RNA homology search of a vertebrate genome that used to require a CPU-year can now be done in about an hour on a single CPU, or a few seconds on a cluster.
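As a quick sanity check on those round numbers, the arithmetic can be sketched as below. This only uses the figures quoted above (a CPU-year before, about an hour now); the constant and function names are hypothetical, and the result is an order-of-magnitude estimate, not a benchmark.

```python
# Hours in a year, using 365.25 days to average over leap years.
HOURS_PER_YEAR = 365.25 * 24


def speedup(old_hours: float, new_hours: float) -> float:
    """Ratio of old running time to new running time."""
    return old_hours / new_hours


# One CPU-year before versus about one CPU-hour now.
factor = speedup(HOURS_PER_YEAR, 1.0)
print(f"CPU-year -> CPU-hour is roughly a {factor:,.0f}x speedup")
```

A CPU-year is a bit under 9,000 CPU-hours, so going from a CPU-year to about an hour is in line with the "10,000x" figure as a round number.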

So really for the first time, Infernal has become practical for systematic RNA sequence analysis of whole genomes. Roughly speaking, Infernal 1.1 is running at a speed comparable to what HMMER2 ran at — we’ve brought the RNA search problem down from the utterly ridiculous to the merely difficult.

The next version of the Rfam RNA sequence family database will be the first to be computed entirely natively with Infernal RNA structure comparison, instead of using BLASTN as a prefilter. An all-vs-all comparison of all 2000 Rfam models against the entire EMBL DNA database (170 Gb) would take 30,000 CPU-years using Infernal 0.55; now with Infernal 1.1, that enormous Rfam compute is only going to take us about a day on Janelia’s cluster.

Like Infernal 1.0, 1.1 achieves its speed by using profile HMMs as heuristic prefilters. Whereas 1.0 used HMMER2-like prefilters, 1.1 has now switched to using HMMER3’s vector engine, sharing code with Travis Wheeler’s soon-to-be-announced nhmmer program for DNA/DNA comparison.
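The general idea of a heuristic prefilter can be sketched in a few lines. This is a toy illustration and not Infernal's code: `filtered_search`, both toy scoring functions, the thresholds, and the example windows are all invented here. The cheap score stands in for a fast sequence-only filter, and the expensive score stands in for the slow structure-aware comparison that only runs on windows that survive the filter.

```python
from typing import Callable, Iterable, List, Tuple


def filtered_search(
    windows: Iterable[str],
    cheap_score: Callable[[str], float],
    expensive_score: Callable[[str], float],
    filter_threshold: float,
    report_threshold: float,
) -> Tuple[List[Tuple[str, float]], int]:
    """Run the expensive scorer only on windows that pass the cheap filter.

    Returns the reported hits and the number of expensive calls made.
    """
    hits = []
    expensive_calls = 0
    for window in windows:
        if cheap_score(window) < filter_threshold:
            continue  # rejected by the fast filter; the slow scorer never runs
        expensive_calls += 1
        score = expensive_score(window)
        if score >= report_threshold:
            hits.append((window, score))
    return hits, expensive_calls


def toy_cheap_score(seq: str) -> int:
    """Toy sequence-only score: number of G and C residues."""
    return seq.count("G") + seq.count("C")


def toy_expensive_score(seq: str) -> int:
    """Toy structure-like score: complementary pairs between the two ends."""
    pairs = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A")}
    return sum((seq[i], seq[-1 - i]) in pairs for i in range(len(seq) // 2))


windows = ["GGGAAACCC", "AAAAAAAAA", "GCGCUUUUU", "UUUUUUUUU"]
hits, expensive_calls = filtered_search(
    windows, toy_cheap_score, toy_expensive_score,
    filter_threshold=4, report_threshold=3,
)
print(hits, expensive_calls)
```

In this toy run only two of the four windows reach the expensive scorer. The tradeoff is the usual one for heuristic filters: a window the cheap score rejects is never seen by the slow scorer, so the filter threshold trades speed against sensitivity.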

Happy RNA hunting — and don’t let anyone tell you that O(N^4) algorithms aren’t tractable!

Infernal 1.0

Infernal 1.0 — the first production release of our software for RNA sequence/structure homology search and alignment — has been released. Source code and documentation are available over at infernal.janelia.org.

Infernal development is now led by Eric Nawrocki in the lab.

Infernal’s actually been used by the Rfam database since about 2002, in pre-1.0 versions. This is the first version that we think is really ready for production use, more than just a testbed for algorithms. The 1.0 code went through several release candidates in the past few months. We think we have all the obvious bugs shaken out of it.

The problem Infernal addresses is RNA homology search.