HMMER3 alpha test release: Monday January 12

HMMER3 is spawning a lot of work. It may never be "done" to my satisfaction. It might be that the best way forward is to release it and get it onto an evolutionary trajectory. So this'll be an odd "alpha test". A lot of the functionality that I plan to put in HMMER3 isn't there. But, ready or not, barring some last minute glitch, I'm releasing the first alpha test Monday.

There's a core of Pfam-like support: hmmbuild (build a profile HMM from a multiple alignment), hmmsearch (search a sequence database with a profile HMM), hmmscan (search a sequence against a profile database; what H2 called 'hmmpfam'), and hmmalign (align sequences to a profile HMM and output a multiple alignment).
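To make the build-then-search workflow concrete, here is a minimal Python sketch that only assembles command lines; it doesn't run anything. The argument order is taken from the usage lines of the released HMMER3 programs, so it is an assumption as far as the alpha goes, and the helper names (hmmer_argv, build_then_search) and the file names are hypothetical.

```python
# Positional arguments for each program, as in released HMMER3 usage lines.
# Assumption: the alpha release takes its arguments in the same order.
USAGE = {
    "hmmbuild": ("hmmfile_out", "msafile"),
    "hmmsearch": ("hmmfile", "seqdb"),
    "hmmscan": ("hmmdb", "seqfile"),
    "hmmalign": ("hmmfile", "seqfile"),
}


def hmmer_argv(program, *paths):
    """Return an argv list for one program (hypothetical helper)."""
    if program not in USAGE:
        raise ValueError(f"unknown program: {program}")
    expected = USAGE[program]
    if len(paths) != len(expected):
        raise ValueError(f"{program} expects: {' '.join(expected)}")
    return [program, *paths]


def build_then_search(alignment, profile, seqdb):
    """Build a profile from an alignment, then search a database with it."""
    return [
        hmmer_argv("hmmbuild", profile, alignment),
        hmmer_argv("hmmsearch", profile, seqdb),
    ]


if __name__ == "__main__":
    # File names are placeholders, not files shipped with HMMER.
    for argv in build_then_search("globins.sto", "globins.hmm", "seqdb.fasta"):
        print(" ".join(argv))
```

If the programs are installed, each list could be handed to subprocess.run(argv, check=True).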

There's another growing set of BLAST-like functionality: phmmer (search a single sequence against a sequence database, like BLASTP), and jackhmmer (iterative search, like PSI-BLAST). phmmer will likely be part of the first alpha release; jackhmmer will not, because I'm not happy with its output format or its command line options yet, though this really shouldn't take long to resolve.

I'm going to document a "safe path" for using HMMER3; 'do this, and it won't explode and kill you'. Like, use Stockholm format alignments, and don't rely on the input parsers for other alignment formats yet. I've spent a lot of time on the core science, but not enough yet on the useful edges. One of the objectives of the alpha test will be to identify the landmines I don't know about yet, and widen the safe paths.
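As an illustration of staying on the safe path, here is a minimal Python sketch that sanity-checks a Stockholm alignment before handing it to hmmbuild. It isn't HMMER's parser and it checks only the basics: the "# STOCKHOLM 1.0" header, the "//" terminator, and that every sequence ends up the same aligned length. The function name and the sample alignment are hypothetical.

```python
def check_stockholm(text):
    """Return {name: aligned_sequence} or raise ValueError (sketch only)."""
    lines = [ln.rstrip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith("# STOCKHOLM 1.0"):
        raise ValueError("missing '# STOCKHOLM 1.0' header line")
    if lines[-1].strip() != "//":
        raise ValueError("missing '//' terminator line")

    seqs = {}
    for ln in lines[1:-1]:
        if ln.startswith("#"):
            continue  # markup lines (#=GF, #=GS, #=GR, #=GC) and comments
        fields = ln.split()
        if len(fields) != 2:
            raise ValueError(f"expected '<name> <aligned text>': {ln!r}")
        name, chunk = fields
        # Interleaved files repeat each name once per block; concatenate.
        seqs[name] = seqs.get(name, "") + chunk

    if not seqs:
        raise ValueError("no sequences found")
    if len({len(s) for s in seqs.values()}) != 1:
        raise ValueError("aligned sequences differ in length")
    return seqs


if __name__ == "__main__":
    sample = "# STOCKHOLM 1.0\n\nseq1  ACDEFGHIK\nseq2  ACD-FGHIK\n//\n"
    print(sorted(check_stockholm(sample)))
```

A file that fails this check is off the safe path; a file that passes is only plausibly Stockholm, since the sketch ignores the contents of markup lines.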

Another objective of the alpha test will be to finalize input and output formats. Martin Gollery asked in a comment whether the formats would be the same as HMMER2.3. It's not going to be possible to keep the formats exactly the same, and that includes the HMM file format, so on the one hand, HMMER3 will be like a new software package. On the other hand, the formats will still be straightforward and similar to previous HMMER or BLAST formats, so I'm hoping it doesn't cause too much pain for people's parsers. I'm also happy to take feedback during the test period and change H3 formats to make things easier on people, to the extent that I can. H3 also already has i/o conversion to/from HMMER2 profile HMM format, to minimize breakage of existing HMMER2-dependent add-on tools.

One of the biggest missing pieces of functionality is the ability to do anything but protein/protein alignment. The ability to search DNA sequence databases isn't there yet, though there are tendrils of support wound throughout the codebase, sleeper agents waiting to be activated. DNA searches aren't going to be trivial, for a technical reason that comes from the way I've built the heuristic acceleration code.