Summary of known bugs (so far) in 3.0b2

The truth will set you free. But not until it is finished with you.

David Foster Wallace, Infinite Jest

With the release of Pfam 24, HMMER3 beta test code has been pressed into service just a bit earlier than we're comfortable with. Though the 3.0b2 code seems to be holding up reasonably well, we are aware of six bugs -- two in how H3 handles Pfam GA/NC/TC cutoffs, two in how H3 handles the '*' nonresidue character that some sequence files use to represent stop codons in translated DNA, one in hmmscan's tabular domain output ("domtbl") format, and one in sequence description lines with % characters causing crashes in jackhmmer. All six are fixed in the development trunk. We are starting our test/release cycle for 3.0b3 (the third and probably final beta release) now. The fixes will be included in 3.0b3, along with some important new functionality I'll discuss on the blog soon.

More details on the six bugs...

GA/TC/NC thresholds.

3.0b2 has a deliberate "design feature" in how it applies Pfam GA/TC/NC bit score thresholds that has disturbed many people, including Alex Bateman at Pfam, and (after taking a severe email beating from Alex) I agree. The "feature" is as follows (I'll use the GA gathering thresholds for an example, but the same holds for TC trusted or NC noise cutoffs). If a sequence passes the per-sequence score threshold GA1, then at least one domain (the best-scoring) is always reported in the per-domain output, regardless of whether the domain passed the per-domain threshold GA2 or not . My defective thinking was that if someone sees a sequence in the per-sequence output, but no domain in the per-domain output for that sequence, I'd get bug reports about that. (And I probably will, but...) The problem with that is, anyone trying strictly to apply Pfam GA cutoffs gets thwarted, because domains appear in the per-domain output that have not met the GA2 threshold. The development trunk now changes this behavior; GA/TC/NC bit score thresholds will be applied strictly.

Additionally, two serious bugs were found in GA/TC/NC bit score thresholding. The most serious one (#h61) is that hmmscan simply doesn't apply the correct thresholds at all; 3.0b2 hmmscan must not be used with --cut-ga, --cut-tc, or --cut-nc options. hmmsearch (the normal, non-MPI version) is fine. A slightly lesser one (#h60) is that the MPI version of hmmbuild fails to record the thresholds in its output HMM files, for an entirely separate reason; if you build your own HMM files using hmmbuild --mpi, don't use --cut-ga, --cut-tc, or --cut-nc bit score thresholding.

The '*' nonresidue/stop codon character.

It is common to find '*' characters in sequence files, when people 'translate' a stop codon. Until recently, HMMER and its underlying Easel library disallowed * characters in sequence files for the perfectly rational reason that * residue codes aren't legal tender in any IUPAC/IUBMB standard or proposed extension to the standard. I fielded many complaints about this behavior, for the perfectly rational reason that * characters are commonly used regardless of what the musty old standards allow.

The * code is irritating to handle, because it's literally not-a-residue in protein sequences. It makes no sense to generate (or score) a nonresidue from an HMM that's supposed to be generating a probability distribution over amino acid sequences. (BLAST just gives it a score, but in a probability model, we can't be quite that arbitrary.) When we extend H3 to models that generate codons (for alignment of protein models to DNA target sequences), that'll be a different thing; in that case, the handling of stop codons will be well defined.

Support for the * nonresidue character in amino acid sequences was retrofitted into the code in May 2009, and first appeared in 3.0b1. We've been shaking out a few bugs in the retrofit ever since. 3.0b2 has two known bugs in * handling. One (#h57) is most noticeable in jackhmmer, but might also be seen in hmmsearch or phmmer if you use the -O option to output an alignment of your significant hits; the program will crash out with the telltale error message "backconverted subseq didn't end at expected length". The other (#h62) manifested itself as a segmentation fault in the search programs (hmmscan,hmmsearch,phmmer,jackhmmer), in a sporadic and elusive fashion that makes it difficult to isolate test cases.

There's an oddity in how H3 handles * "residues" that eventually I'll need to discuss and document. With other priorities pressing, I'm waiting to see if anyone notices.

two other bugs

#h59: jackhmmer (and only jackhmmer) may crash with a segmentation fault on some sequence database files, when sequence description lines contain a % character. (C programmers will be able to guess what the problem was.)

#h58: in the domtbl output format for hmmscan, the target and query lengths are reversed.