it is not necessary to accept everything as true, one must only accept it as necessary.
Franz Kafka, The Trial
As the unearthly whine of the Janelia compute cluster subsides after testing the beta1 release, power and lights are flickering back to life in Virginia, and I have a quiet moment to talk about failure.
Rob Finn at Pfam has uncovered HMMER3's most important known failure mode to date. In Rob's experiments, for about 25% of Pfam families, HMMER3-based Pfam models identify fewer homologs than HMMER2 did. In many (most) cases, these are likely to come from HMMER2 false positives (HMMER3 has better E-values, better treatment of biased composition, and a significantly lower false positive rate than HMMER2). Some cases are just thresholding issues with the Pfam curated GA cutoffs (gathering thresholds), as we all get used to the new HMMER3 bit score distributions, and Rob's dialing those in. Some differences come from small sequence fragments; HMMER3 models tend to have lower information content (average score) per position, so they require somewhat longer alignments than HMMER2 did to achieve significant scores, and HMMER3 may miss very small fragments (10 residues or so) that HMMER2 would have seen. (We might be able to do something about this; there's some ideas on our road map.)
But the really interesting and horrifying failure mode that Rob's seen is that there are a handful of models in Pfam that are "tuned" to HMMER2's default use of glocal alignment (a complete domain with respect to the query model, local in the target sequence). These are short models (less than 100 consensus residues), with lots of extremely diverse sequences in the alignment (low average sequence identity). This can give a case where the average score per homologous position is low, across a short model, such that true homologs are just barely resolved from the noise even when you have all the consensus residues aligned. In local alignments (HMMER3's default and only mode), the probability theory assesses an extra bit score penalty of log_2 (2/(M(M+1))) (for the extra freedom of finding start/end positions anywhere in a model of consensus length M), which works out to about -12 bits for a model of M=100. If the true homologs were resolved from noise by less than 12 bits by glocal alignment, then in local alignments, they may dip into the noise... and disappear.
Now a mea culpa is in order. When the 2008 PLoS Comp Bio paper on HMMER3's local alignment statistics was peer reviewed, one of the reviewers, John Spouge from NCBI, wanted me to comment on glocal alignment being more sensitive/specific than local alignment (when a glocal alignment exists, anyway). Indeed, several people in the literature had reported on H2's poor local alignment performance compared to its glocal alignment performance, and we'd reproduced those observations here; but I became convinced that the issue was entirely due to effects of the information content per position. Glocal alignment is not very sensitive to info content/position, but local alignment is; and when we started using entropy weighting to reduce info content/position in HMMER models, following in the footsteps of Kevin Karplus and his group at UCSC, our H2 local alignment performance on average in benchmarks exceeded H2 glocal performance. So my response to John's point in the reviews was just that, that I thought it was largely an information content issue, and that I didn't really see any cases where glocal outperformed local any more. Well, um; now I do. (In average case behavior, I still do think my response to John was correct; I didn't realize that there was a subset of models that would become particularly problematic.)
How bad is it? Hey, it's only a small set of models! Maybe we can just sweep them under the rug. Who would possibly notice? After all, the average case behavior of H3 in benchmarks is pretty damned great, if I do say so myself.
Well, ok, let's see. What would these Pfam families be, that would be short in length, and have large, extremely diverse seed alignments? Well, only, uh... the zinc finger transcription factors, for starters, including the classic C2H2 (zf_C2H2) and C3HC4 ring finger (zf_C3HC4) Pfam families. Yup, I'm sure transcription factors would hardly be missed, if they suddenly disappeared from genome-wide protein domain annotation, right?
OK, so no, we can't sweep it under the rug. We're going to take two approaches to rectifying this.
The short term solution is on Rob's side of the pond. Pfam Cambridge is going to rebuild the seed alignments for the problematic families, splitting them into less diverse subfamilies if necessary. Pfam is used to this business of making multiple models for a family; there's already many diverse protein families that Pfam had to represent as several models, such as the many families of G-protein coupled receptors, because of the limitations of HMMER2's sensitivity.
The long term solution is on our side of the pond. We'll bring back glocal alignment for HMMER3. The main challenge there is determining the statistical significance of glocal alignment scores without going back to a compute-intensive simulation to calibrate E-value parameters. There's good work from John Spouge's group we can draw on for this, and some very promising new work on importance sampling from Lee Newberg, and I've got some ideas of my own. I've been talking with both John and Lee about glocal alignment statistics and we should be able to do something together or pairwise on this front. It won't happen soon - I want to get the current work written up and documented first - but that'll be one of the first new fronts that get tackled after my summer o' writing is done.
In the meantime, watch out for missing fingers.