We've had a couple of reports of some less-than-intuitive behavior of HMMER3 on poor-scoring sequences. As one correspondent described it, HMMER3 is stubborn. It will refuse to score and align certain low-scoring sequences no matter what options you try to set. It's probably worth explaining this behavior in public, partly because it's an opportunity for me to briefly describe the fact that H3 has two processing pipelines: the "acceleration" pipeline, and the "domain postprocessing" pipeline. Only the acceleration pipeline is written up for publication, reasonably well documented, and well controllable by options. The domain postprocessor is ad hoc, not terribly satisfying, not well documented, not easily configurable -- and it kicks back a side effect that drops some poor-scoring sequences entirely.
Suppose you want to see a bit score (and alignment) for every sequence in your target database, regardless of how terrible it is. (For instance, maybe you want to plug those H3 scores into some downstream classifier you're writing.)
You'd think, from the way our documentation describes it, that turning on the --max option would turn off all heuristics, and leave you with just a Forward log-odds likelihood score in bits. That's not quite what --max does, though. --max only controls the behavior of the acceleration pipeline.
The acceleration pipeline
There's a manuscript that describes the acceleration pipeline. Briefly, the acceleration pipeline consists of four steps. First, the "MSV filter" computes an optimal sum of ungapped local alignments to the profile (the moral equivalent of BLAST's "sum score", achieved by a less heuristic route). Second, the "bias filter" computes a rough correction to all bit scores, depending on the composition of the target; this correction is applied retroactively to sequences that passed the MSV filter, and then to subsequent bit scores too. Third, the "Viterbi filter" computes an optimal sum of gapped local alignments, using a fast reduced-precision (16-bit) score range. Fourth, all sequences that pass these filters are handed to an HMM Forward calculation that computes the sequence's log-odds likelihood bit score, by summing over alignment uncertainty. The Forward bit score is also used as a filter, in that low-scoring sequences are skipped.
P-value thresholds for the MSV, Viterbi, and Forward bit scores are individually controlled by the --F1, --F2, and --F3 options; setting any of these to >= 1.0 is equivalent to turning off that filter step. The bias filter can be turned off with --nobias. Setting the --max option turns it all off, and is equivalent to --F1 1.0 --F2 1.0 --F3 1.0 --nobias.
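In pseudocode, the cascade behaves like a chain of P-value cutoffs. This is a toy Python sketch, not HMMER's C; the scoring functions are stand-ins, and the default thresholds shown (0.02, 1e-3, 1e-5) are H3's shipped defaults:

```python
# Toy sketch of the acceleration pipeline's filter logic (not HMMER's C code).
# Each stage converts a score to a P-value and drops the sequence if it fails;
# thresholds >= 1.0 pass everything, which is what --max amounts to.
# msv_p, bias_p, vit_p, fwd_p are hypothetical stand-ins returning P-values.

def accel_pipeline(seq, msv_p, bias_p, vit_p, fwd_p,
                   F1=0.02, F2=1e-3, F3=1e-5, nobias=False):
    if msv_p(seq) > F1:
        return None                      # failed the MSV filter
    if not nobias and bias_p(seq) > F1:  # bias-corrected score, rechecked vs F1
        return None
    if vit_p(seq) > F2:
        return None                      # failed the Viterbi filter
    p = fwd_p(seq)
    if p > F3:
        return None                      # failed the Forward filter
    return p                             # survivor: on to domain postprocessing
```

With F1 = F2 = F3 = 1.0 and the bias filter off (i.e. --max), every sequence reaches the Forward stage and survives it.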
All sequences that pass the acceleration pipeline are handed over to the domain definition pipeline. First, we get some probabilities ready. We already computed the HMM Forward algorithm; now we compute the HMM Backward algorithm. Both the Forward and Backward calculations are done in a linear-memory mode that only saves probabilities for the "special" states (S,N,B,E,C,T) and discards the actual alignment probabilities (M,D,I states). This information is sufficient to compute, in efficient linear memory by Forward-Backward equations, for every position i in the target sequence, the probability that position i is in a homology region or not, and the probability that a homology region starts or ends at that position. The domain definition pipeline will use this information.
So far so good -- it's all clean probability theory up to this point. Now it goes all to hell.
The domain definition pipeline
There's no manuscript yet on the domain definition pipeline, partly because I'm not happy with it so I expect it to be substantially revised. But here's what it currently does.
We want to see coordinates for domains, and alignments for those domains. We don't have that -- we have a huge quantity of beautiful uncertainty measures, accounting for the entire uncertain ensemble of all possible alignments to the target sequence. But nobody wants to see that, at least not in their first glance at search output; we want coordinates and alignments -- point estimates. How best to partially collapse the alignment ensemble into some discrete point estimates of where local alignments lie (in terms of start/end coords on the target sequence), and what those alignments are (in terms of residue-by-residue alignment)?
Imagine the following graph along the target sequence (i.e. the x-axis is coordinate position i on the target sequence). We can plot the probability that position i is in a homology region or not: "in" the core profile's homology region (aligned to an M or I state), as opposed to "out" of it (aligned to a flanking N,C, or J state). We can also plot the probability that we begin or end a homology region at each position: the probability of a transition from the B state (begin) at each position, and the probability that we make a transition to the E state at each position. The code refers to these three plots as mocc, bocc, and eocc, for "model occupancy", "begin occupancy", and "end occupancy". These three plots summarize our uncertainty about the position of homology regions on the target.
To convert this beautiful graph into stark point estimates, we use heuristics. Any position with mocc >= 25% (a threshold called "rt1", for "region threshold 1") is called "inside" a homology region. Homology regions are extended left and right until mocc-bocc and mocc-eocc (the probability that this position is in a homology region, but is not the start/end of that homology region) drop below 10% (the "rt2" threshold). (In effect, anyway. Actually there's an efficient one-pass algorithm for this, and the code doesn't do any explicit left/right extension steps.)
This defines what the code calls "regions". A "region" is a segment i..j that we believe contains probability mass for one or more local alignments.
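As a concrete sketch of that heuristic, here's a toy Python version that does the explicit left/right extension described above (HMMER's actual code uses a one-pass algorithm; the rt1/rt2 values are the thresholds just named):

```python
RT1, RT2 = 0.25, 0.10   # region thresholds rt1 and rt2

def find_regions(mocc, bocc, eocc):
    """Seed a region wherever mocc >= rt1, then extend left while
    mocc - bocc stays >= rt2 and right while mocc - eocc stays >= rt2.
    Toy version with explicit extension, for clarity only."""
    n, regions, i = len(mocc), [], 0
    while i < n:
        if mocc[i] < RT1:
            i += 1
            continue
        lo = i
        while lo > 0 and mocc[lo - 1] - bocc[lo - 1] >= RT2:
            lo -= 1
        hi = i
        while hi + 1 < n and mocc[hi + 1] - eocc[hi + 1] >= RT2:
            hi += 1
        regions.append((lo, hi))   # segment lo..hi is one "region"
        i = hi + 1
    return regions
```

For example, with mocc = [0.05, 0.30, 0.90, 0.90, 0.30, 0.05] and negligible bocc/eocc, this yields a single region spanning positions 1..4.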
As you might guess, there's no guarantee that a region contains only one local alignment to the model. We could easily have merged two or more local alignments into one region. But we have a measure of that, from our bocc and eocc plots. If there's a single local alignment, all our "begin" mass precedes all our "end" mass. Indeed, for a well defined single alignment, we expect to see a spike in bocc at the start, and a spike in eocc at the end. If there's more than one local alignment, we'll see spikes of "begin" mass and "end" mass interleaved. We can use bocc and eocc to calculate the expected number of begins following ends: i.e. the expected number of additional local alignments in the region, other than the first one. If that expected number is greater than 0.20 (the "rt3" threshold), we provisionally declare that this region is a "multidomain" region; otherwise we declare that it is a "single domain" region.
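The "expected number of begins following ends" can be sketched like so (toy Python; an illustrative approximation of the idea, not HMMER's exact bookkeeping):

```python
RT3 = 0.20   # region threshold rt3

def expected_extra_alignments(bocc, eocc):
    """Expected number of 'begin after an end' events in a region:
    weight the begin mass at each position by the end mass accumulated
    strictly before it. An illustrative approximation, not HMMER's
    exact calculation."""
    extra, ends_so_far = 0.0, 0.0
    for b, e in zip(bocc, eocc):
        extra += b * ends_so_far
        ends_so_far += e
    return extra

def is_multidomain(bocc, eocc):
    # provisional call: more than rt3 expected extra alignments
    return expected_extra_alignments(bocc, eocc) > RT3
```

A single clean alignment (all begin mass preceding all end mass) gives 0.0 extra alignments; two interleaved begin/end spikes give 1.0, well over rt3.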
For single domain regions, we take the region coordinates we already have, and call those coords the "envelope" of our local alignment.
For multidomain regions, we hand the region over to yet another pipeline, the "stochastic clustering" pipeline. Let's save that routine's description for another day. Basically, we Monte-Carlo sample alignments from the ensemble, cluster them, and sum up the total probability mass (number of samples) in each cluster. For nonoverlapping clusters of sufficient mass, we calculate consensus endpoints (again using probability mass estimates), and declare these to be "envelopes".
The code defines an "envelope" to be a segment i..j that we believe contains probability mass for one discrete local alignment.
So, what's the local alignment?
Now we pass each envelope, one at a time, to a new Forward/Backward calculation, with the model temporarily reconfigured into "unihit" mode (transitions through the J state prohibited). This time we keep the entire matrices. We combine them, using Forward-Backward equations to compute the posterior probability that each residue i is aligned to each state of the model. We use the maximum expected accuracy (MEA) algorithm to compute, by dynamic programming on the posterior probabilities, a local alignment path that has the maximum sum.
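A stripped-down version of the MEA step might look like this (toy Python; it scores only match-state posteriors and returns just the best total posterior mass, whereas the real thing also handles insert/delete posteriors and recovers the traceback):

```python
def mea_score(pp):
    """pp[i][k] = posterior probability that target residue i aligns to
    model match state k. Local MEA by dynamic programming: maximize the
    summed posterior along a monotone path; gap moves gain nothing; the
    max(0.0, ...) term lets the alignment start anywhere, i.e. local."""
    L, M = len(pp), len(pp[0])
    d = [[0.0] * (M + 1) for _ in range(L + 1)]
    best = 0.0
    for i in range(1, L + 1):
        for k in range(1, M + 1):
            d[i][k] = max(0.0,                         # start a fresh local path
                          d[i-1][k-1] + pp[i-1][k-1],  # align residue i to state k
                          d[i-1][k],                   # leave residue i unaligned
                          d[i][k-1])                   # skip model state k
            best = max(best, d[i][k])
    return best
```

On a 2x2 posterior matrix with strong diagonal mass, the best local path simply collects the two diagonal posteriors.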
The MEA alignment may have endpoints that lie inside the envelope, because an edge residue may have a greater than 50% posterior probability of being nonhomologous (aligned to N or C) rather than aligned to the model. Remember, we only required regions to have >25% posterior probability at any one point.
This explains why HMMER reports envelope coords and alignment coords separately, and why they're not necessarily the same. The envelope is a segment in which there seems to be probability mass supporting one local alignment, regardless of what that alignment is. The alignment is a reasonable point estimate for one "best" alignment.
By the way, this also explains why HMMER, phmmer especially, may exhaust your machine's memory on some model/sequence comparisons. The envelope is realigned to the model using full quadratic dynamic programming (Forward, Backward, posterior probability matrix calculation, and MEA alignment), requiring memory proportional to model length M times sequence length L -- roughly 24ML bytes. That's fine for almost everything (the largest Pfam model of M=2200 by the largest sequence of L=45000 might ask for 2.4G RAM and strain you a bit, but most machines have that much RAM). However, if you use phmmer to align titin to titin (45000 by 45000), H3 will try to allocate a total of about 50G of RAM when it gets to its envelope processor, and one of these allocations will fail.
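The arithmetic is easy to check (the 24 bytes/cell figure is the one quoted above):

```python
def envelope_dp_bytes(M, L, bytes_per_cell=24):
    # Rough memory demand of the quadratic envelope DP: Forward, Backward,
    # and posterior matrices together at ~24 bytes per model-by-sequence cell.
    return bytes_per_cell * M * L

# largest Pfam model (M=2200) vs largest sequence (L=45000): ~2.4 GB
print(envelope_dp_bytes(2200, 45000) / 1e9)
# phmmer titin vs titin (45000 x 45000): ~48.6 GB
print(envelope_dp_bytes(45000, 45000) / 1e9)
```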
After finding the MEA alignment, it does some other stuff too -- most importantly, it calculates an envelope-specific biased composition correction, using the posterior probability matrices. I'll leave aside the description of that "null2" correction, for now.
At the end, for each envelope, it has a bias-corrected bit score, envelope and alignment coords, the MEA alignment, and posterior probabilities for each residue in the MEA alignment. The domain postprocessor now hands this information back to the main HMMER3 pipeline. Ta-da!
Where the stubbornness comes from
What happens if the target sequence has ZERO regions, envelopes, and alignments?
That can happen. A target sequence that manages to pass the acceleration pipeline thresholds (especially if you have set --max to make everything pass!) doesn't necessarily have any probability mass sufficient to pass the heuristic thresholds that the domain postprocessor is using to find regions and envelopes. It doesn't take much probability mass to trigger at least one envelope, but still. If the target sequence is sufficiently unalignable, the domain postprocessor may return nothing.
Here I had to make a design decision. Do I want people complaining that they have sequences with scores in the per-sequence output that are missing domains and alignments in the rest of the output? Or do I want people complaining that some sequences are stubbornly ignored regardless of the settings on the acceleration pipeline? I chose the latter: if the domain postprocessor returns zero envelopes, that acts as an additional filter: the sequence is skipped.
Only terrible-scoring sequences are skipped at this stage, so I didn't actually expect any complaints. The only way you'd see the effect, as far as I know, is to turn --max on and check that every sequence is reported in output with some score, however terrible it may be. But it turns out that some people like to see a score for every sequence, either because they're passing those scores into a downstream analysis, or because they're just doing a sanity check on our output.
There's currently no way to turn that behavior off. If you want to hack the code, it should be easy enough: in src/p7_pipeline.c::p7_Pipeline(), you'll see:
    status = p7_domaindef_ByPosteriorHeuristics(sq, om, pli->oxf, pli->oxb, pli->fwd, pli->bck, pli->ddef, NULL);
    if (status != eslOK) ESL_FAIL(status, pli->errbuf, "domain definition workflow failure"); /* eslERANGE can happen */
    if (pli->ddef->nregions   == 0) return eslOK; /* score passed threshold but there's no discrete domains here */
    if (pli->ddef->nenvelopes == 0) return eslOK; /* rarer: region was found, stochastic clustered, no envelopes found */
The first line runs the domain definition pipeline on sequence sq with model om. The last two lines check that the sequence has at least one region and one envelope. I think you can probably just delete those two lines, and the code will report scores for all sequences.
Why it isn't more controllable
The domain postprocessor is inefficient, slow, and kludgy. It uses quadratic-memory dynamic programming routines which can fail on large model/sequence comparisons, especially when you have multiple threads running. It's stupid to recalculate the full Forward/Backward matrices for each region when we've already done it once for the whole sequence (and thrown away most of the result in the name of memory efficiency in the linear-memory acceleration pipeline, memory efficiency we don't have in the domain definition pipeline anyway!). The ad hoc method for resolving multidomain regions is prone to weird edge case failures.
What we ought to be doing is preserving much more information from the Forward/Backward matrices of the entire target sequence (using checkpointing algorithms, for example; Tarnas and Hughey, 1998), using that information to identify high-probability-mass bands that constrain all subsequent steps (in time and memory), and also using it in less ad hoc calculations in the domain postprocessor. That's the focus of my current effort, in fact: right now I'm looking to radically improve the back end of HMMER3, and use that work as a motivator to test, document, add option-controlling code, and write it all up in a more reproducible and rational form that you all can understand and constructively criticize.