The need for (vector) speed.

A commenter asked yesterday about support for Sun SPARC VIS instructions in HMMER3, and I accidentally deleted the comment in an enthusiastic burst of spam-purging. Have now upgraded WordPress and turned on the akismet spam filter; will see if that helps.

Anyway, in penance for my poor comment moderation: a description of HMMER3’s use of SIMD vector instructions, and a bit of a road map for going forward with support for vector instruction sets beyond Intel/AMD SSE instructions.

HMMER2 scores came from a simple implementation of the Viterbi optimal alignment algorithm — essentially the same as Smith/Waterman local alignment. H2’s Viterbi implementation runs at a speed of 23 Mc/s (more on the unit of measure in a second).

Turns out I made a dumb error in implementing it: there’s a simple optimization (having to do with cache stride, an issue I thought I was paying attention to) that yields an easy 3x speedup. The reference Viterbi implementation I’ve used during H3 testing runs at about 61 Mc/s, using code contributed by Jeremy Buhler (Washington University, Computer Science).

What’s “Mc/s”? Mc/s = millions of dynamic programming cells per second: (size of query in residues x size of target database in residues) / (time to run the search in cpu-seconds) / 1 million. I use my desktop for timing; it’s an Intel Xeon (Dempsey) 3.2 GHz processor. Because dynamic programming algorithm time scales with the product of the query and target lengths, Mc/s is a query- and target-independent measure of performance.

A major design goal of H3 is to reach about 50% of WU-BLAST speed. I figured if we can get profile HMMs within shouting distance of BLAST, the major limitation of profile HMMs — slow speed — would cease to be an issue.

On typically sized queries, WU-BLAST and NCBI BLAST run at about 1600 Mc/s and 9000 Mc/s respectively (but they show a strong query-length dependence, getting higher Mc/s on longer queries, so the Mc/s measure isn’t quite as usefully invariant for timing heuristic algorithms).

A second major design goal of H3 is to use the Forward algorithm for scoring, not Viterbi; i.e. to calculate complete alignment ensembles, not just a single optimal alignment. This should (and does) give us more statistical power in remote homology searches, and it gives us the ability to calculate posterior probabilities for alignment confidence — how reliably HMMER thinks each individual residue is aligned.

The trouble is the two goals are in opposition, because Forward is even slower than Viterbi. H2’s Forward implementation runs at 7 Mc/s. The more optimized reference implementation we’ve used in H3 development (again from Jeremy Buhler) achieves 8 Mc/s.

So that left us (in 2007, when speed work began) with a hefty wall to overcome: a 100x speed differential between reality (8 Mc/s) and hope (the 800 Mc/s design goal).

The 2009 HMMER3 alpha1 release runs at about 2200 Mc/s, beating the design goal by about 3x. How’d it get there?

I used four tricks.

One: use SIMD vector instructions, combined with a striping technique described by Michael Farrar (Bioinformatics 23:156-161, 2007). Vector instructions let you carry out four floating point operations at once, if you can get the algorithm laid out in a way that lets you always do four independent calculations at the same time, and Farrar’s striping trick is a terrific way of laying out a dynamic programming alignment algorithm. You expect to get a 4x speed increase; Viterbi actually went from 61 Mc/s to 320 Mc/s, a 5x improvement, showing just how efficient Farrar’s trick is.

Two: calculate scores in reduced precision, sixteen-fold parallel instead of four-fold, with 8-bit integers instead of 32-bit floats. This incurs some additional overhead, and it overflows to “infinity” on high-scoring hits, so you have to use it as a filter, not a final answer, rescoring high-scoring hits with a proper Forward implementation. That got us to 800 Mc/s.

Three: a new heuristic algorithm called MSV, which I’m writing up now in a paper. MSV is again a filter; high scoring hits need to be rescored with the real Forward implementation. MSV by itself runs at about 2900 Mc/s on a typical query, but with a strong query length dependence similar to the BLASTs.

Four: notice that you still need to calculate the final answer with a real Forward implementation, so even with superfast filters pruning away most low-scoring, uninteresting sequences in a database search, you don’t want a slow Forward implementation on the back end. H3’s Forward implementation is four-way vectorized and striped (like Viterbi). Additionally, I use a technique (not yet written up) that I’m calling “sparse rescaling” to keep the Forward implementation in probability space rather than log probability space (the traditional choice, even in speech recognition applications of HMM algorithms, for reasons having to do with numerical underflow of floating-point calculations). So H3’s Forward algorithm uses efficient multiply and add instructions, rather than requiring expensive log and exp approximations. It runs at 180 Mc/s, up 25x from our reference implementation.

Then what H3 does, for each target sequence, is process it through a filtering pipeline: does it have a high enough MSV score? OK, does it have a high enough ViterbiFilter score (the 16-fold parallel Viterbi)? OK, does it have a high enough Forward score? OK, it’s a hit; calculate the Backward algorithm too, and postprocess it for domain structure and alignments and biased composition corrections.

Whew. I was intending to tell you about a road map for SIMD vectorization support. What I’ve just told you is all to say, SIMD vectorization is now indispensable in HMMER3. The vectorization is integral to the speed, and completely integral to how the MSV heuristic algorithm works. (A serial implementation of the MSV algorithm doesn’t run much faster than Viterbi.) So suddenly we care a lot about SIMD vector instruction sets.

The main one is Intel/AMD SSE instructions. We support these now; this is what’s in the alpha1 code.

Next we’ll support PowerPC VMX/Altivec instructions. This’ll enable us to run the code on most PowerPC-based platforms, including certain blade servers from IBM. We also need VMX instructions before starting work on a port to the IBM Cell processor.

We’re aware that Sun SPARC uses VIS instructions. We don’t have a Sun SPARC here at Janelia to develop and test on, so we currently have no plan to port to the VIS instruction set, unfortunately. We would welcome assistance.

We’ve just become aware that the IBM Blue Gene’s processors, though they’re PowerPC chips, don’t support VMX/Altivec instructions, but rather have some sort of two-way vectorization available. We don’t have access to a Blue Gene here, but I’m in contact with folks at IBM who do.

We’re going to provide a “generic” non-SIMD implementation of HMMER3, so the code runs portably anywhere, even if it’s cripplingly slow — this might be a useful baseline for people who are working on a SIMD instruction set we’re not aware of or not planning to support.

Finally, we are paying attention to Intel’s future roadmap. We are prepared to support AVX vector instructions (256-bit wide instead of today’s 128-bit wide) when they appear on Intel 32nm “Sandy Bridge” processors in 2010.

4 thoughts on “The need for (vector) speed.”

  1. Hi Sean,

    Thanks for the description. For trick 1 (use SIMD vector instructions) you mention the 5x speed increase for Viterbi. Do you get that for the forward score too?

  2. It wasn’t possible to do a straight-up vectorization of Forward to compare to Viterbi. Traditionally with Forward one uses a lookup-table-driven approximation to calculate sums of probabilities in log space without underflow, and without calling log() or exp() in the inner loop of a dynamic programming routine. I don’t see how to vectorize those lookups in SSE (I believe it can be done in Altivec/VMX). So I abandoned the usual log-space Forward implementation strategy, and had to find a way to make it work in probabilities, with another way to avoid underflow. That’s the “sparse rescaling” trick. So that’s why the Forward implementation has both striped vectorization and “sparse rescaling”; I don’t have a way to gauge their impact separately.

  3. Dear Friends,
    We will work on a scalable 256-FP vector array processor dedicated to HMMER3 vector speedup, and will require your help for user and programmer input. Any chance you might be coding using AVX? Our aims are a little different from Intel’s and other folks’: we basically focus on scientific research, and on making alignment speedup hardware specifically for gene alignment search using a Monte Carlo approach. Please email me if you have any questions.

  4. We aren’t currently coding AVX, but the code was written with portability to AVX in mind. We don’t have any Intel Sandy Bridge processors yet in our development farm. My understanding (please correct me if I’m wrong) is that Sandy Bridge is currently the only AVX-capable hardware shipping.

    I don’t think I understood the rest of your comment.
