Cryptic genetic variation in software: hunting a buffered 41 year old bug

In genetics, cryptic genetic variation means that a genome can contain mutations whose phenotypic effects are invisible because they are suppressed or buffered, but under rare conditions they become visible and subject to selection pressure.

In software code, engineers sometimes also face the nightmare of a bug in one routine that has no visible effect because of a compensatory bug elsewhere. You fix the other routine, and suddenly the first routine starts failing for an apparently unrelated reason. Epistasis sucks.

I've just found an example in our code, and traced the origin of the problem back 41 years to the algorithm's description in a 1973 applied mathematics paper. The algorithm -- for sampling from a Gaussian distribution -- is used worldwide, because it's implemented in the venerable RANLIB software library still used in lots of numerical codebases, including GNU Octave. It looks to me that the only reason code has been working is that a compensatory "mutation" has been selected for in everyone else's code except mine.

the bug report

Elena Rivas reported yesterday that our routine for random sampling from a Gaussian distribution, esl_rnd_Gaussian(), had crashed. As with most bug reports on our code, my initial reaction was disbelief. First, there are no bugs in our code. Second, we've had this routine for at least a decade and never seen it fail before. Third, it's not even my code, it's derived from RANLIB, a well-trusted and well-used numerical library.

Then, as with most bug reports on our code, my next reaction was well, goddammit, I guess I'd better look into it. We actually don't sample from Gaussian distributions much (we're more about Gumbels and Dirichlets and gammas, oh my), so that code isn't as tested as a lot of our other code. It's possible there could be a rare problem we hadn't seen until Elena started running some big numerical simulations recently.

finding the bug

Indeed, it's rare, and it took a while to reproduce. I took several billion Gaussian-distributed samples from it before I saw it fail. The numerology of that rarity alone made me suspicious of where the problem might lie. I've seen bugs that fail at a \(\sim 1\) in \(2^{32}\) rate before. One way they happen is when someone's confused about whether a 32-bit integer interval is open or closed... meaning, a number (a 32-bit int normalized to 0..1, say) is said to be "between 0 and 1" but it absolutely matters whether it's [0,1] versus (0,1). And that's where the bug turned out to be.

(If you're not familiar with the notation here, an interval is called closed if it includes the endpoints: [0,1] means \(0 \leq x \leq 1\), 0 and 1 are possible. An interval is open if it does not include the endpoints: (0,1) means \(0 < x < 1\).)

Below is the relevant code snippet from the C source of RANLIB's snorm() function, which is where we derived our code for Gaussian sampling from. snorm() implements a numerical algorithm that transforms a uniform random sample to a Gaussian-distributed sample. It gets its uniform random sample from a routine called r4_uni_01(), which is in the lower-level RNGLIB library.

To see the bug, look at the code below and suppose that your random uniform deviate u "between 0 and 1" happens to get sampled from r4_uni_01() as exactly 0. Now s = 0, and 2*u-s is 0, and 32*0 is 0, and (int)0 is 0, and you branch to the i==0 block; now what the code wants to do is loop, doubling u and incrementing array index i each time until u>1. But if u is 0, doubling u does nothing, and this loop cannot ever terminate normally. The index i increments until it exceeds the range of parameter table d[i], and the code crashes.

u = r4_uni_01 ( );         
if ( u <= 0.5 ) s = 0.0;
else s = 1.0;
u = 2.0 * u - s;
u = 32.0 * u;
i = (int) ( u );
if ( i == 32 ) i = 31;
if ( i != 0 ) { ... stuff happens ... }
else  {
    i = 6;
    aa = a[31];
    for ( ; ; )
      u = u + u;
      if ( 1.0 <= u ) break;
      aa = aa + d[i-1];
      i = i + 1;

So obviously, here it's very important that the uniform random number must be sampled on the (left) open interval (0,1) not the closed interval [0,1), or this routine will (very rarely) fail.

In our code, we'd replaced r4_uni_01 with our random number generator, esl_random(). Why we use our own RNG is a whole 'nother story, about how important it is to have reproducible, high-quality, platform-independent pseudorandom numbers in scientific research code, so our codebases include our own implementation of the fast and reliable Mersenne Twister generator. The esl_random() routine returns a uniform random deviate on the [0,1) interval: it can return exactly 0, but not exactly 1.

Voila: the bug.

it's not my bug

So why the hell did I use a random number sample with the wrong interval endpoints? I don't remember for sure, but I think it's because I actually read the literature.

The algorithm implemented by snorm() is from Extensions of Forsythe's Method for Random Sampling from the Normal Distribution, by J.H. Ahrens and U. Dieter, Mathematics of Computation 27:927-937 (1973). The RANLIB algorithm is "algorithm FT", on page 931, and the relevant snippet is:

Algorithm FT (Forsythe, modified)
1. Generate u. Store the first bit of u as a sign s (s=0 if u<1/2,
s=1 if u>=1/2).  Left-shift u by 1 bit (u = 2u-s). Initialize i=1
and a=0.
2. Left shift u by 1 bit (u = 2u). If u >=1 go to 4.
3. Set a=a+d[i], and then i=i+1. Go back to 2.  ...

That's enough to see that the same bug is potentially in the original paper, depending on what the endpoints of the uniform deviate u are supposed to be. Exactly as in the RANLIB implementation, you can see in step 2 and 3 that the algorithm is trying to double until it is >1, which will never happen if u=0.

So of course the question is, what interval is u supposed to be on? In the preamble above the algorithm, Ahrens and Dieter say "In the description below, generate u indicates taking a sample from the uniform distribution in [0,1)". Voila: I suppose that's why I used a sample on the [0,1) interval. And voila: the bug.

why hasn't all the world's simulation software crashed already?

A schrodinbug is when you're inspecting some code that's been working fine for years, and you suddenly realize that it contains a horrible, obvious bug and it should never have worked in the first place. The moment you observe the schrodinbug, your code instantly stops working throughout the world, and bug reports start streaming in. We've had schrodinbugs before and this might turn out to be one. I'm puzzled why we haven't seen it before.

Even more puzzling, this is code derived from RANLIB, for god's sake... which has been in numerical codebases since the Pleistocene. Why hasn't everything that depends on RANLIB Gaussian samples been crashing?

Or, more terrifyingly -- the fault is rare enough (one in billions) that maybe it has been crashing. Perhaps I should look into this a bit more, I thought.

Obviously, a good guess is that the RNGLIB r4_uni_01() routine must not return exactly 0, so that no problem manifests. So I looked at it:

R4_UNI_01 returns a uniform random real number in [0,1].

Ah. Oops. So much for that.

But then when I looked a little further:

  Discussion:     This procedure returns a random floating point
number from a uniform     distribution over (0,1), not including
the endpoint values, using the     current random number

Um. OK, that's not the first time I've seen someone's documentation contradict itself, and I'm not going to throw any stones in that particular glass house.

Which is it, open or closed interval? To get at that, you've got to read source code, and that eventually leads you to another RNGLIB routine, i4_uni(), which

returns a random integer following a uniform distribution over (1,

which, sigh, is also incorrect; i4_uni() can definitely return exactly 1, so I think they mean the closed interval [1,2147483562]. But it doesn't matter. So long as it doesn't return 0, then r4_uni_01() won't return 0, and it doesn't return 0.

So what I'm saying is, RANLIB's snorm() routine only works in RANLIB because the authors used a routine for random number samples that isn't what the original Ahrens/Dieter (1973) algorithm paper calls for. That's what we geneticists would call a compensatory or suppressor mutation.

I also looked at the RANLIB implementation in GNU Octave. There, the Octave authors have also replaced the random number generator with their own, ranf(), and it happens that ranf() generates on the open interval (0,1), again suppressing the bug in the published algorithm.

Given the pervasive confusion in various documentation over the exact bounds of the intervals of their random numbers, I think it's fair to say that this code has persisted by a process of selection, not by intelligent design. My "mutation" faithfully restored the original intelligent Ahrens/Dieter 1973 algorithm design... and broke the code.

Zuker's schadenfreude

If Michael Zuker is reading this, he will have already gleefully remembered that he sent me an email in 2012:

...I was amused (horrified?) when, while looking "under the hood", I  
discovered that you are using some ancient and obscure method to  
generate Gaussian samples.

and he went on to suggest a cleaner, more modern alternative to me. Which has been on my to do list since. And which, ironically, I've been terrified of implementing, because once you have code that's (apparently) been working for a long time, the last thing you want to do is to go in and start improving it, because man, that's just asking for trouble.

Code is an evolved system. You adopt a new algorithm and all you've done is teleport to a new point in the fitness landscape. You may indeed be near a higher fitness peak, but you aren't likely to be on it yet. You'll have to do a whole tedious evolutionary optimization process all over again to get your new code to be reliable and trusted. It's a wonder anything works, either in genetics, or in software.