Cryptic genetic variation in software: hunting a buffered 41-year-old bug

In genetics, cryptic genetic variation means that a genome can contain mutations whose phenotypic effects are invisible because they are suppressed or buffered, but under rare conditions they become visible and subject to selection pressure.

In software code, engineers sometimes also face the nightmare of a bug in one routine that has no visible effect because of a compensatory bug elsewhere. You fix the other routine, and suddenly the first routine starts failing for an apparently unrelated reason. Epistasis sucks.

I’ve just found an example in our code, and traced the origin of the problem back 41 years to the algorithm’s description in a 1973 applied mathematics paper. The algorithm — for sampling from a Gaussian distribution — is used worldwide, because it’s implemented in the venerable RANLIB software library still used in lots of numerical codebases, including GNU Octave. It looks to me like the only reason the code has been working is that a compensatory “mutation” has been selected for in everyone’s code except mine.

The bug report

Elena Rivas reported yesterday that our routine for random sampling from a Gaussian distribution, esl_rnd_Gaussian(), had crashed. As with most bug reports on our code, my initial reaction was disbelief. First, there are no bugs in our code. Second, we’ve had this routine for at least a decade and never seen it fail before. Third, it’s not even my code, it’s derived from RANLIB, a well-trusted and well-used numerical library.

Then, as with most bug reports on our code, my next reaction was well, goddammit, I guess I’d better look into it. We actually don’t sample from Gaussian distributions much (we’re more about Gumbels and Dirichlets and gammas, oh my), so that code isn’t as tested as a lot of our other code. It’s possible there could be a rare problem we hadn’t seen until Elena started running some big numerical simulations recently.

Finding the bug

Indeed, it’s rare, and it took a while to reproduce. I took several billion Gaussian-distributed samples from it before I saw it fail. The numerology of that rarity alone made me suspicious of where the problem might lie. I’ve seen bugs that fail at a ~1 in 2^32 rate before. One way they happen is when someone’s confused about whether a 32-bit integer interval is open or closed… meaning, a number (a 32-bit int normalized to 0..1, say) is said to be “between 0 and 1” but it absolutely matters whether it’s [0,1] versus (0,1). And that’s where the bug turned out to be.

(If you’re not familiar with the notation here, an interval is called closed if it includes the endpoints: [0,1] means 0 <= x <= 1, 0 and 1 are possible. An interval is open if it does not include the endpoints: (0,1) means 0 < x < 1.)
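To make the distinction concrete, here’s a toy sketch (mine, not RANLIB’s; both helper names are made up) of how the same 32-bit integer source yields one interval or the other, depending on a single off-by-one choice in the scaling:

#include <stdint.h>

/* Toy illustration of interval endpoints. Assume n is a uniform
 * 32-bit integer on [0, 2^32-1], fresh from your generator.       */
double closed_open ( uint32_t n ) { return n / 4294967296.0; }         /* [0,1): n=0 gives exactly 0.0   */
double open_open   ( uint32_t n ) { return (n + 1.0) / 4294967297.0; } /* (0,1): neither endpoint occurs */

One representable value of difference, and a failure rate of about one in four billion.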

Below is the relevant code snippet from the C source of RANLIB’s snorm() function, from which we derived our code for Gaussian sampling. snorm() implements a numerical algorithm that transforms a uniform random sample into a Gaussian-distributed sample. It gets its uniform random sample from a routine called r4_uni_01(), which is in the lower-level RNGLIB library.

To see the bug, look at the code below and suppose that your uniform random deviate u “between 0 and 1” happens to get sampled from r4_uni_01() as exactly 0. Now s = 0, and 2*u - s is 0, and 32*0 is 0, and (int) 0 is 0, and you branch to the i == 0 block; now what the code wants to do is loop, doubling u and incrementing the array index i each time until u >= 1. But if u is 0, doubling u does nothing, and this loop can never terminate normally. The index i increments until it runs off the end of the parameter table d[], and the code crashes.

u = r4_uni_01 ( );                 /* uniform deviate, "between 0 and 1"      */
if ( u <= 0.5 ) s = 0.0;           /* peel off the first bit of u as a sign s */
else s = 1.0;
u = 2.0 * u - s;                   /* shift that bit out of u                 */
u = 32.0 * u;
i = (int) ( u );                   /* index into the parameter tables;        */
if ( i == 32 ) i = 31;             /* if u was exactly 0, i is 0              */
if ( i != 0 ) { ... stuff happens ... }
else  {                            /* u < 1/32 lands here                     */
    i = 6;
    aa = a[31];
    for ( ; ; )
    {
      u = u + u;                   /* double u ...                            */
      if ( 1.0 <= u ) break;       /* ... until u >= 1: never, if u == 0      */
      aa = aa + d[i-1];
      i = i + 1;                   /* so i walks off the end of d[]           */
    }

So obviously, here it’s very important that the uniform random number be sampled on the open interval (0,1), not the half-open interval [0,1), or this routine will (very rarely) fail.

In our code, we’d replaced r4_uni_01() with our own random number generator, esl_random(). Why we use our own RNG is a whole ’nother story, about how important it is to have reproducible, high-quality, platform-independent pseudorandom numbers in scientific research code; our codebases include our own implementation of the fast and reliable Mersenne Twister generator. The esl_random() routine returns a uniform random deviate on the [0,1) interval: it can return exactly 0, but not exactly 1.

Voila: the bug.
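The fix, at least, is as small as the bug: resample until the deviate is nonzero, turning [0,1) into the (0,1) the algorithm needs. A minimal sketch (esl_random() is as described above; the ESL_RANDOMNESS state object and the wrapper name are my labels for the sketch, not the actual Easel patch):

/* Sketch: adapt a [0,1) uniform generator to (0,1) by rejecting
 * exact zeros. Illustrative only, not the actual Easel fix.      */
double esl_random_open ( ESL_RANDOMNESS *rng )
{
  double u;
  do { u = esl_random(rng); }  /* uniform on [0,1)                */
  while ( u == 0.0 );          /* resample the ~1-in-2^32 zero    */
  return u;                    /* now uniform on (0,1)            */
}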

It’s not my bug

So why the hell did I use a random number sample with the wrong interval endpoints? I don’t remember for sure, but I think it’s because I actually read the literature.

The algorithm implemented by snorm() is from Extensions of Forsythe’s Method for Random Sampling from the Normal Distribution, by J.H. Ahrens and U. Dieter, Mathematics of Computation 27:927-937 (1973). The RANLIB algorithm is “algorithm FT”, on page 931, and the relevant snippet is:

Algorithm FT (Forsythe, modified)
1. Generate u. Store the first bit of u as a sign s (s=0 if u<1/2, s=1 if u>=1/2).
Left-shift u by 1 bit (u = 2u-s). Initialize i=1 and a=0.
2. Left shift u by 1 bit (u = 2u). If u >=1 go to 4.
3. Set a=a+d[i], and then i=i+1. Go back to 2.
...

That’s enough to see that the same bug is potentially in the original paper, depending on what the endpoints of the uniform deviate u are supposed to be. Exactly as in the RANLIB implementation, you can see in steps 2 and 3 that the algorithm keeps doubling u until u >= 1, which will never happen if u = 0.

So of course the question is, what interval is u supposed to be on? In the preamble above the algorithm, Ahrens and Dieter say “In the description below, generate u indicates taking a sample from the uniform distribution in [0,1)”. Voila: I suppose that’s why I used a sample on the [0,1) interval. And voila: the bug.

Why hasn’t all the world’s simulation software crashed already?

A schroedinbug is when you’re inspecting some code that’s been working fine for years, and you suddenly realize that it contains a horrible, obvious bug and should never have worked in the first place. The moment you observe the schroedinbug, your code instantly stops working throughout the world, and bug reports start streaming in. We’ve had schroedinbugs before, and this might turn out to be one. I’m puzzled why we haven’t seen it before.

Even more puzzling, this is code derived from RANLIB, for god’s sake… which has been in numerical codebases since the Pleistocene. Why hasn’t everything that depends on RANLIB Gaussian samples been crashing?

Or, more terrifyingly — the fault is rare enough (one in billions) that maybe it has been crashing. Perhaps I should look into this a bit more, I thought.

Obviously, a good guess is that the RNGLIB r4_uni_01() routine must not return exactly 0, so that no problem manifests. So I looked at it:

R4_UNI_01 returns a uniform random real number in [0,1].

Ah. Oops. So much for that.

But then when I looked a little further:

  Discussion:
    This procedure returns a random floating point number from a uniform 
    distribution over (0,1), not including the endpoint values, using the
    current random number generator.

Um. OK, that’s not the first time I’ve seen someone’s documentation contradict itself, and I’m not going to throw any stones in that particular glass house.

Which is it, open or closed interval? To get at that, you’ve got to read source code, and that eventually leads you to another RNGLIB routine, i4_uni(), which

returns a random integer following a uniform distribution over (1, 2147483562)

which, sigh, is also incorrect; i4_uni() can definitely return exactly 1, so I think they mean the closed interval [1,2147483562]. But it doesn’t matter. So long as it doesn’t return 0, then r4_uni_01() won’t return 0, and it doesn’t return 0.
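Mechanically, it’s easy to see why: as best I can tell from the RNGLIB source, r4_uni_01() just scales i4_uni()’s integer directly, essentially like this (paraphrased, so treat the constant as approximate):

extern int i4_uni ( void );  /* RNGLIB: uniform integer, never 0 */

/* Paraphrase of RNGLIB's r4_uni_01(): scale the integer into a
 * float. Since i4_uni() returns at least 1, the smallest possible
 * result is ~4.7e-10 -- never exactly 0.                          */
float r4_uni_01_sketch ( void )
{
  return ( float ) ( i4_uni ( ) ) * 4.656613057E-10;
}

And a nonzero uniform deviate means snorm()’s doubling loop always terminates.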

So what I’m saying is, RANLIB’s snorm() routine only works in RANLIB because the authors used a routine for random number samples that isn’t what the original Ahrens/Dieter (1973) algorithm paper calls for. That’s what we geneticists would call a compensatory or suppressor mutation.

I also looked at the RANLIB implementation in GNU Octave. There, the Octave authors have also replaced the random number generator with their own, ranf(), and it happens that ranf() generates on the open interval (0,1), again suppressing the bug in the published algorithm.

Given the pervasive confusion in various documentation over the exact bounds of the intervals of their random numbers, I think it’s fair to say that this code has persisted by a process of selection, not by intelligent design. My “mutation” faithfully restored the original intelligent Ahrens/Dieter 1973 algorithm design… and broke the code.

Zuker’s schadenfreude

If Michael Zuker is reading this, he will have already gleefully remembered that he sent me an email in 2012:

…I was amused (horrified?) when, while looking “under the hood”, I discovered that you are using some ancient and obscure method to generate Gaussian samples.

and he went on to suggest a cleaner, more modern alternative to me. Which has been on my to-do list ever since. And which, ironically, I’ve been terrified of implementing, because once you have code that’s (apparently) been working for a long time, the last thing you want to do is to go in and start improving it, because man, that’s just asking for trouble.
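For what it’s worth, I won’t put words in Zuker’s mouth, but the sort of cleaner alternative he presumably meant is something like the Marsaglia polar method: short, self-contained, and with no 32-entry parameter tables to walk off the end of. A sketch, with C’s rand() standing in for a real uniform generator (don’t do that in production):

#include <math.h>
#include <stdlib.h>

/* Marsaglia polar method: shown only as an example of a cleaner
 * Gaussian sampler; rand() is a placeholder uniform source.      */
double gaussian_polar ( void )
{
  double x, y, s;
  do {
    x = 2.0 * rand ( ) / ( RAND_MAX + 1.0 ) - 1.0;  /* uniform on [-1,1)      */
    y = 2.0 * rand ( ) / ( RAND_MAX + 1.0 ) - 1.0;
    s = x * x + y * y;
  } while ( s >= 1.0 || s == 0.0 );                 /* keep points inside the
                                                       unit circle            */
  return x * sqrt ( -2.0 * log ( s ) / s );         /* one N(0,1) deviate     */
}

Note the s == 0.0 guard: even here, an endpoint case (both deviates exactly zero) would feed log() a zero. Interval endpoints are apparently where Gaussian samplers go to die.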

Code is an evolved system. You adopt a new algorithm and all you’ve done is teleport to a new point in the fitness landscape. You may indeed be near a higher fitness peak, but you aren’t likely to be on it yet. You’ll have to do a whole tedious evolutionary optimization process all over again to get your new code to be reliable and trusted. It’s a wonder anything works, either in genetics, or in software.

15 thoughts on “Cryptic genetic variation in software: hunting a buffered 41-year-old bug”

  1. Unless I’m too confused by the mess of open and closed intervals here (or I simply misunderstood you), I think there are two mistakes in the text:
    “so I think they mean the closed interval [1,2147483562]” should say “so I think they mean the open interval [1,2147483562]” and
    “ranf() generates on the open interval (0,1)” should be “ranf() generates on the closed interval (0,1)”.
    Right?


    • [citation needed]!

      ]a,b[ I’ve seen for open interval. For example, here. But I’ve never seen (a,b) mean closed interval. That’s just horrible & wrong & backwards, surely.


  2. Interesting stuff that you’re dealing with! Thanks for sharing :)

    I’m wondering though…

    The bug that you describe seems to be “fail-fast”, right? By that I don’t mean that it occurs often and is fast to find. Rather I mean: once the program enters the erroneous state (i.e. u is set to 0 in your algorithm), the failure always becomes apparent right away and cannot propagate to other parts of the program unnoticed (here: the algorithm always gets stuck in an infinite loop immediately and never returns a bogus value). So “fail-fast” is a good thing.

    The compensatory bug in r4_uni_01() of RNGLIB, on the other hand, is not “fail-fast” in general. The faulty implementation may or may not return a value that contradicts its documentation (i.e. 0). However, that does not necessarily become apparent to the caller immediately. So the erroneous state can propagate to other parts of the program — and may cause seemingly unrelated trouble eventually.

    One of the worst case scenarios:

    r4_uni_01() is used by some program where the absence of 0 is strictly expected for true randomness. Maybe the program itself does work without ever failing, even if 0 is introduced. But the scientific results that it produces may be slightly skewed, just because they depended on a truly uniform distribution but instead got a slightly distorted one. Can you imagine that this might have affected research in genome sequence analysis?

    My naive guess would be: probably not if you work with real numbers. Any discrete representation of reals introduces far more potential for distortion. So the numeric algorithms that further process these reals are designed to deal with that. What do you think?

    Or worse yet?

    Random integers (like the ones produced by i4_uni()), on the other hand, are also used for stuff like cryptography. There, a truly random distribution is often essential. Off-by-one problems (e.g. open vs. closed intervals) might leave the resulting cryptographic algorithms vulnerable. Any chance that RNGLIB is being used for such applications? Or might there be similar bugs slumbering in other libs for decades?


    • @michael: Heartbleed was such a long-duration bug. In this case, it looks like Sean and team caught a bug in code + docs, where the docs didn’t accurately represent what the code was doing, nor for that matter accurately represent what the algorithm was supposed to do. Are there others like it out there? Yes, and I am sorry to say that no one writes error-free code. All you can do is test, and make sure your test coverage includes your implicit and explicit assumptions. When you distribute your code, you should also distribute your tests, as bugs in your tests can mask bugs in your code. Is RNGLIB used in security apps? Not likely … there’s a class of PRNGs that are good for research codes, but not so good for cryptographic codes. The class of cryptographically viable PRNGs is far smaller than the class of research-quality PRNGs (which are themselves a small subset of all PRNGs). RNGLIB isn’t quite up to cryptographic requirements. cf. http://en.wikipedia.org/wiki/Cryptographically_secure_pseudorandom_number_generator



  3. “Code is an evolved system. You adopt a new algorithm and all you’ve done is teleport to a new point in the fitness landscape.”

    I wish every new programmer (my younger self included) could have caught this one sentence before they got into “but I know a better way and it’s SO much cleaner/faster/more elegant/…”. Puts the potential cost in perfect perspective.

    Like

  4. “It’s a wonder anything works, either in genetics, or in software.”

    I’ve had this perspective for years. So many people get mad when their car won’t start, or say “why that person?” when they come down with some non-communicable disease. But I’m the opposite. I’m amazed at how often my car starts with no problem, and how often creatures live their lives with no major malfunctions. It seems to be against the odds to me.

    If you don’t wonder how anything manages to work at all, I figure you aren’t really paying attention.

