In genetics,
*cryptic genetic variation*
means that a genome can contain mutations whose phenotypic effects are
invisible because they are suppressed or buffered, but under rare
conditions they become visible and subject to selection pressure.

In software code, engineers sometimes also face the nightmare of a bug in one routine that has no visible effect because of a compensatory bug elsewhere. You fix the other routine, and suddenly the first routine starts failing for an apparently unrelated reason. Epistasis sucks.

I've just found an example in our code, and traced the origin of the problem back 41 years to the algorithm's description in a 1973 applied mathematics paper. The algorithm -- for sampling from a Gaussian distribution -- is used worldwide, because it's implemented in the venerable RANLIB software library still used in lots of numerical codebases, including GNU Octave. It looks to me that the only reason code has been working is that a compensatory "mutation" has been selected for in everyone else's code except mine.

## the bug report

Elena Rivas reported yesterday that our routine for random sampling
from a Gaussian distribution, *esl_rnd_Gaussian()*, had crashed. As
with most bug reports on our code, my initial reaction was
disbelief. First, there are no bugs in our code. Second, we've had
this routine for at least a decade and never seen it fail
before. Third, it's not even my code, it's derived from RANLIB, a
well-trusted and well-used numerical library.

Then, as with most bug reports on our code, my next reaction was well, goddammit, I guess I'd better look into it. We actually don't sample from Gaussian distributions much (we're more about Gumbels and Dirichlets and gammas, oh my), so that code isn't as tested as a lot of our other code. It's possible there could be a rare problem we hadn't seen until Elena started running some big numerical simulations recently.

## finding the bug

Indeed, it's rare, and it took a while to reproduce. I took several billion Gaussian-distributed samples from it before I saw it fail. The numerology of that rarity alone made me suspicious of where the problem might lie. I've seen bugs that fail at a \(\sim 1\) in \(2^{32}\) rate before. One way they happen is when someone's confused about whether a 32-bit integer interval is open or closed... meaning, a number (a 32-bit int normalized to 0..1, say) is said to be "between 0 and 1" but it absolutely matters whether it's [0,1] versus (0,1). And that's where the bug turned out to be.

(If you're not familiar with the notation here, an interval is called
*closed* if it includes the endpoints: [0,1] means \(0 \leq x \leq 1\), 0
and 1 are possible. An interval is *open* if it does not include the
endpoints: (0,1) means \(0 < x < 1\).)

Below is the relevant code snippet from the C source of RANLIB's
*snorm()* function, which is where we derived our code for Gaussian
sampling from. *snorm()* implements a numerical algorithm that
transforms a uniform random sample to a Gaussian-distributed
sample. It gets its uniform random sample from a routine called
*r4_uni_01()*, which is in the lower-level RNGLIB library.

To see the bug, look at the code below and suppose that your random
uniform deviate *u* "between 0 and 1" happens to get sampled from
*r4_uni_01()* as *exactly* 0. Now *s* = 0, and 2**u*-*s* is 0, and
32*0 is 0, and (int)0 is 0, and you branch to the *i*==0 block; now
what the code wants to do is loop, doubling *u* and incrementing array
index i each time until *u*>1. But if *u* is 0, doubling *u* does
nothing, and this loop cannot ever terminate normally. The index *i*
increments until it exceeds the range of parameter table *d[i]*, and
the code crashes.

```
u = r4_uni_01 ( );
if ( u <= 0.5 ) s = 0.0;
else s = 1.0;
u = 2.0 * u - s;
u = 32.0 * u;
i = (int) ( u );
if ( i == 32 ) i = 31;
if ( i != 0 ) { ... stuff happens ... }
else {
i = 6;
aa = a[31];
for ( ; ; )
{
u = u + u;
if ( 1.0 <= u ) break;
aa = aa + d[i-1];
i = i + 1;
}
```

So obviously, here it's very important that the uniform random number must be sampled on the (left) open interval (0,1) not the closed interval [0,1), or this routine will (very rarely) fail.

In *our* code, we'd replaced *r4_uni_01* with our random number generator, *esl_random()*. Why we use our own RNG is a whole 'nother story, about how important it is to have reproducible, high-quality, platform-independent pseudorandom numbers in scientific research code, so our codebases include our own implementation of the fast and reliable Mersenne Twister
generator. The *esl_random()* routine returns a uniform random
deviate on the
[0,1) interval: it can return exactly 0, but not exactly 1.

Voila: the bug.

## it's not my bug

So why the hell did I use a random number sample with the wrong interval endpoints? I don't remember for sure, but I think it's because I actually read the literature.

The algorithm implemented by *snorm()* is from *Extensions of Forsythe's Method for Random Sampling from the Normal Distribution,*
by J.H. Ahrens and U. Dieter, *Mathematics of Computation* 27:927-937
(1973). The RANLIB algorithm is "algorithm FT", on page 931, and the
relevant snippet is:

```
Algorithm FT (Forsythe, modified)
1. Generate u. Store the first bit of u as a sign s (s=0 if u<1/2,
s=1 if u>=1/2). Left-shift u by 1 bit (u = 2u-s). Initialize i=1
and a=0.
2. Left shift u by 1 bit (u = 2u). If u >=1 go to 4.
3. Set a=a+d[i], and then i=i+1. Go back to 2. ...
```

That's enough to see that the same bug is potentially in the original
paper, depending on what the endpoints of the uniform deviate *u* are
supposed to be. Exactly as in the RANLIB implementation, you can see
in step 2 and 3 that the algorithm is trying to double *u *until it is
>1, which will never happen if *u=*0.

So of course the question is, what interval is *u* supposed to be on?
In the preamble above the algorithm, Ahrens and Dieter say "In the
description below, generate *u* indicates taking a sample from the
uniform distribution in
[0,1)". Voila: I suppose that's why I used a sample on the [0,1) interval. And voila: the bug.

## why hasn't all the world's simulation software crashed already?

A *schrodinbug* is when
you're inspecting some code that's been working fine for years, and
you suddenly realize that it contains a horrible, obvious bug and it
should never have worked in the first place. The moment you observe
the schrodinbug, your code instantly stops working throughout the
world, and bug reports start streaming in. We've had schrodinbugs
before and this might turn out to be one. I'm puzzled why we haven't
seen it before.

Even more puzzling, this is code derived from RANLIB, for god's
sake... which has been in numerical codebases since the
Pleistocene. Why hasn't *everything* that depends on RANLIB Gaussian
samples been crashing?

Or, more terrifyingly -- the fault is rare enough (one in billions)
that maybe it *has* been crashing. Perhaps I should look into this a
bit more, I thought.

Obviously, a good guess is that the
RNGLIB *r4_uni_01()*
routine must not return exactly 0, so that no problem manifests. So I
looked at it:

```
R4_UNI_01 returns a uniform random real number in [0,1].
```

Ah. Oops. So much for that.

But then when I looked a little further:

```
Discussion: This procedure returns a random floating point
number from a uniform distribution over (0,1), not including
the endpoint values, using the current random number
generator.
```

Um. OK, that's not the first time I've seen someone's documentation contradict itself, and I'm not going to throw any stones in that particular glass house.

Which is it, open or closed interval? To get at that, you've got to
read source code, and that eventually leads you to another
RNGLIB routine, *i4_uni()*,
which

```
returns a random integer following a uniform distribution over (1,
2147483562)"
```

which, sigh, is also incorrect; *i4_uni()* can definitely return
exactly 1, so I think they mean the closed interval
[1,2147483562]. But it doesn't matter. So long as it doesn't return
0, then *r4_uni_01()* won't return 0, and it doesn't return 0.

So what I'm saying is, RANLIB's *snorm()* routine only works in RANLIB
because the authors used a routine for random number samples that
isn't what the original Ahrens/Dieter (1973) algorithm paper calls
for. That's what we geneticists would call a compensatory or
suppressor mutation.

I also looked at the RANLIB implementation in GNU Octave. There, the
Octave authors have also replaced the random number generator with
their own, *ranf()*, and it happens that *ranf()* generates on the
open interval (0,1), again suppressing the bug in the published
algorithm.

Given the pervasive confusion in various documentation over the exact bounds of the intervals of their random numbers, I think it's fair to say that this code has persisted by a process of selection, not by intelligent design. My "mutation" faithfully restored the original intelligent Ahrens/Dieter 1973 algorithm design... and broke the code.

## Zuker's schadenfreude

If Michael Zuker is reading this, he will have already gleefully remembered that he sent me an email in 2012:

...I was amused (horrified?) when, while looking "under the hood", I

discovered that you are using some ancient and obscure method to

generate Gaussian samples.

and he went on to suggest a cleaner, more modern alternative to me. Which has been on my to do list since. And which, ironically, I've been terrified of implementing, because once you have code that's (apparently) been working for a long time, the last thing you want to do is to go in and start improving it, because man, that's just asking for trouble.

Code is an evolved system. You adopt a new algorithm and all you've done is teleport to a new point in the fitness landscape. You may indeed be near a higher fitness peak, but you aren't likely to be on it yet. You'll have to do a whole tedious evolutionary optimization process all over again to get your new code to be reliable and trusted. It's a wonder anything works, either in genetics, or in software.