snoscan and squid in the 21st century

Back in the 20th century, when Todd Lowe was a happy PhD student in my lab in St. Louis instead of toiling in irons as professor and chair at the Department of Biomolecular Engineering at UCSC, he wrote a program called snoscan for searching the yeast genome sequence for 2'-O-methyl guide C/D snoRNAs (small nucleolar RNAs), using a probabilistic model that combines consensus C/D snoRNA features and a guide sequence complementarity to a predicted target methylation site in ribosomal RNA [Lowe & Eddy, 1999]. Like all the software that's been developed in our group, snoscan has been freely available ever since. Over 30 years, our software, manuscript, and supplementary materials archive has so far survived several institutional moves (Boulder to MRC-LMB to St. Louis to Janelia Farm to Harvard), several generations of revision control system (nothing to rcs to cvs to subversion to git) and is now hosted in the cloud at eddylab.org, with some of the bigger projects also at GitHub.

It's one thing to keep software available. It's entirely another thing to keep it running. Software geeks know the term bit rot: a software package will gradually fall apart and stop running, even if you don't change a thing in it, because operating systems, computer languages, compilers, and libraries evolve.

We got a bug report on snoscan the other day, that it fails to compile on Red Hat Linux. (Adorably, the report also asked "how were you ever able to compile it?". Well, the trick for compiling 18-year-old code is to use an 18-year-old system.) Indeed it does fail, in part because the modern glibc library puts symbols into the namespace that weren't in glibc in 1999, causing a C compiler in 2017 to fatally screech about name clashes.

I just spent the day unearthing snoscan-0.9b and whacking it back into compilable and runnable shape, which felt sort of like my college days keeping my decrepit 1972 MG running with a hammer and some baling wire, except that I wasn't cursing Lucas Electrical every five minutes. For those of you who may be trying to build old software from us -- and maybe even for you idealists dreaming of ziplessly, magically reproducible computational experiments across decades -- here's a telegraphic tale of supporting a software package that hasn't been touched since Todd's thesis was turned in in 1999, in the form of a few notes on the main issues, followed by a change log for the hammers and baling wire I just used to get our snoscan distribution working again.

squid is now easel

Since about 2004, much of the software in our group uses a C99 code library called easel, which roughly means "eddylab sequence library (esl)". Before Easel, we used a library called squid, which roughly means "sequence (something something)". squid development ramped down in 2000-2003 as I decided to replace it with easel, largely because the library had grown to the point that we needed to get much more disciplined about modularization, testing, versioning, and documentation. squid was bundled with a variety of software projects from our group back then, including old versions of HMMER and Infernal. The last squid version we released (and the last version at the head of our internal git repo) is 1.9g, from January 2003. Todd's snoscan, circa 1999, was bundled with version 1.5j.

fixing getline() name clashes

The bug report is that the C compiler fails on sqio.c, complaining of multiple definitions of getline(). This is because getline() is now a function defined in glibc, as well as being defined in squid. I could complain about the glibc project contaminating the C namespace with their goddamned generic function names, except that my squid project is contaminating the C namespace with its goddamned generic function names. (Easel, in contrast, prefixes all its functions with esl_ to avoid this sort of thing.)

The simple fix for this (if you want to play this game at home) is to go into sqio.c and query/replace 33 occurrences of getline() with a better name, something like sqd_getline().
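
As a rough sketch of what the rename buys you: glibc's <stdio.h> now declares the POSIX function ssize_t getline(char **lineptr, size_t *n, FILE *stream), so a library function with the same name but a different shape collides with it. The function below is a hypothetical stand-in for squid's line reader, not the code in sqio.c; its signature and newline handling are assumptions made for illustration. With the sqd_ prefix it coexists happily with the glibc declaration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for squid's line reader, renamed with an sqd_ prefix.
 * The signature is an assumption for illustration, not copied from sqio.c.
 *
 * Reads one line from fp into s (at most n-1 characters), strips any
 * trailing newline, and returns s. Returns NULL at end of file.
 */
static char *sqd_getline(char *s, int n, FILE *fp)
{
    if (fgets(s, n, fp) == NULL) return NULL;   /* EOF or read error */
    s[strcspn(s, "\r\n")] = '\0';               /* chop at first CR or LF */
    return s;
}
```

Because the name no longer matches anything in <stdio.h>, the compiler has nothing to complain about, whichever version of glibc is installed.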

more bit rot

My wife accuses me of being an optimist, and indeed, I totally thought that I would swap out that name in sqio.c, type make, and everything else in snoscan would be roses. What could go wrong? Ha. Ha ha ha.

change log for snoscan-0.9.1 (2017)

In squid:

  • In sqio.c, change getline() to sqd_getline() in 33 places to avoid glibc name clash.
  • In interleaved.c, fix two misplaced parentheses in strncmp() calls. These should have never worked.
  • gnuregex.c, the old GNU regular expression parsing library, is even more bit-rotted than squid. Generates a hopelessly daunting list of compiler warnings. Appears to expect pointers to be portably 32bit?! Replaced with the Henry Spencer hsregex.c code from squid 1.9g, including the Strparse() in hsregex.c.
    • Delete #include "squidconf.h" in hsregex.c; squid 1.5 is Makefile-only, does not use autoconf, which we started using in ~2000-2001.
    • Declaration of Strparse() taken from 1.9g sqfuncs.h.
    • Declaration of sqd_parse[], etc taken from 1.9g squid.h.
    • Modified Strparse() call (argument list) in readGCGdata().
  • removed Strparse() and #include "gnuregex.h" from sre_string.c
  • bumped version from 1.5j to 1.5.11. We have squid under git control, but only its master branch, which (circa 2003) is at 1.9g.
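
For anyone wondering how a misplaced parenthesis in a strncmp() call can go unnoticed for years, here is a minimal sketch of that class of bug. It is not the actual code from interleaved.c: the function name is_end_marker() and the "//" marker are hypothetical, chosen only to show the pattern.

```c
#include <string.h>

/* Hypothetical example of the bug class; not the code from interleaved.c.
 *
 * Buggy form, with the parenthesis closing too late:
 *
 *     if (strncmp(line, "//", 2 == 0)) ...
 *
 * Here `2 == 0` evaluates to 0 and is passed as the length argument, so
 * strncmp() compares zero characters and returns 0 for every input.
 * The test never fires, and the code compiles without an error.
 *
 * Fixed form: compare two characters, then test the result against 0.
 */
static int is_end_marker(const char *line)
{
    return strncmp(line, "//", 2) == 0;
}
```

The buggy version is still legal C, which is why it survives until someone reads the line carefully or a newer compiler starts warning about it.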

In snoscan:

  • Fix string arg (extra & dereferences), 2x, in ReadMethData(). Should have never worked.
  • In search.c, an array needs to be allocated guidebox[6] not [5]. Should have never worked.
  • Perl's require("getopts.pl") and &Getopts() are obsolete. In sort-snos, replaced with use Getopt::Std and &getopts.
  • Deleted a laughable 18-year-old comment in README about how this is just a beta release and a properly documented and user-friendly version will come Real Soon Now. But left another one in, for posterity.
  • Bumped version to 0.9.1 from 0.9b (which meant "0.9 beta").
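
The guidebox[5] versus guidebox[6] fix is the classic off-by-one in C array sizing. The sketch below assumes, purely for illustration, that the array holds a five-character string plus its terminating NUL; what guidebox actually stores in search.c is not described here, and MOTIF_LEN and copy_motif() are hypothetical names.

```c
#include <stddef.h>

#define MOTIF_LEN 5   /* hypothetical: number of characters to store */

/* Copy up to MOTIF_LEN characters of src into dst and NUL-terminate.
 * dst must have room for MOTIF_LEN + 1 chars: five for the text and one
 * for the terminator. Declaring it as char dst[MOTIF_LEN] would make the
 * final assignment write one byte past the end of the array.
 */
static void copy_motif(char dst[MOTIF_LEN + 1], const char *src)
{
    size_t i;

    for (i = 0; i < MOTIF_LEN && src[i] != '\0'; i++)
        dst[i] = src[i];
    dst[i] = '\0';
}
```

A one-byte overrun like this often lands in padding or in a variable nobody reads afterwards, which is one way code of this kind can appear to work for years.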

snoscan in the 21st century

  • sort-snos has a hardcoded Perl5 path in its first line, /usr/local/bin/perl. On my main system, I have to manually change this to /opt/local/bin/perl.
  • The scan-yeast shell script calls snoscan and sort-snos as commands, which means it assumes you've put them somewhere that's in your ${PATH}. Alternatively, edit the calls to ./snoscan and ./sort-snos to run the shell script in your current directory, which is what I did.
  • snoscan does not have a test suite, and I have only done some limited testing on the examples after making these changes. Although snoscan does appear to run as expected, valgrind reports some memory corruption problems. (Awesome and indispensable, valgrind was released in 2002. We test with it routinely now.) Valgrind also detects problems in snoscan 0.9b, so I don't think the problems are the result of my changes. I was not able to find and fix these memory problems in the time I have; they would require a deeper dive into the code. The output appears to be ok (it's certainly what we used for our 1999 publication, so, um, do you want reproducibility or do you want correctness? you've got reproducibility). It's possible that the memory issues are harmless, but that may be an optimist talking.

new release

snoscan-0.9.1.tar.gz

Let me know about any rotted bits I missed.