mistakes were made

“No one really knows how the game is played
The art of the trade
How the sausage gets made
We just assume that it happens”


A while back Elena Rivas and I posted a response to a bioRxiv preprint (Tavares et al, 2018) that challenged some of the conclusions in our 2017 Nature Methods paper on R-scape, a method for detecting support for conserved RNA secondary structure in sequence alignments by statistical analysis of base pair correlations. At the Benasque RNA meeting last summer, Zasha Weinberg told us we’d made a mistake in our description of how the Weinberg and Breaker R2R program annotates “covarying” base pairs. We’ve just updated our PDF with a correction, and I added an update section to my July blog post to describe the mistake, and our revision. Thanks, Zasha!

Meanwhile, the Tavares et al. manuscript is still up at bioRxiv, with no response to our comment there, despite the fact that we argue that the manuscript is based on an artifact, one of the more spectacular artifacts I have seen in my career. The manuscript describes a method that finds “covariation” support for RNA base pairs in sequence alignments that have no variation or covariation at all.

I’m told that peer review is broken, and that what science really needs is preprint servers and post-publication review. How’s the preprint and open review culture doing in this example? I bet there are people out there citing Tavares et al as support for an evolutionarily conserved RNA structure in the HOTAIR lncRNA, because that’s what the title and abstract says. If people are even aware of our response, I bet they see it as a squabble between rival labs, because lncRNAs are controversial. I bet almost nobody has looked at our Figure 1, which shows Tavares’ method calling “statistically significant covariation” on sequence alignments of 100% identical sequences. I dunno that you’ll ever find a crispier example of post-publication peer review.

In my experience, this is just the way science works, and the way it has always worked. It is ultimately the responsibility of authors, not reviewers, to get papers right. No amount of peer review, pre-publication or post-, will ever suffice to make the scientific literature reliable. Scientists vary in their motivations and in how much they care. Reviews help, but only where authors are willing to listen. A pre-publication peer reviewer can recommend rejecting a paper; the authors can send it elsewhere. A post-publication review can point out a problem; the authors don’t have to pay attention. And who has time to sort out the truth, if the authors themselves aren’t going to?

Most of the literature simply gets forgotten. What matters in the long run is the stuff that’s both correct and important. It’s mostly pointless to squabble about the stuff that’s wrong; point it out politely, sure, but move on. You can’t change people’s minds if they don’t want to hear it.

I guess what I want you to know is that I’m on the side of wanting to get our work right. If you ever see something wrong in one of our papers, or our code, or any product of our laboratory or whatever comes out of my mouth, I want to know about it. Mind you, I won’t necessarily be happy about it – I’m thin skinned, my second grade report card said “DOESN’T TAKE CRITICISM WELL” — but I care deeply about making things right, both in big picture and in every detail.

So again: thanks, Zasha.




Response to Tavares et al., bioRxiv (2018)

A new bioRxiv preprint from Rafael Tavares, Anna Marie Pyle, and Srinivas Somarowthu challenges conclusions of our 2017 Nature Methods paper where we describe R-scape, a method for detecting support for conserved RNA secondary structure in sequence alignments by statistical analysis of base pair covariations. In our paper, among other things, we showed that the evidence presented by Somarowthu (2015) in support of a putative conserved structure for the HOTAIR lncRNA was not statistically significant, using the same alignment that they had analyzed. The new Tavares paper argues that by changing R-scape’s default statistic to a different one called RAFS, now statistically significant evidence for conserved structure is detected in their HOTAIR alignment and others.

Tavares’ conclusions depend on an assumption that the RAFS statistic is an appropriate measure of RNA base pair covariation, but RAFS was not designed to measure covariation alone. RAFS detects positive signals in common patterns of primary sequence conservation in absence of any covariation. The problem is severe;  Tavares’ analysis reports “significantly covarying base pairs” in 100% identical sequence alignments with no variation or covariation. The base pairs that Tavares et al. identify as significantly covarying actually arise from primary sequence conservation patterns. Their analysis still reports similar numbers of  “significant covarying” base pairs in negative controls in which we permute residues in independent alignment columns to destroy covariation. There remains no significant covariation support for evolutionarily conserved RNA structure in the HOTAIR lncRNA or other lncRNA structures and alignments we have analyzed.

We have posted a PDF of a full response to the Tavares et al. preprint on the lab’s web site.

[Update, 15 Nov 2018: We made a correction, marked in red in the PDF, where we describe how the Weinberg and Breaker R2R program annotates “covarying” base pairs. While it’s true that it only “requires a single compensatory pair substitution to annotate a pair as covarying”, it isn’t true that this is “regardless of the number of sequences or the number of substitutions that are inconsistent with the proposed structure.” We have corrected the latter phrase to read “so long as no more than 10% of the sequences are inconsistent with canonical base pairing of the two positions.” We also added a footnote to say that the Somarowthu (2015) paper customized R2R’s tolerance to allow up to 15% inconsistent base pairs to obtain their HOTAIR results. Thank you to Zasha Weinberg for correcting us.]

snoscan and squid in the 21st century

Back in the 20th century, when Todd Lowe was a happy PhD student in my lab in St. Louis instead of toiling in irons as professor and chair at the Department of Biomolecular Engineering at UCSC, he wrote a program called snoscan for searching the yeast genome sequence for 2′-O-methyl guide C/D snoRNAs (small nucleolar RNAs), using a probabilistic model that combines consensus C/D snoRNA features and a guide sequence complementarity to a predicted target methylation site in ribosomal RNA [Lowe & Eddy, 1999]. Like all the software that’s been developed in our group, snoscan has been freely available ever since. Over 30 years, our software, manuscript, and supplementary materials archive has so far survived several institutional moves (Boulder to MRC-LMB to St. Louis to Janelia Farm to Harvard), several generations of revision control system (nothing to rcs to cvs to subversion to git) and is now hosted in the cloud at eddylab.org, with some of the bigger projects also at GitHub.

It’s one thing to keep software available. It’s entirely another thing to keep it running. Software geeks know the term bit rot: a software package will gradually fall apart and stop running, even if you don’t change a thing in it, because operating systems, computer languages, compilers, and libraries evolve.

We got a bug report on snoscan the other day, that it fails to compile on Redhat Linux. (Adorably, the report also asked “how were you ever able to compile it?“. Well, the trick for compiling 18-year-old code is to use an 18-year-old system.) Indeed it does fail, in part because the modern glibc library puts symbols into the namespace that weren’t in glibc in 1999, causing a C compiler in 2017 to fatally screech about name clashes.

I just spent the day unearthing snoscan-0.9b and whacking it back into compilable and runnable shape, which felt sort of like my college days keeping my decrepit 1972 MG running with a hammer and some baling wire, except that I wasn’t cursing Lucas Electrical every five minutes. For those of you who may be trying to build old software from us — and maybe even for you idealists dreaming of ziplessly, magically reproducible computational experiments across decades — here’s a telegraphic tale of supporting a software package that hasn’t been touched since Todd’s thesis was turned in in 1999, in the form of a few notes on the main issues, followed by a change log for the hammers and baling wire I just used to get our snoscan distribution working again.

Continue reading →


In July 2015, our laboratory will move to Harvard University. I’ve accepted an offer to join Harvard’s faculty in two departments: Molecular & Cellular Biology (in the Faculty of Arts and Sciences), and Applied Mathematics (in the School of Engineering and Applied Sciences). The laboratory will be on the first floor of the venerable, historic Biolabs building on the Cambridge campus. Much like we optimized our Janelia location to be the closest lab to the pub, now we will be strategically located closest to the food truck on Divinity Avenue. I’m told that our future space was once occupied by Eric Lander, so who knows what we’re going to find during construction. Charred effigies of Craig Venter, I trust.
Continue reading →

Dfam: annotation of transposable elements with profile HMMs

We’re happy to announce the release of Dfam 1.0, a set of profile HMMs for genomic DNA annotation of transposable elements. This essentially constitutes an upgrade of repeat element annotation from using searches with single sequence consensuses to using searches with profile HMMs, now that the HMMER3 project has made DNA/DNA profile HMM searches sufficiently fast for whole genomes. Dfam is a collaboration between Jerzy Jurka and his Repbase resources (Genetic Information Research Institute), Arian Smit and his RepeatMasker software (Institute for Systems Biology, Seattle), the HMMER3 development team at Janelia Farm (particularly Travis Wheeler, leading nhmmer development), and the Xfam database consortium (particularly Rob Finn, here at Janelia). Among other effects of this work, we expect the widely used RepeatMasker software to include nhmmer, Dfam models, and profile HMM searches in the near future. A preprint of the first Dfam paper is available now on our preprint server, and the database itself is available for use at dfam.janelia.org.
Continue reading →

Infernal 1.1: RNA alignment and database search, 10,000x faster

One of our lab’s goals is to make it possible to systematically search for homologs of RNAs in genomes, not just by looking for sequence conservation but also by looking for RNA secondary structure conservation. A powerful model framework for RNA structure/sequence comparison, called profile stochastic-context free grammars (profile SCFGs), was introduced in the mid-1990s both by Yasu Sakakibara and by us. But profile SCFG methods are among the most computationally intensive algorithms used in genome sequence analysis, requiring (in their textbook description, anyway) O(N^4) time and O(N^3) memory for an RNA of N residues. Profile SCFG implementations like our Infernal software have required immense computational power to get even the most basic sort of searches done.

We are happy to announce a new landmark in our work on these methods, with a new version of Infernal that is about 100x faster than the previous (1.0) version, and 10,000x faster than when Eric Nawrocki started working on making Infernal fast enough for routine use. Over at infernal.janelia.org, Eric has made available the first release candidate of Infernal 1.1, 1.1rc1, including source code and binaries for Linux and MacOS/X. A typical RNA homology search of a vertebrate genome that used to require a cpu-year can now be done in about an hour on a single CPU, or a few seconds on a cluster.

So really for the first time, Infernal has become practical for systematic RNA sequence analysis of whole genomes. Roughly speaking, Infernal 1.1 is running at a speed comparable to what HMMER2 ran at — we’ve brought the RNA search problem down from the utterly ridiculous to the merely difficult.

The next version of the Rfam RNA sequence family database will be the first to be computed entirely natively with Infernal RNA structure comparison, instead of using BLASTN as a prefilter. An all-vs-all comparison of all 2000 Rfam models against the entire EMBL DNA database (170 Gb) would take 30,000 cpu-years using Infernal 0.55; now with Infernal 1.1, that enormous Rfam compute is only going to take us about a day on Janelia’s cluster.

Like Infernal 1.0, 1.1 is achieving its speed by using profile HMMs as heuristic prefilters. Whereas 1.0 used HMMER2-like prefilters, 1.1 has now switched to using HMMER3‘s vector engine, sharing code with Travis Wheeler’s soon-to-be-announced nhmmer program for DNA/DNA comparison.

Happy RNA hunting — and don’t let anyone tell you that O(N^4) algorithms aren’t tractable!