RNA secondary structure prediction with probability models

Over at our publications page, I've posted a preprint of Elena Rivas' latest paper on RNA secondary structure prediction, which she submitted for review today.

The state of the art in RNA secondary structure prediction is arguably starting to move towards statistical models, instead of (or in addition to) thermodynamic models. Statistical models can be estimated on large RNA databases like Rfam and can fairly naturally incorporate probabilistic constraints generated by experiments such as protection assays.

The best statistical models for RNA folding to date have been discriminatively trained models like CONTRAfold from Chuong Do and Serafim Batzoglou. One thing that caught our eye in the excellent Do and Batzoglou paper was the claim that discriminative training was used because the literature indicated it wasn't feasible to train good generative models (SCFGs). Our lab is responsible for some of the papers Do and Batzoglou cited, and the thing is, in those papers we never really tried to make a good SCFG for single-sequence RNA folding with realism/complexity on the order of CONTRAfold or the thermodynamic nearest-neighbor model -- we were always after something else, like RNA genefinding, where we deliberately compromised against other design constraints and made simpler SCFGs. So here, Elena describes a range of SCFGs for probabilistic RNA folding.

The result ends up being a little discouraging to us: SCFGs and discriminative models like CONTRAfold perform about the same, slightly but not dramatically better than thermodynamic models, even though we explored a fair number of models even more complex than the current state of the art. (The ability to explore more complex models is, in principle, one of the main attractions of statistical models -- because we can retrain the entire model easily on data, we're not as bound to several decades of careful, focused biophysical experiments as the thermodynamic parameter sets are.) We found that it's super-easy to overtrain these models, that Rfam is pretty problematic as a source of individual RNA secondary structures for training these models, and that the standard training/test paradigm we've seen in many people's papers is vulnerable to problems because of a lack of structural diversity in our current datasets. (Separating your training and test sequences by sequence dissimilarity isn't enough; you really need to test on structures entirely nonhomologous to any of your training structures -- in effect, holding out whole structural families, as sketched below.) Our main conclusion is that these methods are promising, but the single most limiting thing is the lack of good, large, diverse databases of individual RNA secondary structures.
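To make that last point concrete, here's a minimal sketch (in Python; not code from the paper) of what a structure-level split looks like, as opposed to a sequence-identity split: hold out entire structural families, so that no test structure has a homolog anywhere in the training set. The record format and the Rfam-style family labels below are just illustrative.

```python
# A minimal sketch of a structure-level train/test split:
# hold out whole structural families, so no test structure is homologous
# to anything in the training set. Record fields and family IDs are illustrative.

import random
from collections import defaultdict

def split_by_family(records, test_fraction=0.2, seed=0):
    """records: iterable of (sequence, structure, family_id) tuples.
    Returns (train, test) lists; every family lands entirely on one side."""
    by_family = defaultdict(list)
    for rec in records:
        by_family[rec[2]].append(rec)

    families = sorted(by_family)              # sort, then shuffle reproducibly
    random.Random(seed).shuffle(families)

    n_test = max(1, int(round(test_fraction * len(families))))
    test_families = set(families[:n_test])

    train, test = [], []
    for fam, recs in by_family.items():
        (test if fam in test_families else train).extend(recs)
    return train, test

# Toy example with hypothetical records tagged by Rfam-style family IDs:
data = [
    ("GGGAAACCC",    "(((...)))",    "RF00005"),
    ("GGCAAAGCC",    "(((...)))",    "RF00005"),
    ("GGGGAAAACCCC", "((((....))))", "RF00001"),
    ("AAAGGGUUUCCC", "...(((...)))", "RF00174"),
]
train, test = split_by_family(data, test_fraction=0.25)
```

The point of splitting on family rather than on pairwise sequence identity is that two sequences can be well below any identity cutoff yet still share the same fold, which quietly leaks structural information from training into test.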

A personal aside on this paper -- our coauthor is Ray Lang, chair of computer science at Xavier University of Louisiana, in New Orleans. Ray spent the fall of 2005 in our lab in Saint Louis, after Xavier University was devastated by Hurricane Katrina, as part of a program that HHMI sponsored to place Xavier faculty temporarily in Hughes research labs. Ray lost a lot to Katrina. It's nice to see one small bright thing come from that fall. He started some of the work Elena describes, and it's terrific to see this manuscript head for publication at last.