RNA secondary structure prediction with probability models

Over at our publications page, I’ve posted a preprint of Elena Rivas’ latest paper on RNA secondary structure prediction, which she submitted for review today.

The state of the art in RNA secondary structure prediction is arguably starting to move towards statistical models, instead of (or in addition to) thermodynamic models. Statistical models can be estimated on large RNA databases like Rfam and can fairly naturally incorporate probabilistic constraints generated by experiments such as protection assays.

The best statistical models for RNA folding to date have been discriminatively trained models like CONTRAfold from Chuong Do and Serafim Batzoglou. One thing that caught our eye in the excellent Do and Batzoglou paper was the claim that discriminative training was used because the literature indicated it wasn’t feasible to train good generative models (SCFGs). Our lab is responsible for some of the papers Do and Batzoglou cited, and the thing is, in those papers, we never really tried to make a good SCFG for single sequence RNA folding, with realism/complexity on the order of CONTRAfold or the thermodynamic nearest neighbor model — we were always after something else, like RNA genefinding, where we deliberately compromised against other design constraints and made simpler SCFGs. So here, Elena describes a range of SCFGs for probabilistic RNA folding.
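For readers unfamiliar with the formalism, the kind of model at issue can be sketched in miniature. The grammar, rule probabilities, and uniform emission distributions below are toy assumptions of mine, far simpler than anything in the paper; the point is only that an SCFG assigns a probability to every secondary structure of a sequence, and a CYK-style dynamic program recovers the highest-probability one:

```python
# Toy SCFG for single-sequence RNA folding -- a minimal sketch, not any
# grammar from the paper. Rule probabilities and uniform emission
# probabilities are made-up illustrative numbers.
import math
from functools import lru_cache

LOG = math.log
P_PAIR, P_LEFT, P_BIF, P_END = map(LOG, (0.45, 0.35, 0.15, 0.05))
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
EMIT_PAIR = LOG(1 / 6)    # uniform over the six canonical pairs
EMIT_SINGLE = LOG(1 / 4)  # uniform over the four bases

def fold(seq):
    """Max-probability parse under S -> aS | aSb | SS | epsilon.
       Returns (dot-bracket structure, log probability)."""
    n = len(seq)

    @lru_cache(maxsize=None)
    def score(i, j):  # best log-prob of S deriving seq[i:j]
        if i == j:
            return P_END
        cands = [P_LEFT + EMIT_SINGLE + score(i + 1, j)]      # S -> a S
        if j - i >= 2 and (seq[i], seq[j - 1]) in CANONICAL:  # S -> a S b
            cands.append(P_PAIR + EMIT_PAIR + score(i + 1, j - 1))
        for k in range(i + 1, j):                             # S -> S S
            cands.append(P_BIF + score(i, k) + score(k, j))
        return max(cands)

    struct = ["."] * n

    def trace(i, j):  # recover one optimal derivation as dot-bracket
        if i == j:
            return
        s = score(i, j)
        if math.isclose(s, P_LEFT + EMIT_SINGLE + score(i + 1, j)):
            trace(i + 1, j)
            return
        if j - i >= 2 and (seq[i], seq[j - 1]) in CANONICAL and \
                math.isclose(s, P_PAIR + EMIT_PAIR + score(i + 1, j - 1)):
            struct[i], struct[j - 1] = "(", ")"
            trace(i + 1, j - 1)
            return
        for k in range(i + 1, j):
            if math.isclose(s, P_BIF + score(i, k) + score(k, j)):
                trace(i, k)
                trace(k, j)
                return

    trace(0, n)
    return "".join(struct), score(0, n)
```

A real model of CONTRAfold- or nearest-neighbor-grade realism conditions on stacking, loop lengths, and much more, which is exactly where the parameter count (and the overtraining risk discussed below) explodes.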

The result ends up being a little discouraging to us: SCFGs and discriminative models like CONTRAfold pretty much perform the same, slightly but not a lot better than thermodynamic models, even though we explored a fair number of models even more complex than the current state of the art. (Which in principle is one of the main attractions of statistical models — because we can retrain the entire model easily on data, we’re not as bound to several decades of careful, focused biophysical experiments as the thermodynamic parameter sets are.) We found that it’s super-easy to overtrain these models, that Rfam is pretty problematic as a source of individual RNA secondary structures for training these models, and that the standard training/test paradigm we’ve seen in many people’s papers is vulnerable to problems because of a lack of structural diversity in our current datasets. (Separating your training and test sequences by sequence dissimilarity isn’t enough; you really need to test on structures entirely nonhomologous to any of your training structures.) Our main conclusion is that these methods are promising, but the single most limiting thing is the lack of good, large, diverse databases of individual RNA secondary structures.
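The train/test caveat in that parenthetical can be made concrete. A hypothetical sketch, with a record layout and family labels that are mine rather than the paper's: instead of separating sequences by pairwise dissimilarity, hold out entire structure families, so no test structure is homologous to anything seen in training:

```python
# Hypothetical family-level train/test split -- an illustrative sketch,
# not the paper's benchmarking protocol. Records are (id, family, seq)
# tuples; the family label stands in for structural homology grouping.
import random

def family_split(records, test_frac=0.3, seed=0):
    """Split so that no structure family appears in both train and test."""
    rng = random.Random(seed)
    families = sorted({fam for _, fam, _ in records})
    rng.shuffle(families)
    n_test = max(1, round(test_frac * len(families)))
    test_fams = set(families[:n_test])
    train = [r for r in records if r[1] not in test_fams]
    test = [r for r in records if r[1] in test_fams]
    return train, test
```

With few structure families in current databases, a split like this leaves painfully little to train on, which is one way of restating the paper's conclusion about the need for larger, more diverse structure datasets.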

A personal aside on this paper — our coauthor is Ray Lang, chair of computer science at Xavier University of Louisiana, in New Orleans. Ray spent the fall of 2005 in our lab in Saint Louis, after Xavier University was devastated by Hurricane Katrina, as part of a program that HHMI sponsored to place Xavier faculty temporarily in Hughes research labs. Ray lost a lot to Katrina. It’s nice to see one small bright thing come from that fall. He started some of the work Elena describes, and it’s terrific to see this manuscript head for publication at last.


  1. Thanks for posting this. Your comments about the need for diverse training structures chime with my (less systematic) anecdotal experience. And I too have been skeptical of claims that generative models are insufficient to the tasks of RNA structure prediction and evolutionary reconstruction. These results suggest, at least, that they are no less sufficient than discriminative models, at least for the structural part. Good luck in review, and thanks for the preprint & backstory!



  2. Exciting stuff. I’m only just catching up on this. I also like reading back-stories. For TestSetB, did you check that the Rfam structure and the 2D structure(s) parsed from the 3D structure(s) are consistent? In an ideal world this is the case, but is not guaranteed. I’ve just been looking at the Rfam 5S rRNA model and realise that it certainly needs some work.

    I guess it’s well outside the scope of the article, but it’d be fun to try a ‘next-to-nearest-neighbour’ model. I remember Mike Zuker mentioning those once. We played around a little with this when trying to improve on the mutual information measure with Stinus & Anders. N2NN improved the accuracy, but we couldn’t justify why it worked, or our extremely ad hoc (even for us) weighting scheme.



  3. I forgot to mention: is it possible to use some of the large-scale structure-probing work for benchmarking/training too? E.g. Kevin Weeks’ papers on HIV. The amount of data is probably still too small, but it seems promising.



  4. Hi Paul,

    TestSetB uses Rfam structures. We discussed this the last time you visited Janelia, and it is the big problem: we don’t have many 2D structures parsed from 3D structures for a large and diverse set of RNA molecules.

    Regarding next-to-nearest-neighbour models, that is one of the points of the paper. We tested many features beyond the nn model. The problem (related to the above) is that we don’t have enough data to train these models, which quickly accumulate many parameters. It turns out we don’t have enough data to train even nn models without quickly overtraining.

    I agree; the Low & Weeks, Underwood et al., and Kertesz et al. probing techniques could eventually become good sources of trusted RNA secondary structures.


