A different view on search results

Have you ever wondered how a new protein family would look in the context of other Pfam domains? Well, look no further than the hmmer website! At the end of last week we released a new way of visualizing search results according to ‘domain architecture’ (this applies to both phmmer and hmmsearch).

A ‘domain architecture’ is the list of protein domains that are found on the protein sequence. Protein domains can be combined in different ways to give rise to functional diversity; consequently, functionally related sequences have the same, or very similar, domain architectures. These architectures can be used as a ‘signature’ for grouping sequences together.
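To make the ‘signature’ idea concrete, here is a minimal sketch of grouping sequences by architecture. It is not how the hmmer website implements the view; the function names, the `(name, start, end)` tuple layout and the example data are all made up for illustration.

```python
from collections import Counter

def architecture_signature(domains):
    """Return the domain names for one sequence, ordered by start position.

    `domains` is a list of (name, start, end) tuples (a hypothetical layout).
    """
    return tuple(name for name, start, end in sorted(domains, key=lambda d: d[1]))

def group_by_architecture(annotations):
    """Count how many sequences share each architecture signature.

    `annotations` maps a sequence identifier to its list of domains.
    """
    return Counter(architecture_signature(doms) for doms in annotations.values())

# Made-up example data, not real search results.
example = {
    "seqA": [("CBS", 10, 60), ("CBS", 70, 120)],
    "seqB": [("CBS", 75, 125), ("CBS", 12, 62)],
    "seqC": [("DUF21", 5, 180), ("CBS", 200, 250), ("CBS", 260, 310),
             ("CorC_HlyC", 330, 400)],
}

counts = group_by_architecture(example)
for architecture, n in counts.most_common():
    print(", ".join(architecture), n)
```

Sorting by start position means seqA and seqB fall into the same group even though their domains are listed in a different order.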

This type of view has been used for a long time in Pfam. For example, the CBS domain is commonly found associated with enzymatic domains, membrane transporters or DNA-binding domains, as well as being found in isolation. The CBS domain frequently occurs as pairs of domains and can regulate the enzymatic and transporter domains that it co-occurs with, in response to binding the adenosyl moiety in molecules such as AMP and ATP. To illustrate some of the features in this new view on the hmmer server, the Pfam 25.0 CBS model has been searched (hmmsearch) against the UniProt sequence database using E-value cut-offs (sequence 0.01, domain 0.03), rather than the Pfam-defined gathering thresholds. The CBS domain is found in over 22,000 sequences, with varying numbers of copies of the domain and all kingdoms of life represented.
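For anyone wanting to try something similar with the command-line version of HMMER, the sketch below builds an hmmsearch command using either E-value cut-offs or the gathering thresholds. The `-E`, `--domE` and `--cut_ga` options are real hmmsearch flags, but mapping the website's cut-offs onto the reporting flags (rather than the inclusion flags `--incE` and `--incdomE`) is an assumption, and the file names are hypothetical.

```python
def hmmsearch_command(hmm_file, seq_db, seq_evalue=0.01, dom_evalue=0.03,
                      use_gathering=False):
    """Build an hmmsearch argument list (a sketch; it does not run anything).

    With use_gathering=True the model's own gathering thresholds are used
    (--cut_ga); otherwise sequence and domain E-value cut-offs are applied.
    """
    cmd = ["hmmsearch"]
    if use_gathering:
        cmd.append("--cut_ga")
    else:
        # Assumption: the website's cut-offs correspond to -E / --domE.
        cmd += ["-E", str(seq_evalue), "--domE", str(dom_evalue)]
    cmd += [hmm_file, seq_db]
    return cmd

# Hypothetical file names for the CBS model and a sequence database.
evalue_cmd = hmmsearch_command("CBS.hmm", "uniprot.fasta")
gathering_cmd = hmmsearch_command("CBS.hmm", "uniprot.fasta", use_gathering=True)
print(" ".join(evalue_cmd))
print(" ".join(gathering_cmd))
```

The list can be passed to `subprocess.run` if hmmsearch is installed; keeping the builder separate makes it easy to compare the two thresholding modes on the same model.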

The precise taxonomic breakdown can be visualized using the ‘taxonomy’ view - but more about this view another time. If we go straight on to the domain architecture view, as shown in Figure 2, the results are far more digestible.

The first five architectures account for over 50% of the sequences (Figure 2, the full page is here - until we need the disk space). The position of the match between the CBS HMM and the target sequence is shown by the black box(es) under the domain graphic. As expected, most of the time there is perfect agreement between the Pfam domains on the graphic and the match position. The most common architecture in our results is a pair of CBS domains. The second most common domain architecture is ‘DUF21, 2x CBS, CorC_HlyC’. Rather tantalizingly, the seventh most common domain architecture is ‘DUF21, CBS, CorC_HlyC’ - with just enough room for another CBS domain. Selecting ‘Show All’ for this architecture reveals a graphical representation of the first 40 out of 817 sequences in this set. Multiple sequences contain two matches to our model where only one CBS domain is annotated, indicating that the Pfam model, used in default mode, is not quite sensitive enough to pick up on both copies of the domain.
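The comparison between match positions and annotated domains can be sketched as a simple interval check. This is only an illustration of the idea, not the website's code; the coordinates and function names below are invented.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one residue."""
    return a_start <= b_end and b_start <= a_end

def unannotated_hits(hits, domains):
    """Return the hits (start, end) that overlap no annotated domain.

    `domains` is a list of (name, start, end) tuples (a hypothetical layout).
    """
    return [
        (h_start, h_end)
        for h_start, h_end in hits
        if not any(overlaps(h_start, h_end, d_start, d_end)
                   for _, d_start, d_end in domains)
    ]

# Invented coordinates for a 'DUF21, CBS, CorC_HlyC' sequence with two matches.
domains = [("DUF21", 5, 180), ("CBS", 200, 250), ("CorC_HlyC", 330, 400)]
hits = [(202, 248), (262, 312)]

extra = unannotated_hits(hits, domains)
print(extra)
```

Here the second match falls in the gap between the annotated CBS and CorC_HlyC domains, which is the kind of case that hints at a missed copy of the domain.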

The domain architecture view has also proved to be useful when searching less well characterized sequences, as you do when building novel protein families. Two recent examples illustrate this. In the first example, Sean and I were looking at the C-terminal part of oskar, a protein in Drosophila that is involved in germ cell formation. What started out as an uncharacterized region soon became clearly related to DUF303 and other esterases found in Pfam, due to the overlaps between hit positions and domains. In the second example, I started with TAF1B_HUMAN, lacking the N-terminal RRN7 domain, and performed a phmmer search (Figure 4).

This query sequence had no Pfam domains annotated on it, but did match 59 other eukaryotic sequences. From here, I downloaded the aligned FASTA file and, after a little futzing around with the family using Jalview, searched with the alignment using hmmsearch. This analysis has produced a eukaryote-specific domain, found on just under 200 sequences (full results are here, until they are removed). It is found associated with the RRN7 domain in 50% of sequences, and many of the sequences lacking that domain look as though they might contain an RRN7-related domain. Now off to the literature to see if I can dig up some functional information. Hopefully, that will shed some light on the one overlap with an existing Pfam domain, HobA - I have my suspicions....

Overall, this new view provides an interesting, alternative way of visualizing results from a search against the hmmer server.  With so many sequences, we understand the need to group and filter results. We hope you find the new view useful!