Extracting HMMER results to sequence files: Easel miniapplications

Easel logo

The HMMER and Infernal code includes some hidden tools: the Easel library, and its “miniapplications”. Easel is our code library (in the easel subdirectory of both HMMER and Infernal), and the miniapplications (in easel/miniapps) are a set of command line utilities that we use for manipulating sequence data. For example, esl-reformat is a utility for reformatting from one sequence file format to another, and esl-sfetch is a tool for retrieving sequence(s) or subsequence(s) from a large sequence flatfile. These utilities work together with HMMER and Infernal to enable sequence analysis in a flexible, arcane, unix-y command line sort of way.

For example, yesterday someone wrote to ask, suppose I want to extract all the sequences that were hit by a HMMER hmmsearch, and save them in a separate file in FASTA format — how do I do that? This is a good example for introducing Easel’s miniapplications.

Saving alignments from searches

One way to extract the hits is with the -A option to the search programs (hmmsearch, phmmer, jackhmmer). This saves all the domains (subsequences) that were significant (passed the “inclusion thresholds”) as one alignment, in Stockholm format. Then we can use esl-reformat to convert the alignment to FASTA format:

hmmsearch -A myhits.sto tutorial/fn3.hmm uniprot_sprot.fasta
esl-reformat fasta myhits.sto > myhits.fa

That only extracts the domains that matched the model — what if we want the entire original sequences? And what if we only want a subset of those sequences, filtered by some crazy criterion of our own?

esl-sfetch: extract (sub)sequence(s) from a large sequence flatfile

The esl-sfetch program (in easel/miniapps) is for extracting arbitrary subsets of sequences or subsequences from a sequence file. The first thing you need to do to use esl-sfetch on a file is to create an “SSI index” for that file:

esl-sfetch --index uniprot_sprot.fasta

Now you have a file uniprot_sprot.fasta.ssi, a binary index for your sequence datafile.

To extract a single sequence by name or accession:

esl-sfetch uniprot_sprot.fasta "sp|A2ASS6|TITIN_MOUSE"

(Note that esl-sfetch doesn’t try to parse the semantics of a FASTA id; it doesn’t know that sp|A2ASS6|TITIN_MOUSE is a concatenation of the SwissProt accession A2ASS6 and the SwissProt identifier TITIN_MOUSE. We’ll probably change that in the future, because it’s a hassle not to be able to use the accession or identifier alone when FASTA name/description lines are constructed this way. Also note that because of the | symbols, this name has to be double-quoted on the commandline.)

To extract lots of sequences at the same time, esl-sfetch -f takes a file of sequence identifiers instead of a single identifier. For example if the file myhits.list consists of three names:

% cat myhits.list
sp|Q8WZ42|TITIN_HUMAN
sp|P04937|FINC_RAT
sp|Q60847|COCA1_MOUSE

then we can fetch them all at once like this:

esl-sfetch -f uniprot_sprot.fasta myhits.list > myhits.fasta

esl-sfetch can read from a pipe

Now for a neat trick. esl-sfetch can read its input file from a pipe, if you use ‘-‘ as an argument instead of a file name. This means you can do any unix-y, perl-y, awk-y command line invocation you want, and if it results in a list of sequence identifiers, you can pipe it into esl-sfetch.

For example, suppose in the question we started with, we want to extract the original (full-length) sequences that contain one or more HMMER hits. We can first save a tabular HMMER output file, then manipulate that tabular data to create the fetching list we want. (Remember that HMMER’s tabular output files contain comment lines starting with a # symbol, so invocations typically begin with grep -v "^#" to remove comments.)

hmmsearch --tblout myhits.tbl tutorial/fn3.hmm ~/data/uniprot_sprot.fasta
grep -v "^#" myhits.tbl | awk '{print $1}' | esl-sfetch -f uniprot_sprot.fasta - > myhits.fa

esl-sfetch is efficient. This example retrieves 729 sequences from SwissProt in about 300 milliseconds, on my Mac desktop.

esl-sfetch for lots of subsequences

With the -C option in addition to -f, the input file for esl-sfetch consists not just of a list of names, but also includes start/end coordinates of a subsequence. Specifically, each line has the (whitespace-delimited) format with four fields:

new_name start end source_identifier

So, suppose you didn’t want to fetch the complete sequences, but you wanted to fetch the subsequences that got hit in a HMMER search, using their “envelope” coordinates. We could start with a domain tabular output (–domtblout) and fetch as follows:

hmmsearch --domtblout myhits.dtbl tutorial/fn3.hmm ~/data/uniprot_sprot.fasta
grep -v "^#" myhits.dtbl | awk '{print $1"/"$20"-"$21, $20, $21, $1}' | esl-sfetch -Cf uniprot_sprot.fasta - > myhits.fa

That example retrieves 2689 subsequences in about a second, on my Mac.

for more information

We don’t consider Easel to be quite ready for prime time yet, which is why we haven’t said much about it. But it’s getting there, and we welcome feedback (and code contributions!) for making it better. The best place to learn more about the Easel miniapps is to poke around in the source code directory (for example, HMMER’s easel/miniapps), where you’ll find UNIX man pages for each miniapp. Read ’em with, for example:

man ./esl-alistat.man

Each of them will also dump a short help page if you give them a -h option:

esl-alistat -h

17 thoughts on “Extracting HMMER results to sequence files: Easel miniapplications

  1. Hooray esl-sfetch! For as “simple” as it seems, it is the fastest, most robust, full-featured such tool I’ve come across. It’s also always installed, b/c you can’t live without HMMER right?

    In fact, just made a bunch of MCL clusters using pHMMER alignments, and I have esl-sfetch running on the cluster, making little FASTAs of each to run through MUSCLE/hmmbuild. Reading this now is like eating while watching Food Network.

    esl-reformat, esl-selectn, and esl-shuffle are gems too.

    Thank you for this Sean & crew!

    Like

  2. When proteins that need to be HMMaligned, each possess multiple copies of the domain that needs to be aligned, can HMMalign deal with it?
    Or will I have to perform some parsing in order to visualize the data in a human-friendly way?
    Thanks – A

    Like

  3. This is covered in the user’s guide, isn’t it?

    hmmalign assumes that each input sequence contains a single region of homology to the model. It aligns each input sequence globally with respect to the sequence, and locally with respect to the model (i.e. the sequence can be a fragmentary match to the model). If the target sequence contains two or more domains, these subsequences need to be extracted before handing them to hmmalign for multiple alignment.

    But in part, that’s the purpose of the hmmsearch -A option. If you know your target sequences might contain more than one domain, use hmmsearch -A to save an alignment of the “included” domains that hmmsearch detects. This is the same as extracting each included domain and passing them to hmmalign.

    Like

  4. Great post. Very interested to see these for others in the easel package (in your copious spare time). One issue I ran across recently is that the interface for esl-alimanip changed unexpectedly (initially used easel in infernal, but switched to hmmer3), and I couldn’t use the –start-all and –end-all args to extract specific columns from a stockholm file (they disappeared in the hmmer3 version). Is there a new way to achieve the same behaviour with the hmmer3 versions?

    Like

  5. But it seems as if esl-reformat only can convert from Stockholm format, not to Stockholm format?

    esl-reformat stockholm tmp.fasta

    “parse failed (line 1): missing “# STOCKHOLM” header”

    Like

  6. Well, two problems with that:
    1. you can’t convert an unaligned (fasta) file to a multiple alignment format (stockholm); you can convert alignments to unaligned sequences, or alignments to alignments. To get an alignment from unaligned sequences, you have to perform a multiple sequence alignment, not just change the format.

    2. In the current public releases of hmmer/infernal/easel, there’s no autodetection of alignment formats, so it assumes that all input alignment files are in Stockholm format unless you explicitly say otherwise, using an –informat option. (This will change in the next release; we have good autodetection for all our formats now, both alignments and unaligned files.)

    So if “tmp.fasta” is an aligned FASTA file (afa format), not just a plain FASTA file, you should be able to:

    % esl-reformat –informat afa stockholm tmp.fasta

    Like

  7. It appears that esl-alistat does not calculate per-column weights for afa (fasta) format, but rather just reformats the weight columns from stockholm format, correct? I keep getting a segfault if I try to output (calculate?) per-column weights for aligned fasta; I know easel isn’t ready for prime-time, but perhaps a tiny clue as to the reason for the faulting might be helpful in an actual release.

    Thanks,
    Jed

    Like

  8. You’ll have to say more. I don’t understand. esl-alistat doesn’t calculate sequence weights, afa files don’t have any place to represent sequence weights, stockholm files don’t have “weight columns”, and I don’t know what you mean by a “per-column” weight. Email me a small reproducible example of what you’re trying to do, maybe.

    Like

  9. Hey, whenever I try to run esl-sfetch to grab multiple subsequences (even directly copying the pipe on this post and replacing the paths to my own files) I get an error saying that the program failed to fetch subsequence residues and that there might be an issue with corrupt coordinates. I’m scanning the nonredundant (nr) database, and my awk-parsed source-identifier file definitely has the “new_name start end source_identifier” format. Am I doing something wrong, or could esl-sfetch be having trouble?
    Thanks,
    Eugene

    Like

  10. Hii,how do i extract hmmsearch output sequences from my target database …i have done hmmsearch using HMMER Tools in Mobyle pastuer suit and got some hits…Now i want to extract the obtained sequences in fasta format from target database how do i go about using esl tools?i have tried to follow the above examples but iam not getting it right…
    This is my sample output…i want to get fasta sequences using the listed accessions numbers from my target database.

    Query: clustalO-multialign [M=1220]
    Scores for complete sequences (score includes all domains):
    — full sequence — — best 1 domain — -#dom-
    E-value score bias E-value score bias exp N Sequence Description
    ——- —— —– ——- —— —– —- — ——– ———–
    3.9e-262 874.2 38.1 5.6e-261 870.3 26.4 2.0 1 Os05t0492200-01 NB-ARC domain containing protein.
    3.6e-261 871.0 40.4 2.2e-225 752.4 7.0 2.2 2 Os11t0429100-00 Hypothetical conserved gene.
    7.4e-259 863.3 10.7 9.1e-259 863.0 7.4 1.0 1 Os03t0579200-00 Similar to NB-ARC domain containing protein.
    1.3e-258 862.5 55.3 3.6e-225 751.7 13.4 3.0 3 Os12t0481400-00 Similar to NBS-LRR disease resistance protei
    2.5e-258 861.6 37.8 1.1e-229 766.7 12.7 3.1 3 Os12t0481700-00 Disease resistance protein domain containing
    2.5e-258 861.6 37.8 1.1e-229 766.7 12.7 3.1 3 Os12t0482100-00 Disease resistance protein domain containing
    5.4e-253 843.9 34.5 1.6e-220 736.3 6.4 2.0 2 Os06t0707733-00 Disease resistance protein domain containing
    2.7e-248 828.3 29.3 4.6e-246 820.9 20.3 2.0 1 Os05t0380300-01 Similar to NBS-LRR protein (Fragment).
    1.1e-246 823.0 27.7 5.7e-209 698.1 8.3 2.0 2 Os02t0262800-01 Similar to NBS-LRR protein (Fragment).
    8.2e-246 820.1 37.6 2.1e-219 732.6 17.4 2.1 2 Os07t0141900-00 Similar to OSIGBa0148A10.13 protein.
    5e-239 797.6 39.5 1.2e-222 743.4 18.7 2.4 2 Os01t0359800-00 Similar to NB-ARC domain containing protein.
    5.1e-238 794.3 51.9 9e-200 667.6 18.3 3.8 2 Os04t0621500-00 Disease resistance protein domain containing
    4.9e-237 791.0 17.2 6.5e-237 790.6 11.9 1.0 1 Os12t0564800-01 NB-ARC domain containing protein.
    5.3e-237 790.9 42.7 5.2e-228 761.1 23.9 2.0 2 Os08t0543100-00 Similar to NBS-LRR type resistance protein (
    3.6e-236 788.1 47.0 2.7e-223 745.5 23.5 3.0 2 Os01t0359600-00 Disease resistance protein domain containing
    4.1e-236 788.0 43.1 2e-232 775.7 29.9 2.1 1 Os10t0130600-00 Hypothetical conserved gene.
    4.9e-236 787.7 60.2 3.4e-208 695.5 23.7 2.2 2 Os08t0296700-01 NB-ARC domain containing protein.
    1.7e-235 785.9 8.5 9e-235 783.5 5.9 1.8 1 Os11t0462900-00 Similar to NB-ARC domain containing protein,
    6.9e-235 783.9 42.6 3.6e-219 731.8 22.7 3.6 2 Os03t0848700-01 Similar to TDRGA-1 (Fragment).

    Like

  11. I’ve found what looks like an incompatibility between hmmsearch (HMMER 3.1b2) and esl-sfetch.

    When hmmsearch finds more than one matching domain in a protein, it creates additional records in the Stockholm file with the same protein ID but with the start and finish residue numbers appended.

    esl-sfetch does not understand this. I found it necessary to delete the start/finish residues from the Stockholm file – this is OK for small test datasets, but not feasible for real-world problems.

    Here’s an extract from a Stockholm file, output from hmmsearch, that gives an error (it says the protein ID is “not found” in the index file)::

    #=GS sp|P11828|GLYG3_SOYBN/36-187 DE [subseq from] Glycinin G3 OS=Glycine max GN=GY3 PE=3 SV=1
    #=GS sp|P07730|GLUA2_ORYSJ/52-211 DE [subseq from] Glutelin type-A 2 OS=Oryza sativa subsp. japonica GN=GLUA2 PE=1 SV=1

    And I edited it as below to allow esl-sfetch to work properly

    #=GS sp|P11828|GLYG3_SOYBN DE [subseq from] Glycinin G3 OS=Glycine max GN=GY3 PE=3 SV=1
    #=GS sp|P07730|GLUA2_ORYSJ DE [subseq from] Glutelin type-A 2 OS=Oryza sativa subsp. japonica GN=GLUA2 PE=1 SV=1

    Is there some switch or other way to avoid this problem?

    Like

    • I don’t think there’s any incompatibility here. I’m not sure why you’re using the Stockholm file (where you already have your subsequences!) to extract subsequences with esl-sfetch. esl-sfetch understands your sequence names just fine. If you’re trying to extract a subsequence from sp|P11828|GLYG3_SOYBN, you need to give esl-sfetch that name, and a file that contains sp|P11828|GLYG3_SOYBN. If you’re trying to extract a subsequence from sp|P11828|GLYG3_SOYBN/36-187 (i.e. which itself is a 152aa subsequence), you give esl-sfetch that name, and a file that contains sp|P11828|GLYG3_SOYBN/36-187. These are different source sequences.

      What are you trying to do? If you’re trying to fetch complete sequences that contain domains, then you need a list of names like “sp|P11828|GLYG3_SOYBN”, and a file that contains those source sequences. For the names, you get that most easily either from the main output file, in the list of sequences that contain one or more domains; or from the tabular output file (see “–tblout” option).

      The names in the Stockholm file are unique to each subsequence because they’re unique subsequences; they cannot continue to have the same (original) sequence name.

      Like

      • Thanks. I was trying to get the full sequences, and now I understand (having read this page more carefully) that I need -tblout and to use grep and awk to get the identifiers for esl-sfetch to retrieve the sequences.

        Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s