Extracting HMMER results to sequence files: Easel miniapplications

The HMMER and Infernal code includes some hidden tools: the Easel library, and its "miniapplications". Easel is our code library (in the easelsubdirectory of both HMMER and Infernal), and the miniapplications (in easel/miniapps) are a set of command line utilities that we use for manipulating sequence data. For example, esl-reformat is a utility for reformatting from one sequence file format to another, and esl-sfetch is a tool for retrieving sequence(s) or subsequence(s) from a large sequence flatfile. These utilities work together with HMMER and Infernal to enable sequence analysis in a flexible, arcane, unix-y command line sort of way.

For example, yesterday someone wrote to ask, suppose I want to extract all the sequences that were hit by a HMMER hmmsearch, and save them in a separate file in FASTA format -- how do I do that? This is a good example for introducing Easel's miniapplications.

Saving alignments from searches

One way to extract the hits is with the -A option to the search programs (hmmsearch, phmmer, jackhmmer). This saves all the domains (subsequences) that were significant (passed the "inclusion thresholds") as one alignment, in Stockholm format. Then we can use esl-reformat to convert the alignment to FASTA format:

hmmsearch -A myhits.sto tutorial/fn3.hmm uniprot_sprot.fasta
esl-reformat fasta myhits.sto > myhits.fa

That only extracts the domains that matched the model -- what if we want the entire original sequences? And what if we only want a subset of those sequences, filtered by some crazy criterion of our own?

esl-sfetch: extract (sub)sequence(s) from a large sequence flatfile

The esl-sfetch program (in easel/miniapps) is for extracting arbitrary subsets of sequences or subsequences from a sequence file. The first thing you need to do to use esl-sfetch on a file is to create an "SSI index" for that file:

esl-sfetch --index uniprot_sprot.fasta

Now you have a file uniprot_sprot.fasta.ssi, a binary index for your sequence datafile.

To extract a single sequence by name or accession:

esl-sfetch uniprot_sprot.fasta "sp|A2ASS6|TITIN_MOUSE"

(Note that esl-sfetch doesn't try to parse the semantics of a FASTA id; it doesn't know that sp|A2ASS6|TITIN_MOUSE is a concatenation of the SwissProt accession A2ASS6 and the SwissProt identifier TITIN_MOUSE. We'll probably change that in the future, because it's a hassle not to be able to use the accession or identifier alone when FASTA name/description lines are constructed this way. Also note that because of the | symbols, this name has to be double-quoted on the commandline.)

To extract lots of sequences at the same time, esl-sfetch -f takes a file of sequence identifiers instead of a single identifier. For example if the file myhits.list consists of three names:

% cat myhits.list sp|Q8WZ42|TITIN_HUMAN sp|P04937|FINC_RAT sp|Q60847|COCA1_MOUSE

then we can fetch them all at once like this:

esl-sfetch -f uniprot_sprot.fasta myhits.list > myhits.fasta

esl-sfetch can read from a pipe

Now for a neat trick. esl-sfetch can read its input file from a pipe, if you use '-' as an argument instead of a file name. This means you can do any unix-y, perl-y, awk-y command line invocation you want, and if it results in a list of sequence identifiers, you can pipe it into esl-sfetch.

For example, suppose in the question we started with, we want to extract the original (full-length) sequences that contain one or more HMMER hits. We can first save a tabular HMMER output file, then manipulate that tabular data to create the fetching list we want. (Remember that HMMER's tabular output files contain comment lines starting with a # symbol, so invocations typically begin with grep -v "^#" to remove comments.)

hmmsearch --tblout myhits.tbl tutorial/fn3.hmm ~/data/uniprot_sprot.fasta grep -v "^#" myhits.tbl | awk '{print $1}' | esl-sfetch -f uniprot_sprot.fasta - > myhits.fa

esl-sfetch is efficient. This example retrieves 729 sequences from SwissProt in about 300 milliseconds, on my Mac desktop.

esl-sfetch for lots of subsequences

With the -C option in addition to -f, the input file for esl-sfetch consists not just of a list of names, but also includes start/end coordinates of a subsequence. Specifically, each line has the (whitespace-delimited) format with four fields:

new_name start end source_identifier

So, suppose you didn't want to fetch the complete sequences, but you wanted to fetch the subsequences that got hit in a HMMER search, using their "envelope" coordinates. We could start with a domain tabular output (--domtblout) and fetch as follows:

hmmsearch --domtblout myhits.dtbl tutorial/fn3.hmm ~/data/uniprot_sprot.fasta grep -v "^#" myhits.dtbl | awk '{print $1"/"$20"-"$21, $20, $21, $1}' | esl-sfetch -Cf uniprot_sprot.fasta - > myhits.fa

That example retrieves 2689 subsequences in about a second, on my Mac.

for more information

We don't consider Easel to be quite ready for prime time yet, which is why we haven't said much about it. But it's getting there, and we welcome feedback (and code contributions!) for making it better. The best place to learn more about the Easel miniapps is to poke around in the source code directory (for example, HMMER's easel/miniapps), where you'll find UNIX man pages for each miniapp. Read 'em with, for example:

man ./esl-alistat.man

Each of them will also dump a short help page if you give them a -h option:

esl-alistat -h