Thread count in HMMER3

It’s come to my attention (one helpful email, plus some snarky subtweets) that the --cpu flag of HMMER3 search programs may have a bad interaction with some cluster management software.

The --cpu n argument is documented as the “number of parallel CPU workers to use for multithreads”. Typically, you want to set n to the number of CPUs you want to use on average. But it is not the total number of threads that HMMER creates, because HMMER may also create additional threads that aren’t CPU-intensive. I believe most HMMER3 programs currently use n+1 threads, and the HMMER4 prototype currently uses n+3.

The reason for the +1 is that we have a master/worker parallelization scheme, with one master and n workers. The master is disk intensive (responsible for input/output), and the workers are CPU intensive.

The reason for the +3 is that we are making more and more use of a technique called asynchronous threaded input to accelerate reading of data from disk. We fork off a thread dedicated to reading input, and it reads ahead while other stuff is happening. Another thread, in our current design, is there for decompression, if necessitated by the input file.
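
To make the read-ahead idea concrete, here is a minimal Python sketch. It is not HMMER's actual C code: the names `read_ahead` and `count_residues`, the FASTA-like toy input, and the buffer size are all illustrative assumptions. A dedicated reader thread fills a bounded queue while the main thread does the (pretend) CPU work.

```python
import io
import queue
import threading

_DONE = object()  # sentinel object marking end of input


def read_ahead(stream, out_q):
    """Reader thread: pull lines from `stream` and queue them for the consumer."""
    for line in stream:
        out_q.put(line)  # blocks while the buffer is full, which bounds memory use
    out_q.put(_DONE)


def count_residues(stream, buffer_size=64):
    """Count sequence characters, with input read ahead by a separate thread."""
    q = queue.Queue(maxsize=buffer_size)
    reader = threading.Thread(target=read_ahead, args=(stream, q), daemon=True)
    reader.start()

    total = 0
    while True:
        line = q.get()  # waits only if the reader has fallen behind
        if line is _DONE:
            break
        if not line.startswith(">"):  # skip FASTA-style header lines
            total += len(line.strip())

    reader.join()
    return total


if __name__ == "__main__":
    toy_input = io.StringIO(">seq1\nACGT\nAC\n>seq2\nGGG\n")
    print(count_residues(toy_input))  # prints 9
```

The reader thread spends nearly all its time waiting on input rather than computing, which is why it adds to the thread count without adding much to CPU load.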

Apparently some cluster management software requires that you state the maximum number of CPUs your job will use, and if the job ever uses more than that, your job is halted or killed. So if HMMER starts n+1 threads, and the +1 thread — however CPU-nonintensive it may be — gets allocated to a free CPU outside your allotment of n, then your job is halted or killed. Which is understandably annoying.

The workaround with HMMER3 is to tell your cluster management software that you need a maximum of n+1 CPUs, when you tell HMMER --cpu n. You won’t use all n+1 CPUs efficiently (at best, you’ll only use n of them on average), but then, HMMER3 is typically i/o bound on standard filesystems, so it doesn’t scale to more than 2-4 CPUs well anyway.
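
The arithmetic of the workaround is simple enough to script. This is a hedged sketch: the helper names `cpus_to_request` and `hmmsearch_argv` and the file names are made up for illustration, and the `extra_threads` values just restate the n+1 and n+3 figures I gave above with an "I believe" attached. How you pass the resulting number to your scheduler depends on your cluster.

```python
def cpus_to_request(cpu_flag, extra_threads=1):
    """CPUs to request from the scheduler when running HMMER with --cpu `cpu_flag`.

    extra_threads=1 matches the n+1 figure quoted for HMMER3;
    extra_threads=3 would match the n+3 figure quoted for the HMMER4 prototype.
    """
    if cpu_flag < 1:
        raise ValueError("this sketch only covers --cpu values of 1 or more")
    return cpu_flag + extra_threads


def hmmsearch_argv(cpu_flag, hmmfile, seqdb):
    """Build an hmmsearch command line that uses `cpu_flag` worker threads."""
    return ["hmmsearch", "--cpu", str(cpu_flag), hmmfile, seqdb]


if __name__ == "__main__":
    n = 4
    print(cpus_to_request(n))                    # prints 5
    print(cpus_to_request(n, extra_threads=3))   # prints 7
    # query.hmm and targets.fa are hypothetical file names.
    print(" ".join(hmmsearch_argv(n, "query.hmm", "targets.fa")))
```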

I find it hard to believe that cluster management tools aren’t able to deal smartly with multithreaded software that combines CPU-intensive and i/o-intensive threads. I presume that there’s a good reason for these policies, and/or ways for cluster managers to configure or tune appropriately. I’m open to suggestions and pointers.

snoscan and squid in the 21st century

Back in the 20th century, when Todd Lowe was a happy PhD student in my lab in St. Louis instead of toiling in irons as professor and chair at the Department of Biomolecular Engineering at UCSC, he wrote a program called snoscan for searching the yeast genome sequence for 2′-O-methyl guide C/D snoRNAs (small nucleolar RNAs), using a probabilistic model that combines consensus C/D snoRNA features with guide sequence complementarity to a predicted target methylation site in ribosomal RNA [Lowe & Eddy, 1999]. Like all the software that’s been developed in our group, snoscan has been freely available ever since. Over 30 years, our software, manuscript, and supplementary materials archive has so far survived several institutional moves (Boulder to MRC-LMB to St. Louis to Janelia Farm to Harvard), several generations of revision control system (nothing to rcs to cvs to subversion to git), and is now hosted in the cloud, with some of the bigger projects also on GitHub.

It’s one thing to keep software available. It’s entirely another thing to keep it running. Software geeks know the term bit rot: a software package will gradually fall apart and stop running, even if you don’t change a thing in it, because operating systems, computer languages, compilers, and libraries evolve.

We got a bug report on snoscan the other day, that it fails to compile on Red Hat Linux. (Adorably, the report also asked “how were you ever able to compile it?”. Well, the trick for compiling 18-year-old code is to use an 18-year-old system.) Indeed it does fail, in part because the modern glibc library puts symbols into the namespace that weren’t in glibc in 1999, causing a C compiler in 2017 to fatally screech about name clashes.

I just spent the day unearthing snoscan-0.9b and whacking it back into compilable and runnable shape, which felt sort of like my college days keeping my decrepit 1972 MG running with a hammer and some baling wire, except that I wasn’t cursing Lucas Electrical every five minutes. For those of you who may be trying to build old software from us, and maybe even for you idealists dreaming of ziplessly, magically reproducible computational experiments across decades, here’s a telegraphic tale of supporting a software package that hasn’t been touched since Todd’s thesis was submitted in 1999, in the form of a few notes on the main issues, followed by a change log for the hammers and baling wire I just used to get our snoscan distribution working again.


Go sign this.

Imagine you’re a legal US resident, with a legal position in the US. You’re away visiting your family in another country, and when you try to fly back home to the US, the US government won’t let you back in. You could be a PhD student or a postdoc in a research lab here, living here for years, all your stuff and your friends are here – doesn’t matter. Tough for you. Go home to your country, US border officials tell you.

Bullshit, right? Right. Now go sign this.

These are not American values. It has to stop.

Open position: Bioinformatics scientist

Michele Clamp is building up Harvard’s bioinformatics team on the college (Cambridge) campus, and she’s got an open position in collaboration with Susan Mango’s group:

The Mango Lab in the Harvard Department of Molecular and Cellular Biology, in collaboration with the FAS Informatics Group, seeks a bioinformatician for genomics analysis. The successful candidate will primarily focus their work on the Mango lab, while embedded within a cohort of informaticians who develop and apply cutting-edge approaches to a range of biological questions. Current projects within the Mango lab use genomics to study transcriptional regulation and chromatin dynamics during development. ChIP-seq, FAIRE-seq, RNA-seq, Oligopaint etc. are used to probe the chromatin landscape as cells transition from pluripotency to cell-fate commitment, and the successful candidate will collaborate with lab members to analyze the datasets. The remaining effort will be used to support groups in the Faculty of Arts and Sciences (FAS) that use genomic approaches such as ATAC-seq, ChIP-seq, FAIRE-seq, and related methodologies. This will include consulting with faculty groups, teaching workshops on statistical and informatics approaches for understanding these methods, and developing best practice recommendations and benchmarks.

For more information, or to apply, see here.

New open positions

We have two new open positions in the group. We are looking for a bioinformatics analyst and a scientific software engineer to join the teams that develop the HMMER and Infernal software packages for biological sequence analysis. The teams are growing, with a key new team member Nick Carter, who joined us from high performance computing research at Intel and a previous computer science faculty position at U Illinois. We’re aiming in particular to bring out the next release of HMMER, the long-fabled HMMER4. These positions offer the short-term opportunity to help us bring these existing projects to the next level, and a longer-term opportunity to participate in a variety of fundamental computational biology algorithms research and software development.

The scientific software engineer will work most closely with Nick and myself on HMMER, and later on Infernal in collaboration with Eric Nawrocki (NIH/NCBI). Our codebases are ANSI C99, we take advantage of SIMD vectorization instructions on multiple platforms, and we’re working hard on parallelization efficiency with multithreading (POSIX threads) and message passing (MPI), so we’re looking for someone with expertise in these technologies. We currently work primarily on Apple OS/X and Linux platforms ourselves, but our code has to build and work on any POSIX-compliant platform, so we’re also looking to expand our automated build/test procedures. Our codebases are open source and we work with standard open source tools such as git, autoconf, and GNU make, and development is distributed amongst a small group of people throughout the world, especially including collaborators at NIH/NCBI, U Montana, HHMI Janelia Farm, and Cambridge UK; you can see our GitHub repos here. The official advertisement for the position is online here, with instructions on how to apply.

The bioinformatics analyst will work most closely with our bioinformatics secret agent man Tom Jones, Nick, and myself. We’re looking for someone to bridge the gap between the computational engineering and the biological applications, someone who will have one hand in the development team (working on how our command line interfaces work, and how our output formats come out), and another hand on our collaborations within Harvard and with the outside world using and testing our software on real problems. We want someone thinking about benchmarking and testing, with expertise in a scripting language (Python, Perl); we want someone working on user-oriented documentation and tutorials (Jekyll, Markdown); we want someone thinking about ease and elegance of use; we want someone working on how our tools play well with others, including Galaxy and Bioconductor, and liaising between us and the various package managers who bundle our software (such as Homebrew, MacPorts, all the various Linuxen). The official ad and instructions to apply are online here.

We are reading applications now, and will accept applications in a rolling fashion for at least another month or so; beyond that depends on how our candidate pool looks. The positions are available immediately. Our funding for them has just activated, supported by the NIH NHGRI under the NIH’s program (PA-14-156) for Extended Development, Hardening and Dissemination of Technologies in Biomedical Computing, Informatics, and Big Data Science; we gratefully acknowledge this new support, as I’ve rejoined the NIH community after ten years away at a monastery.

from zero to python


Students actually showed up, so we really do have to teach the course. MCB112 Biological Data Analysis is now in its first week.

The tricksiest bit in the first couple weeks is bringing people up to speed in writing Python, for people who’ve never written code before. We trust in the power of trial and error. We give working example scripts that are related to what the students are asked to do on a problem set. Developing code by mutation, descent with modification, and selection: coding for biologists.

Soon we’ll start to lift the training wheels, while trying not to leave people in a “now draw the rest of the damn owl” situation.

When you’re learning to code, with every line you type you’re looking something up. Your concentration is getting broken all over the place as you try to express the Simplest Stupid Thing (Why Don’t You Work gaaaah $%^&#@). If you’re also trying to learn something else at the same time that requires hard thinking – an algorithm, a mathematical equation, a biological analysis approach – really just about the last thing you need is to have your concentration broken every ten seconds because you can’t express yourself. The best way to learn to code isn’t to start by writing scientific code. It’s better to code something fun, something that you’re completely absorbed by, something that isn’t too conceptually difficult. You want to have only the code frustrating you, while the goal pulls you in and keeps you engaged.

But I can’t exactly recommend that students learn to code the way that I did. Sure, go get yourself absorbed in an early Internet massive military-industrial simulation game. Automate your country’s economy, re-invent Dijkstra’s shortest path algorithm to distribute your resources, make an interactive display of your map, reverse engineer the client/server communication interface so you can launch automated attacks… no, this is no way to do a PhD. Even if it does mean you end up knowing C and Perl and understanding dynamic programming, GUI development, and networked computing.

So alas, we’ll try to generate entertainment value in more socially acceptable ways, like sand mouse mysteries in the problem sets, or teasing Lior Pachter. We’ll see if it’s enough. If not, maybe I’ll have to see if the old Empire code still compiles.

open position: lab coordinator

We have an open position for a new lab coordinator. Ads are running at Nature and elsewhere. Our LC is our point of contact for everything administrative that makes the lab run smoothly, including responsibility for our financial accounting/purchasing with both HHMI’s and Harvard’s systems, tracking our funding, managing our travel planning, and coordinating the lab’s internal scheduling and communications (a mix of Google Calendar, Google Drive, OneNote, and Slack), among other jack-of-all-trades things. If you know of candidates, pass this info along — and if you’re a candidate yourself, see one of the ads for info on how to apply (to HHMI HR, who do the preliminary resume screening for us).

Biological Data Analysis

I’m starting to plan a new Harvard course that’ll be called Biological Data Analysis. Biology is going through a culture change. It’s suddenly become a data-rich, computational analysis-heavy science. Are we going to outsource data analysis to bioinformaticians and data science specialists, or are biologists going to analyze their own data? There are always advantages to specialization, and we need bioinformatics and data science. But I also feel that we are dangerously weak in training biologists to think about their own data. The usual response I get when I talk about it is something like “you can’t expect wet lab biologists to learn how to program”.

What I want to teach in Biological Data Analysis is that writing scripts and using the command line for data analysis is not software engineering, it’s just a simple and essential thing that a wet lab biologist can do, and needs to do. I’m going to teach from the point of view that biologists already have a special advantage in large-scale data analysis: we are trained to expect that we will be screwed by our experiments. We should be treating data analysis the same way. Like doing an experiment on a complicated organism, any given data analysis only gives you a narrow glimpse into a large data set. God only knows what else is going on in the data that you’re not seeing. Like doing experiments, you need to design positive and negative controls to protect yourself from the hundred different ways that nature (and computers) are going to mess with you. Writing scripts that generate positive and negative controls for a data analysis is a powerful and biologically motivated thing to know how to do.

Once you’re generating negative control data (“here’s what the data would look like if there were no effect to be found”), you’re actually doing statistics, but in an intuitive and motivated way that any biologist can understand. Instead of learning a bunch of incantations and lore about t-tests, you’re forced to think directly about what your null hypothesis is, because you have to make a negative control data set according to that null hypothesis. The “p-value” is directly the probability that you observe a signal at least as strong as the real one in your negative control. This style of simulation-driven analysis is enabled by modern computational power plus the ability to write simple scripts and use the command line. You don’t have to learn statistics per se. You have to learn how to do computational control experiments. I expect that if a biologist learns the simulation-driven style of analysis first, then they’re motivated to go on to learn more serious statistical analysis as they need it… and they’re armed with a powerful way to check analytic results against intuitive simulations.
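
Here is a minimal sketch of what such a computational negative control can look like in Python. The specific choices are assumptions made for illustration: the "signal" is a difference in group means, the null hypothesis is that the group labels don't matter (so shuffling them generates negative control data), and the function names are made up.

```python
import random


def mean_difference(a, b):
    """The 'signal': difference between the means of two groups."""
    return sum(a) / len(a) - sum(b) / len(b)


def negative_control_p_value(treated, control, n_sim=10000, seed=0):
    """Fraction of label-shuffled data sets whose signal is at least as
    strong as the signal in the real data."""
    rng = random.Random(seed)  # fixed seed so the analysis is repeatable
    observed = abs(mean_difference(treated, control))
    pooled = list(treated) + list(control)

    hits = 0
    for _ in range(n_sim):
        # Negative control: if labels don't matter, any relabeling is as good as the real one.
        rng.shuffle(pooled)
        fake_treated = pooled[:len(treated)]
        fake_control = pooled[len(treated):]
        if abs(mean_difference(fake_treated, fake_control)) >= observed:
            hits += 1
    return hits / n_sim


if __name__ == "__main__":
    treated = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2]
    control = [5.0, 5.2, 4.9, 5.1, 4.8, 5.3]
    print(negative_control_p_value(treated, control))
```

With well-separated groups like these, very few shuffles match the real signal, so the estimate comes out small. A plain fraction like this can come out as exactly zero when no shuffle matches, which just means "smaller than 1/n_sim", not "impossible".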

If this makes any sense to you, and if you happen to be a Harvard PhD student graduating this year, and you’re thinking it might be nice to take a year and do some teaching… boy, do I have a deal for you. Harvard has a thing called the College Fellows Program. This is a one-year position (renewable for one more) that focuses on teaching and course development. We’ve just posted an ad looking for a College Fellow to help me develop and teach Biological Data Analysis. Application deadline is April 15. Feel free to contact me directly with questions!