I’m looking for four teaching fellows (TFs) for my course MCB112 Biological Data Analysis in the fall 2018 semester. TFs are typically Harvard G2 or G3 students (second- or third-year PhD students, in Harvard-speak), but can be more senior students or even postdocs. I teach the course in Python and Jupyter Notebook, using numpy and pandas, so experience with these is a plus. Email me if you’re interested; and if you know someone else at Harvard who might be, please let them know.
It’s come to my attention (one helpful email, plus some snarky subtweets) that the --cpu flag of HMMER3 search programs may have a bad interaction with some cluster management software.

The --cpu n argument is documented as the “number of parallel CPU workers to use for multithreads”. Typically, you want to set n to the number of CPUs you want to use on average. But n is not the total number of threads that HMMER creates, because HMMER may also create additional threads that aren’t CPU-intensive. Most HMMER3 programs currently use n+1 threads, I believe; the HMMER4 prototype currently uses n+3.
The reason for the +1 is that we have a master/worker parallelization scheme, with one master and n workers. The master is disk-intensive (responsible for input/output), and the workers are CPU-intensive.
The reason for the +3 is that we are making more and more use of a technique called asynchronous threaded input to accelerate reading data from disk. We fork off a thread dedicated to reading input, and it reads ahead while other work is happening. In our current design, another thread handles decompression, if the input file requires it.
Apparently some cluster management software requires that you state the maximum number of CPUs your job will use, and if the job ever uses more than that, your job is halted or killed. So if HMMER starts n+1 threads, and the +1 thread — however CPU-nonintensive it may be — gets allocated to a free CPU outside your allotment of n, then your job is halted or killed. Which is understandably annoying.
The workaround with HMMER3 is to tell your cluster management software that you need a maximum of n+1 CPUs when you tell HMMER --cpu n. You won’t use all n+1 CPUs efficiently (at best, you’ll use only n of them on average), but then, HMMER3 is typically i/o-bound on standard filesystems, so it doesn’t scale well beyond 2-4 CPUs anyway.
I find it hard to believe that cluster management tools aren’t able to deal smartly with multithreaded software that combines CPU-intensive and i/o-intensive threads. I presume that there’s a good reason for these policies, and/or ways for cluster managers to configure or tune appropriately. I’m open to suggestions and pointers.
Back in the 20th century, when Todd Lowe was a happy PhD student in my lab in St. Louis instead of toiling in irons as professor and chair of the Department of Biomolecular Engineering at UCSC, he wrote a program called snoscan for searching the yeast genome sequence for 2′-O-methyl guide C/D snoRNAs (small nucleolar RNAs), using a probabilistic model that combines consensus C/D snoRNA features with guide-sequence complementarity to a predicted target methylation site in ribosomal RNA [Lowe & Eddy, 1999]. Like all the software that’s been developed in our group, snoscan has been freely available ever since. Over 30 years, our archive of software, manuscripts, and supplementary materials has so far survived several institutional moves (Boulder to MRC-LMB to St. Louis to Janelia Farm to Harvard) and several generations of revision control system (nothing, to rcs, to cvs, to subversion, to git), and is now hosted in the cloud at eddylab.org, with some of the bigger projects also on GitHub.
It’s one thing to keep software available. It’s entirely another thing to keep it running. Software geeks know the term bit rot: a software package will gradually fall apart and stop running, even if you don’t change a thing in it, because operating systems, computer languages, compilers, and libraries evolve.
We got a bug report on snoscan the other day: it fails to compile on Red Hat Linux. (Adorably, the report also asked “how were you ever able to compile it?”. Well, the trick for compiling 18-year-old code is to use an 18-year-old system.) Indeed it does fail, in part because the modern glibc library puts symbols into the namespace that weren’t in glibc in 1999, causing a 2017 C compiler to fatally screech about name clashes.
I just spent the day unearthing snoscan-0.9b and whacking it back into compilable and runnable shape, which felt sort of like my college days keeping my decrepit 1972 MG running with a hammer and some baling wire, except that I wasn’t cursing Lucas Electrical every five minutes. For those of you who may be trying to build old software from us — and maybe even for you idealists dreaming of ziplessly, magically reproducible computational experiments across decades — here’s a telegraphic tale of supporting a software package that hasn’t been touched since Todd’s thesis was turned in in 1999, in the form of a few notes on the main issues, followed by a change log for the hammers and baling wire I just used to get our snoscan distribution working again.
Imagine you’re a legal US resident, with a legal position in the US. You’re away visiting your family in another country, and when you try to fly back home to the US, the US government won’t let you back in. You could be a PhD student or a postdoc in a research lab here, living here for years, all your stuff and your friends are here – doesn’t matter. Tough for you. Go home to your country, US border officials tell you.
Bullshit, right? Right. Now go sign this.
These are not American values. It has to stop.
Michele Clamp is building up Harvard’s bioinformatics team on the college (Cambridge) campus, and she’s got an open position in collaboration with Susan Mango’s group:
The Mango Lab in the Harvard Department of Molecular and Cellular Biology, in collaboration with the FAS Informatics Group, seeks a bioinformatician for genomics analysis. The successful candidate will primarily focus their work on the Mango lab, while embedded within a cohort of informaticians who develop and apply cutting-edge approaches to a range of biological questions. Current projects within the Mango lab use genomics to study transcriptional regulation and chromatin dynamics during development. ChIP-seq, FAIRE-seq, RNA-seq, Oligopaint, etc. are used to probe the chromatin landscape as cells transition from pluripotency to cell-fate commitment, and the successful candidate will collaborate with lab members to analyze the datasets. The remaining effort will be used to support groups in the Faculty of Arts and Sciences (FAS) that use genomic approaches such as ATAC-seq, ChIP-seq, FAIRE-seq, and related methodologies. This will include consulting with faculty groups, teaching workshops on statistical and informatics approaches for understanding these methods, and developing best-practice recommendations and benchmarks.
For more information, or to apply, [see here.]