Select Agents should be defined by DNA sequence

The National Academy of Sciences has just released the report Sequence-Based Classification of Select Agents: A Brighter Line. A committee of 13 of us, chaired by Jim LeDuc (Director, Galveston National Laboratory, University of Texas Medical Branch) and cat-herded by India Hook-Barnard (NAS), has been working on this report for the past year or so. It's good to see it done. There is an NAS press release, and Nature and Science, among others, have already picked up the story.

The Select Agents are a list of 82 things (bacteria, viruses, fungi, protein toxins, and metabolite toxins) that are considered to be sufficiently dangerous that there are special regulations controlling who can possess or ship them. The problem with a list of names like "Ebola" or "Bacillus anthracis" (anthrax) is that biological entities don't really conform to nicely discrete names; "species" is an awkward word, especially for viruses and bacteria. What is "anthrax", really, when bacilli like Bacillus anthracis and Bacillus cereus seem to be swapping plasmids around that may or may not encode pathogenicity determinants? As long as strains are obtained by a "chain of custody" -- i.e., someone ships someone else a sample of a known, already labeled agent -- a system based on taxonomic names doesn't work too badly. The only wrinkle comes when a new strain is isolated from nature and has to be named (taxonomically classified), but viral and microbial taxonomy isn't too bad, especially when the biological characteristics of the bug are at least partially observable clinically or in the lab.

Synthetic genomics is a new (-ish) way to obtain organisms. Starting with positive-strand RNA viruses, we're slowly but steadily developing the technology to "boot up" organisms from synthetic nucleic acid, DNA or RNA. The Select Agent Regulations deal with this by including the "complete", "infectious" DNA or RNA genome of a Select Agent, not just the organism itself. But if you're a DNA synthesis company, and someone sends you an order for a piece of DNA that isn't exactly identical to any Select Agent, but shows sequence similarity to a Select Agent, what to do? Lots of innocuous things show sequence similarity to Select Agents; almost all their genes are conserved things like polymerases or whatnot that have little to do with pathogenesis per se. And vaccine strains and nonpathogenic research strains could be nearly 100% identical, of course. Is the DNA synthesis company supposed to make a decision on their own? Where should they draw the line? If they are held to have constructed and/or shipped a complete Select Agent genome, they've committed a crime. You can't just tell them that Select Agents are functionally defined by biological assays, because that's a catch-22: if they synthesize and boot an organism to test if it's viable and pathogenic, they've already committed the crime. They need to be able to tell unambiguously, just from DNA sequence alone, if the sequence satisfies the Select Agent program definition of a Select Agent. Otherwise the application of the Select Agent regulations to genome sequences is going to get ever more murky.

Because what about modifications to Select Agent genomes -- could someone evade the regulations by making silent mutations in a synthetic genome? What about chimeras and synthetic biology designs, mixing and matching nasty bits? What about an entirely de novo designed superpathogen, with no sequence similarity to anything we've seen before?

These are the concerns that led to our committee's charge. We were asked to consider what it would take to replace the current list-based system for defining the Select Agents by taxonomic names, with a predictive system that predicts complete-Select-Agent-ness from a DNA sequence. Everyone recognizes that that sort of prediction is currently infeasible (and our report gives chapter and several appended verses on why), but we were asked to lay out the research milestones towards making it feasible.

We argued that this can't and shouldn't be done. It can't be done, because things don't just get on the Select Agent list because of biological properties encoded in their genome; other criteria are in play, such as whether the organism is easy to obtain from the wild (thus futile to regulate access to it) or whether there's a vaccine or medical countermeasures. It shouldn't be done, because actively trying to develop a system for predicting Select-Agent-ness from sequence is exactly what would enable the design of novel Select Agent pathogens. We don't want to develop this ability out in front of other predictive ability in biology, such as our ability to engineer good stuff (biofuels, biomaterials) and, in particular, good countermeasures against pathogens.

But "no, and don't" is a pretty clippy answer to the charge, and it doesn't address the problem -- what should we do about the challenges that synthetic genomics poses to the name-based Select Agent list?

Our committee went on to propose that sequence-based classification is technically feasible. Computational tools like the Pfam database (for example) are used to operationally define inclusion in particular biological sequence families. It would be relatively straightforward to operationally define each Select Agent in terms of some number of essential "parts" (genes), and a sequence-based classification that defined the "sequence space" around each part. The report dwells on the technical details of this at some length. The key idea is that, given a set of sequences that you do want labeled as Select Agents, and a set of sequences that you do not want regulated (everything else known in biology, including vaccines and nonpathogenic research strains), it is always possible to achieve 100% classification accuracy on those known sequences. The current state of knowledge of Select Agents and non-Select Agents can be captured. The problem, of course, would come in how the classification system dealt with newly isolated organisms and genome sequences; new things might get classified in undesired ways (in either direction), so sequence-based definitions need to be updated as our state of knowledge changes.
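
To make the "100% on what we already know" point concrete, here is a toy sketch in Python. It is not the report's method, and it is not how Pfam works (Pfam uses profile hidden Markov models); it swaps in a much cruder nearest-neighbor rule on shared k-mers, just to show why known sequences can always be sorted correctly. The function names and the short sequences are all made up for illustration, and the sketch assumes no sequence appears in both sets and no two sequences share an identical k-mer set.

```python
def kmers(seq, k=4):
    """Return the set of all length-k substrings of a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity between two k-mer sets (1.0 means identical sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_regulated(query, regulated, exempt, k=4):
    """Nearest-neighbor rule: call the query 'regulated' only if its closest
    known sequence is in the regulated set. Ties go to 'not regulated'."""
    q = kmers(query, k)
    best_regulated = max(jaccard(q, kmers(s, k)) for s in regulated)
    best_exempt = max(jaccard(q, kmers(s, k)) for s in exempt)
    return best_regulated > best_exempt


# Hypothetical toy data: two "regulated" sequences, plus an exempt set that
# includes a one-substitution relative (think vaccine strain) and an
# unrelated sequence.
regulated = [
    "ATGGCTAGCTAGGATCCGATCGATTACG",
    "ATGGCTAGCTAGGATCCGATCGTTTACG",
]
exempt = [
    "ATGGCTAGCTAGGATCCGATCGATAACG",  # differs from regulated[0] at one position
    "ATGCCCGGGTTTAAACCCGGGTTTAAAC",  # unrelated
]

# Every known sequence is its own nearest neighbor, so each lands on the
# side it was assigned to.
known_ok = (all(is_regulated(s, regulated, exempt) for s in regulated)
            and not any(is_regulated(s, regulated, exempt) for s in exempt))
print("all known sequences classified as intended:", known_ok)
```

The catch is the one described above: a newly isolated sequence that is in neither set just gets whatever label its nearest known neighbor has, which may or may not be the label you'd want. That is why the two sets, and the definitions built from them, would need ongoing curation.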

Because such a system is a little cumbersome and pedantic (you'd need to employ a team of sequence analyst/curators, plus scientific advisors expert in each Select Agent pathogen and its nearest non-SA relatives, and that means not only financial cost but opportunity cost -- those people could be off finding cures for pathogens, rather than pedantically defining them!), and because it means a long-term investment (because of the need for updating), and because we'd already wandered a bit from our actual charge, we stopped short of actually recommending that a sequence-based classification system be built.

Instead, we recommended that the Select Agent list of names should be augmented by a sequence-based definition of what each name really means (the advent of synthetic genomics pretty much forces this), and said that our proposal offered one way that could be done. We also mention that it would help a lot if the Select Agent list were a lot shorter than it is now, focusing heavyweight regulatory effort on the truly most dangerous pathogens (smallpox is the worst). Thankfully that's consistent with an Obama administration Executive Order that just appeared a few weeks ago, and consistent with recommendations of several other panels.

We went on to discuss how the same system could be used as part of a "yellow flag" biosafety system. The Select Agent regulations only cover "complete" genomes. What about pieces of SAs? It would be prudent to keep an eye out for such things, and follow up to make sure that anyone ordering pieces of SA genomes is legitimate and knows what they're doing. DNA synthesis companies already screen their orders, but there has been a lot of discussion (and recent Federal Register guidance) on how to harmonize that screening, to be sure everyone has a set of best practices.
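
As a rough illustration of what a "yellow flag" check on pieces might look like, here is a minimal Python sketch. It is an assumption-laden toy, not the report's proposal and not any company's actual screening pipeline: the function names, the k-mer length, the 50% threshold, and the sequences are all hypothetical. It flags an order for human follow-up when a large fraction of its k-mers also occur in some regulated genome.

```python
def kmers(seq, k=8):
    """Return the set of all length-k substrings of a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def yellow_flag(order, regulated_genomes, k=8, min_fraction=0.5):
    """Return True if at least min_fraction of the order's k-mers are found
    in any regulated genome. A flag means 'follow up', not 'refuse'."""
    order_kmers = kmers(order, k)
    if not order_kmers:  # order shorter than k: nothing to compare
        return False
    regulated_kmers = set().union(*(kmers(g, k) for g in regulated_genomes))
    shared = len(order_kmers & regulated_kmers) / len(order_kmers)
    return shared >= min_fraction


# Hypothetical toy data.
genome = "ATGGCTAGCTAGGATCCGATCGATTACGGCTTAACCGGTATCGA"
fragment_order = genome[10:35]                      # a piece of the genome
unrelated_order = "TTTTTTTTTTCCCCCCCCCCAAAAAAAAAA"  # shares no 8-mers with it

print("fragment flagged:", yellow_flag(fragment_order, [genome]))
print("unrelated flagged:", yellow_flag(unrelated_order, [genome]))
```

The flag here is deliberately loose: it only says "someone should take a look", which is a different job from deciding whether a sequence legally counts as a complete Select Agent genome.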

There's all that and more in the 216-page report. It will be interesting to see how the press fits all this into concise stories. It just took me 1200+ words, I see.