Biological Data Analysis

I'm starting to plan a new Harvard course that'll be called Biological Data Analysis. Biology is going through a culture change. It's suddenly become a data-rich, computational analysis-heavy science. Are we going to outsource data analysis to bioinformaticians and data science specialists, or are biologists going to analyze their own data? There are always advantages to specialization, and we need bioinformatics and data science. But I also feel that we are dangerously weak in training biologists to think about their own data. The usual response I get when I talk about it is something like "you can't expect wet lab biologists to learn how to program".

What I want to teach in Biological Data Analysis is that writing scripts and using the command line for data analysis is not software engineering, it's just a simple and essential thing that a wet lab biologist can do, and needs to do. I'm going to teach from the point of view that biologists already have a special advantage in large-scale data analysis: we are trained to expect that we will be screwed by our experiments. We should be treating data analysis the same way. Like doing an experiment on a complicated organism, any given data analysis only gives you a narrow glimpse into a large data set. God only knows what else is going on in the data that you're not seeing. Like doing experiments, you need to design positive and negative controls to protect yourself from the hundred different ways that nature (and computers) are going to mess with you. Writing scripts that generate positive and negative controls for a data analysis is a powerful and biologically motivated thing to know how to do.

Once you're generating negative control data -- "here's what the data would look like if there were no effect to be found" -- you're actually doing statistics, but in an intuitive and motivated way that any biologist can understand. Instead of learning a bunch of incantations and lore about t-tests, you're forced to think directly about what your null hypothesis is, because you have to make a negative control data set according to that null hypothesis. The "p-value" is directly the probability that you observe a signal at least as strong as your real one in your negative control. This style of simulation-driven analysis is enabled by modern computational power plus the ability to write simple scripts and use the command line. You don't have to learn statistics per se. You have to learn how to do computational control experiments. I expect that if a biologist learns the simulation-driven style of analysis first, then they're motivated to go on to learn more serious statistical analysis as they need it... and they're armed with a powerful way to check analytic results against intuitive simulations.
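
To make the idea of a computational negative control concrete, here's a minimal sketch in Python. It's only an illustration, not the course's method: the function names, the choice of "difference in group means" as the signal, and the measurements are all made up for the example. The null hypothesis is that the treated/control labels don't matter, so the negative control data sets are generated by shuffling the labels.

```python
import random


def mean(xs):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(xs) / len(xs)


def negative_control_p_value(treated, control, n_sims=10000, seed=0):
    """Estimate a p-value by simulating negative controls.

    Null hypothesis (an assumption of this sketch): the group labels are
    irrelevant, so shuffling them gives data with no effect to be found.
    The "signal" is the absolute difference between the group means.
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    observed = abs(mean(treated) - mean(control))
    pooled = list(treated) + list(control)
    n_treated = len(treated)

    hits = 0
    for _ in range(n_sims):
        rng.shuffle(pooled)  # one negative control data set
        simulated = abs(mean(pooled[:n_treated]) - mean(pooled[n_treated:]))
        if simulated >= observed:
            hits += 1

    # Count the real data as one draw so the estimate is never exactly zero.
    return (hits + 1) / (n_sims + 1)


if __name__ == "__main__":
    # Made-up measurements, purely for illustration.
    treated = [5.1, 4.9, 5.6, 5.8, 6.0, 5.5]
    control = [4.1, 3.9, 4.4, 4.0, 4.3, 4.2]
    print("estimated p-value:", negative_control_p_value(treated, control))
```

The returned number is just the fraction of negative controls that show a signal at least as big as the real one. A positive control works the same way in reverse: add a known effect to shuffled data and check that the analysis finds it.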

If this makes any sense to you, and if you happen to be a Harvard PhD student graduating this year, and you're thinking it might be nice to take a year and do some teaching... boy, do I have a deal for you. Harvard has a thing called the College Fellows Program. This is a one-year position (renewable for one more) that focuses on teaching and course development. We've just posted an ad looking for a College Fellow to help me develop and teach Biological Data Analysis. Application deadline is April 15. Feel free to contact me directly with questions!