ABSTRACT: In biology and medicine, data growth is driven by ever faster data acquisition techniques. Not only is handling the sheer volume of data becoming troublesome (a single sequencer can generate terabytes per day), but the nature of analysis is also changing. Where biologists could previously get away with massive data reduction, e.g. focusing on single genes in a few individuals, we now gather multiple samples from hundreds of thousands of humans, at different times in their lives, and need to compare everything with everything, massively increasing computational cost.
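The scaling problem can be made concrete with a back-of-the-envelope sketch: the number of unordered all-against-all comparisons grows quadratically with sample count. The sample sizes below are hypothetical, chosen only to illustrate the growth.

```python
def n_pairs(n):
    """Number of unordered pairwise comparisons among n samples: n*(n-1)/2."""
    return n * (n - 1) // 2

# Going from hundreds to hundreds of thousands of samples multiplies
# the pairwise workload by roughly a millionfold.
for n in [100, 10_000, 100_000]:
    print(f"{n:>7} samples -> {n_pairs(n):,} pairwise comparisons")
```

For 100 samples this is 4,950 comparisons; for 100,000 samples it is close to five billion, which is why single-machine analysis stops being viable.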
We predict that biology will become one of the largest consumers of computational time, and that computational biologists need to learn quickly from HPC scientists. Unfortunately, there is a gap in understanding of the physical properties of computer hardware, and an underestimation of the intricacies of parallelized computation, as we explain in 'Big data, but are we ready?' (see http://www.nature.com/nrg/journal/v12/n3/full/nrg2857-c1.html).
We work on software (R/qtl) for big-data genetical genomics, which uses powerful statistical techniques to correlate sections of the genome (DNA) with gene expression and metabolite traits. We study three host-pathogen combinations (mouse-virus, worm-bacteria, and plant-worm), using RNA-seq next-generation sequencing to measure gene expression at the time of infection and to find interactions. The analysis of these data has grown beyond the single desktop computer, and we see this as an opportunity to create next-generation parallelized software for biology.
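The core computation described above is embarrassingly parallel: each genome marker can be correlated against all traits independently. A minimal sketch of that kernel (this is an illustration in Python with NumPy, not R/qtl's actual implementation; the data shapes and names are assumptions) might look like:

```python
import numpy as np

# Hypothetical data: genotype markers and expression traits, one column per sample.
rng = np.random.default_rng(0)
markers = rng.integers(0, 2, size=(200, 50)).astype(float)  # 200 markers x 50 samples
traits = rng.normal(size=(1000, 50))                        # 1000 traits x 50 samples

def marker_trait_correlations(markers, traits):
    """Pearson correlation of every marker row against every trait row."""
    m = markers - markers.mean(axis=1, keepdims=True)  # center each marker
    t = traits - traits.mean(axis=1, keepdims=True)    # center each trait
    num = m @ t.T                                      # cross-products
    den = np.outer(np.sqrt((m * m).sum(axis=1)),
                   np.sqrt((t * t).sum(axis=1)))
    return num / den                                   # (markers x traits) matrix

cors = marker_trait_correlations(markers, traits)
print(cors.shape)  # one correlation per marker-trait pair

# Because marker rows are independent, the loop over markers can be
# partitioned across cores (e.g. multiprocessing) or cluster nodes (e.g. MPI).
```

The independence of rows is exactly what makes this workload a natural candidate for the parallelized, beyond-desktop software the abstract argues for.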