Cloud computing services offered by companies such as Amazon, Microsoft and Google have put high-performance computing in the hands of ordinary researchers. And over the past decade, a project called Bioconductor has done something similar for the often complex field of bioinformatics.
Started in 2001 by a group of bioinformaticians led by Robert Gentleman, then at Harvard University’s T. H. Chan School of Public Health in Boston, Massachusetts, the Bioconductor project provides a suite of software that researchers and engineers can use to analyze, visualize and exchange genomic data. The project contains thousands of tools for computational molecular biology, all of which work together in the statistical programming language R, typically in combination with the RStudio programming environment.
Like R and RStudio, Bioconductor is open source; all three can be downloaded and installed for free. But genomic data sets can be large, and can require more computing power, memory or disk space than researchers expect. Fortunately, Bioconductor also comes as a ready-made configuration that can run on the virtually unlimited resources of the Amazon cloud or similar services, at a cost of less than $0.20 per hour.
In effect, the cloud has made computer hardware cheap to access, and Bioconductor has done the same for genomics software. Good documentation, tutorials and courses make these methods accessible to IT professionals and beginners alike. The promise of Bioconductor running in the cloud is to reduce the cost of computational biology while flattening the learning curve for molecular biologists and improving their productivity.
But is it really as simple as it seems? I talked to Bioconductor experts and beginners, and then tried a genomics analysis in the cloud myself.
The first step was to select an experimental task and data. There’s no shortage of options. The latest update (3.6), released at the end of October, includes about 1,500 software packages, 326 experimental data packages and 911 annotation packages. The PubMed literature database indexes about 1,000 articles that mention using Bioconductor in one way or another. In a 2017 study1, the software was used to study gene expression patterns in people with a life-threatening parasitic disease before and after treatment. Another study analyzed metabolite profiles from drought-affected vineyards.
According to Michael Love, a geneticist at the University of North Carolina at Chapel Hill, Bioconductor “has become the standard for performing various tasks on high-throughput genomic data, such as analyzing gene expression, as well as for matching genomic annotations from different sources.” The project is increasingly being used for epigenetics and metagenomics, image processing and proteomics.
Love now requires students in his introductory graduate course in computational biology to learn how to use Bioconductor. So I sought advice from Love’s doctoral student Anqi Zhu, who started using Bioconductor to analyze differential expression in transcriptomic data about a year ago. Zhu recommends working through the various tutorials and manuals, as well as the practical demonstrations, called vignettes, on Bioconductor.org (see “Immersion in Bioconductor”).
Peer-reviewed tutorials called workflows are also available, and are updated as the platform evolves. One such workflow3, co-authored by Love, guides readers through differential expression analysis of RNA sequencing data. I used this workflow to guide my exploration.
I had previously opened an account with Amazon Elastic Compute Cloud (EC2) and set up a cloud server. But for those who are unfamiliar with cloud computing, Bioconductor provides step-by-step instructions. Installing Bioconductor on an EC2 server with 4 processor cores and 16 gigabytes of memory simply requires entering the ID code of the desired Bioconductor configuration, selecting a few options and clicking the “start” button. In less than an hour, I had started the server, connected to the RStudio software running on it, and begun Love’s workflow. The workflow mines RNA sequencing data collected from human airway cells to identify genes that are differentially expressed when the cells are treated with corticosteroids.
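For readers who want a feel for what such an analysis looks like at the console, here is a minimal sketch in R. It is not the workflow itself: it assumes the DESeq2 analysis package and the airway example data package, neither of which is named above, and it installs them through BiocManager, the installer used by recent Bioconductor releases (older releases such as 3.6 used a different installer script). Running it requires network access and a few minutes of compute time.

```r
# Install the BiocManager helper from CRAN if it is not already present
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Assumption: DESeq2 (analysis) and airway (example data) are the packages used here
BiocManager::install(c("DESeq2", "airway"), update = FALSE, ask = FALSE)

library(DESeq2)
library(airway)

# Load the example RNA-seq counts: airway cell lines, with and without steroid treatment
data("airway")

# Make the untreated samples the reference level for the comparison
airway$dex <- relevel(airway$dex, ref = "untrt")

# Model counts as a function of cell line and treatment, then fit the model
dds <- DESeqDataSet(airway, design = ~ cell + dex)
dds <- DESeq(dds)

# Extract per-gene results for treated versus untreated samples
res <- results(dds)

# Show the genes with the smallest adjusted p-values
head(res[order(res$padj), ])
```

The design formula `~ cell + dex` asks for the treatment effect while accounting for differences between cell lines; a real analysis would add the filtering, diagnostics and plots that a full workflow covers.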
Most of the work in Bioconductor is done by typing R code directly at the RStudio command line, rather than by pointing and clicking with the mouse. The RStudio development environment executes R commands and displays the results. It also provides interactive help for R and Bioconductor functions and can display the values of variables and data structures, which are useful features for debugging code.
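To illustrate this command-line style, the following sketch uses only base R, so it runs without any Bioconductor packages. The count table is a made-up toy example, and the gene and sample names are hypothetical.

```r
# Hypothetical toy data: read counts for 3 genes across 4 samples (not real measurements)
counts <- matrix(
  c( 10,  12,  40,  44,
    100,  90,  95, 105,
      5,   7,   1,   2),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"),
                  c("untrt1", "untrt2", "trt1", "trt2"))
)

# Inspect the structure of the object, much as the RStudio environment pane does
str(counts)

# Average the counts within each group of samples
treated   <- rowMeans(counts[, c("trt1", "trt2")])
untreated <- rowMeans(counts[, c("untrt1", "untrt2")])

# Naive log2 fold change per gene (real tools also model variability between samples)
log2fc <- log2(treated / untreated)
print(round(log2fc, 2))

# Typing ?rowMeans or help("rowMeans") at the console opens that function's documentation
```

Each line can be run on its own at the console, and intermediate objects such as `treated` remain available for inspection afterwards, which is what makes this style convenient for debugging.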