DESeq2 has an official extension within the phyloseq package and an accompanying vignette. The vignette has been copied/included here for continuity, and as you can see, phyloseq_to_deseq2
does not need to be defined before using it because it is already available when you load phyloseq.
If you find this extension or tutorial useful in your work, please cite the following:
McMurdie and Holmes (2014) Waste Not, Want Not: Why Rarefying Microbiome Data is Inadmissible. PLoS Computational Biology in press
McMurdie and Holmes (2013) phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data. PLoS ONE. 8(4):e61217
Love MI, Huber W, Anders S (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15(12): 550
In this example I use the publicly available data from a study on colorectal cancer:
Genomic analysis identifies association of Fusobacterium with colorectal carcinoma. Kostic, A. D., Gevers, D., Pedamallu, C. S., Michaud, M., Duke, F., Earl, A. M., et al. (2012). Genome research, 22(2), 292-298.
As a side-note, this work was published ahead of print in Genome Research alongside a highly-related article from a separate group of researchers (long-live reproducible observations!): Fusobacterium nucleatum infection is prevalent in human colorectal carcinoma. In case you are interested. For the purposes of example, however, we will stick to the data from the former study, with data available at the microbio.me/qiime server.
Study ID: 1457
Project Name: Kostic_colorectal_cancer_fusobacterium
Study Abstract: The tumor microenvironment of colorectal carcinoma is a complex community of genomically altered cancer cells, nonneoplastic cells, and a diverse collection of microorganisms. Each of these components may contribute to carcino genesis; however, the role of the microbiota is the least well understood. We have characterized the composition of the microbiota in colorectal carcinoma using whole genome sequences from nine tumor/normal pairs. Fusobacterium sequences were enriched in carcinomas, confirmed by quantitative PCR and 16S rDNA sequence analysis of 95 carcinoma/normal DNA pairs, while the Bacteroidetes and Firmicutes phyla were depleted in tumors. Fusobacteria were also visualized within colorectal tumors using FISH. These findings reveal alterations in the colorectal cancer microbiota; however, the precise role of Fusobacteria in colorectal carcinoma pathogenesis requires further investigation.
Start by loading phyloseq.
library("phyloseq"); packageVersion("phyloseq")
## [1] '1.7.12'
Defined file path, and import the published OTU count data into R.
filepath = system.file("extdata", "study_1457_split_library_seqs_and_mapping.zip", package="phyloseq")
kostic = microbio_me_qiime(filepath)
## Found biom-format file, now parsing it...
## Done parsing biom...
## Importing Sample Metdadata from mapping file...
## Merging the imported objects...
## Successfully merged, phyloseq-class created.
## Returning...
Here I had to use a relative file path so that this example works on all systems that have phyloseq installed. In practice, your file path will look like this (if you've downloaded the data ahead of time):
filepath = "~/Downloads/study_1457_split_library_seqs_and_mapping.zip"
kostic = microbio_me_qiime(filepath)
Or like this (if you're accessing data directly from the microbio.me/qiime server directly):
kostic = microbio_me_qiime(1457)
In this example I'm using the major sample covariate, DIAGNOSIS
, as the study design factor. The focus of this study was to compare the microbiomes of pairs of healthy and cancerous tissues, so this makes sense. Your study could have a more complex or nested design, and you should think carefully about the study design formula, because this is critical to the test results and their meaning. You might even need to define a new factor if none of the variables in your current table appropriately represent your study's design. See the DESeq2 home page for more details.
Here is the summary of the data variable kostic
that we are about to use, as well as the first few entries of the DIAGNOSIS
factor.
kostic
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 2505 taxa and 190 samples ]
## sample_data() Sample Data: [ 190 samples by 71 sample variables ]
## tax_table() Taxonomy Table: [ 2505 taxa by 7 taxonomic ranks ]
head(sample_data(kostic)$DIAGNOSIS, 25)
## [1] Healthy Tumor Tumor Healthy Healthy Healthy Tumor Healthy
## [9] Healthy Healthy Healthy Healthy Healthy Healthy Healthy Tumor
## [17] Healthy Healthy Healthy Healthy Healthy Tumor Tumor Tumor
## [25] Healthy
## Levels: Healthy None Tumor
Unfortunately, the diagnosis variable has a third placeholder class indicating that no diagnosis was given ("None"
). For the purposes of testing, these samples will be removed.
kostic = subset_samples(kostic, DIAGNOSIS != "None")
kostic
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 2505 taxa and 185 samples ]
## sample_data() Sample Data: [ 185 samples by 71 sample variables ]
## tax_table() Taxonomy Table: [ 2505 taxa by 7 taxonomic ranks ]
First load DESeq2.
library("DESeq2")
packageVersion("DESeq2")
## [1] '1.2.10'
The following two lines actually do all the complicated DESeq2 work. The function phyloseq_to_deseq2
converts your phyloseq-format microbiome data into a DESeqDataSet
with dispersions estimated, using the experimental design formula, also shown (the ~DIAGNOSIS
term). The DESeq
function does the rest of the testing, in this case with default testing framework, but you can actually use alternatives.
diagdds = phyloseq_to_deseq2(kostic, ~ DIAGNOSIS)
diagdds = DESeq(diagdds, test="Wald", fitType="parametric")
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
Note: The default multiple-inference correction is Benjamini-Hochberg, and occurs within the DESeq
function.
The following results
function call creates a table of the results of the tests. Very fast. The hard work was already stored with the rest of the DESeq2-related data in our latest version of the diagdds
object (see above). I then order by the adjusted p-value, removing the entries with an NA
value. The rest of this example is just formatting the results table with taxonomic information for nice(ish) display in the HTML output.
res = results(diagdds, cooksCutoff = FALSE)
alpha = 0.01
sigtab = res[which(res$padj < alpha), ]
sigtab = cbind(as(sigtab, "data.frame"), as(tax_table(kostic)[rownames(sigtab), ], "matrix"))
head(sigtab)
## baseMean log2FoldChange lfcSE stat pvalue padj Kingdom
## 304309 1.686 -0.6409 0.1823 -3.516 4.380e-04 0.0018912 Bacteria
## 16076 13.805 -0.8583 0.2094 -4.099 4.144e-05 0.0002678 Bacteria
## 561483 1.866 -0.6382 0.1809 -3.529 4.178e-04 0.0018193 Bacteria
## 177005 8.598 -0.7467 0.1905 -3.920 8.865e-05 0.0004876 Bacteria
## 469778 9.049 -0.6633 0.2209 -3.003 2.673e-03 0.0080976 Bacteria
## 308873 4.165 -0.5360 0.1794 -2.988 2.809e-03 0.0084662 Bacteria
## Phylum Class Order
## 304309 Firmicutes Clostridia Clostridiales
## 16076 Firmicutes Clostridia Clostridiales
## 561483 Actinobacteria Actinobacteria (class) Bifidobacteriales
## 177005 Firmicutes Clostridia Clostridiales
## 469778 Bacteroidetes Bacteroidia Bacteroidales
## 308873 Firmicutes Clostridia Clostridiales
## Family Genus Species
## 304309 Lachnospiraceae Blautia Blautia producta
## 16076 Ruminococcaceae Ruminococcus Ruminococcus bromii
## 561483 Bifidobacteriaceae Bifidobacterium Bifidobacterium longum
## 177005 Lachnospiraceae Blautia <NA>
## 469778 Bacteroidaceae Bacteroides Bacteroides coprophilus
## 308873 Ruminococcaceae Clostridium Clostridium orbiscindens
dim(sigtab)
## [1] 218 13
Let's look at the OTUs that were significantly different between the two tissues. The following makes a nice ggplot2 summary of the results.
library("ggplot2")
theme_set(theme_bw())
scale_fill_discrete <- function(palname = "Set1", ...) {
scale_fill_brewer(palette = palname, ...)
}
# Phylum order
x = tapply(sigtab$log2FoldChange, sigtab$Phylum, function(x) max(x))
x = sort(x, TRUE)
sigtab$Phylum = factor(as.character(sigtab$Phylum), levels=names(x))
# Genus order
x = tapply(sigtab$log2FoldChange, sigtab$Genus, function(x) max(x))
x = sort(x, TRUE)
sigtab$Genus = factor(as.character(sigtab$Genus), levels=names(x))
ggplot(sigtab, aes(x=Genus, y=log2FoldChange, color=Phylum)) + geom_point(size=6) +
theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust=0.5))