Define the extension. Do not need to have anything loaded yet to do this.
################################################################################
#' Convert phyloseq OTU count data into DGEList for edgeR package
#'
#' Further details.
#'
#' @param physeq (Required). A \code{\link{phyloseq-class}} or
#' an \code{\link{otu_table-class}} object.
#' The latter is only appropriate if \code{group} argument is also a
#' vector or factor with length equal to \code{nsamples(physeq)}.
#'
#' @param group (Required). A character vector or factor giving the experimental
#' group/condition for each sample/library. Alternatively, you may provide
#' the name of a sample variable. This name should be among the output of
#' \code{sample_variables(physeq)}, in which case
#' \code{get_variable(physeq, group)} would return either a character vector or factor.
#' This is passed on to \code{\link[edgeR]{DGEList}},
#' and you may find further details or examples in its documentation.
#'
#' @param method (Optional). The label of the edgeR-implemented normalization to use.
#' See \code{\link[edgeR]{calcNormFactors}} for supported options and details.
#' The default option is \code{"RLE"}, which is a scaling factor method
#' proposed by Anders and Huber (2010).
#' At time of writing, the \link[edgeR]{edgeR} package supported
#' the following options to the \code{method} argument:
#'
#' \code{c("TMM", "RLE", "upperquartile", "none")}.
#'
#' @param ... Additional arguments passed on to \code{\link[edgeR]{DGEList}}
#'
#' @examples
#'
phyloseq_to_edgeR = function(physeq, group, method="RLE", ...){
require("edgeR")
require("phyloseq")
# Enforce orientation.
if( !taxa_are_rows(physeq) ){ physeq <- t(physeq) }
x = as(otu_table(physeq), "matrix")
# Add one to protect against overflow, log(0) issues.
x = x + 1
# Check `group` argument
if( identical(all.equal(length(group), 1), TRUE) & nsamples(physeq) > 1 ){
# Assume that group was a sample variable name (must be categorical)
group = get_variable(physeq, group)
}
# Define gene annotations (`genes`) as tax_table
taxonomy = tax_table(physeq, errorIfNULL=FALSE)
if( !is.null(taxonomy) ){
taxonomy = data.frame(as(taxonomy, "matrix"))
}
# Now turn into a DGEList
y = DGEList(counts=x, group=group, genes=taxonomy, remove.zeros = TRUE, ...)
# Calculate the normalization factors
z = calcNormFactors(y, method=method)
# Check for division by zero inside `calcNormFactors`
if( !all(is.finite(z$samples$norm.factors)) ){
stop("Something wrong with edgeR::calcNormFactors on this data,
non-finite $norm.factors, consider changing `method` argument")
}
# Estimate dispersions
return(estimateTagwiseDisp(estimateCommonDisp(z)))
}
################################################################################
If you find this extension or tutorial useful in your work, please cite the following:
McMurdie and Holmes (2014) Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible PLoS Computational Biology 10(4):e1003531.
McMurdie and Holmes (2013) phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data. PLoS ONE 8(4):e61217
Robinson MD, McCarthy DJ, Smyth GK (2009) edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics (Oxford, England) 26: 139–140
date()
## [1] "Thu Jul 17 17:20:03 2014"
library("phyloseq"); packageVersion("phyloseq")
## [1] '1.9.11'
library("edgeR"); packageVersion("edgeR")
## [1] '3.6.7'
I use here the same example dataset as in the DESeq2 vignette, the publicly available data from a study on colorectal cancer:
Genomic analysis identifies association of Fusobacterium with colorectal carcinoma. Kostic, A. D., Gevers, D., Pedamallu, C. S., Michaud, M., Duke, F., Earl, A. M., et al. (2012). Genome research, 22(2), 292-298.
This work was published ahead of print in Genome Research alongside a highly-related article from a separate group of researchers (hooray for reproducible observations!): Fusobacterium nucleatum infection is prevalent in human colorectal carcinoma. In case you are interested. For the purposes of example, however, we will stick to the data from the former study, with data available at the microbio.me/qiime server.
Study ID: 1457
Project Name: Kostic_colorectal_cancer_fusobacterium
Study Abstract: The tumor microenvironment of colorectal carcinoma is a complex community of genomically altered cancer cells, nonneoplastic cells, and a diverse collection of microorganisms. Each of these components may contribute to carcino genesis; however, the role of the microbiota is the least well understood. We have characterized the composition of the microbiota in colorectal carcinoma using whole genome sequences from nine tumor/normal pairs. Fusobacterium sequences were enriched in carcinomas, confirmed by quantitative PCR and 16S rDNA sequence analysis of 95 carcinoma/normal DNA pairs, while the Bacteroidetes and Firmicutes phyla were depleted in tumors. Fusobacteria were also visualized within colorectal tumors using FISH. These findings reveal alterations in the colorectal cancer microbiota; however, the precise role of Fusobacteria in colorectal carcinoma pathogenesis requires further investigation.
filepath = system.file("extdata", "study_1457_split_library_seqs_and_mapping.zip", package = "phyloseq")
kostic = microbio_me_qiime(filepath)
## Warning: NAs introduced by coercion
## Found biom-format file, now parsing it...
## Done parsing biom...
## Importing Sample Metdadata from mapping file...
## Merging the imported objects...
## Successfully merged, phyloseq-class created.
## Returning...
# Remove the samples for which the DIAGNOSIS was not included
kosticB = subset_samples(kostic, DIAGNOSIS != "None")
Independent filtering is done automatically in DESeq2, but not in DESeq. Here we will filter OTUs for which the variance across all samples is very low, and we’ll do this before ever passing the data to DESeq.
kosticB
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 2505 taxa and 185 samples ]
## sample_data() Sample Data: [ 185 samples by 71 sample variables ]
## tax_table() Taxonomy Table: [ 2505 taxa by 7 taxonomic ranks ]
kosticp = transform_sample_counts(kosticB, function(x){x/sum(x)})
hist(log10(apply(otu_table(kosticp), 1, var)),
xlab="log10(variance)", breaks=50,
main="A large fraction of OTUs have very low variance")
varianceThreshold = 1e-5
keepOTUs = names(which(apply(otu_table(kosticp), 1, var) > varianceThreshold))
kosticB = prune_taxa(keepOTUs, kosticB)
kosticB
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 195 taxa and 185 samples ]
## sample_data() Sample Data: [ 185 samples by 71 sample variables ]
## tax_table() Taxonomy Table: [ 195 taxa by 7 taxonomic ranks ]
Here we’ve used an arbitrary but not-unreasonable variance threshold of 10-5. It is important to keep in mind that this filtering is independent of our downstream test. The sample classifications were not used.
Now let’s use our newly-defined function to convert the phyloseq data object kosticB
into an edgeR “DGE” data object, called dge
.
dge = phyloseq_to_edgeR(kosticB, group="DIAGNOSIS")
# Perform binary test
et = exactTest(dge)
# Extract values from test results
tt = topTags(et, n=nrow(dge$table), adjust.method="BH", sort.by="PValue")
res = tt@.Data[[1]]
alpha = 0.001
sigtab = res[(res$FDR < alpha), ]
sigtab = cbind(as(sigtab, "data.frame"), as(tax_table(kosticB)[rownames(sigtab), ], "matrix"))
dim(sigtab)
## [1] 63 18
head(sigtab)
## Kingdom Phylum Class Order
## 156697 Bacteria Fusobacteria Fusobacteria (class) Fusobacteriales
## 114860 Bacteria Firmicutes Bacilli Lactobacillales
## 211205 Bacteria Firmicutes Clostridia Clostridiales
## 2394 Bacteria Bacteroidetes Flavobacteria Flavobacteriales
## 38061 Bacteria Fusobacteria Fusobacteria (class) Fusobacteriales
## 49616 Bacteria Proteobacteria Gammaproteobacteria Pasteurellales
## Family Genus Species logFC
## 156697 Fusobacteriaceae Leptotrichia Leptotrichia trevisanii 4.370
## 114860 Streptococcaceae Streptococcus <NA> 3.830
## 211205 Lachnospiraceae <NA> <NA> -3.599
## 2394 Flavobacteriaceae Capnocytophaga Capnocytophaga ochracea 3.148
## 38061 Fusobacteriaceae Leptotrichia <NA> 3.554
## 49616 Pasteurellaceae Haemophilus Haemophilus influenzae 3.191
## logCPM PValue FDR Kingdom Phylum
## 156697 12.89 3.214e-27 6.268e-25 Bacteria Fusobacteria
## 114860 11.96 2.670e-22 2.604e-20 Bacteria Firmicutes
## 211205 11.58 1.178e-21 7.659e-20 Bacteria Firmicutes
## 2394 11.08 1.441e-20 7.023e-19 Bacteria Bacteroidetes
## 38061 12.06 4.865e-20 1.897e-18 Bacteria Fusobacteria
## 49616 11.55 9.065e-20 2.946e-18 Bacteria Proteobacteria
## Class Order Family
## 156697 Fusobacteria (class) Fusobacteriales Fusobacteriaceae
## 114860 Bacilli Lactobacillales Streptococcaceae
## 211205 Clostridia Clostridiales Lachnospiraceae
## 2394 Flavobacteria Flavobacteriales Flavobacteriaceae
## 38061 Fusobacteria (class) Fusobacteriales Fusobacteriaceae
## 49616 Gammaproteobacteria Pasteurellales Pasteurellaceae
## Genus Species
## 156697 Leptotrichia Leptotrichia trevisanii
## 114860 Streptococcus <NA>
## 211205 <NA> <NA>
## 2394 Capnocytophaga Capnocytophaga ochracea
## 38061 Leptotrichia <NA>
## 49616 Haemophilus Haemophilus influenzae
Here is a bar plot showing the log2-fold-change, showing Genus and Phylum. Uses some ggplot2 commands.
library("ggplot2"); packageVersion("ggplot2")
## [1] '1.0.0'
theme_set(theme_bw())
scale_fill_discrete <- function(palname = "Set1", ...) {
scale_fill_brewer(palette = palname, ...)
}
sigtabgen = subset(sigtab, !is.na(Genus))
# Phylum order
x = tapply(sigtabgen$logFC, sigtabgen$Phylum, function(x) max(x))
x = sort(x, TRUE)
sigtabgen$Phylum = factor(as.character(sigtabgen$Phylum), levels = names(x))
# Genus order
x = tapply(sigtabgen$logFC, sigtabgen$Genus, function(x) max(x))
x = sort(x, TRUE)
sigtabgen$Genus = factor(as.character(sigtabgen$Genus), levels = names(x))
ggplot(sigtabgen, aes(x = Genus, y = logFC, color = Phylum)) + geom_point(size=6) +
theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0.5))
As expected from the original study abstract and title, Fusobacterium OTUs were most-significantly differentially abundant between the cancerous and healthy samples. If you look closely, two different genera of the Fusobacteria phylum were among the most significantly different, Leptotrichia (the winner) as well as Fusobacterium.
As mentioned above, the design of this experiment is 95 carcinoma/normal pairs, where each pair comes from the same patient. Although the previous tests are valid, they are conservative in that they do not use this extra information regarding the sample-pairs, and in that sense have forfeited extra power. There is support in edgeR for paired tests, and this is officially described in one of the edgeR user guides. It is also demonstrated here in the following.
Diagnosis = get_variable(kosticB, "DIAGNOSIS")
Patient = get_variable(kosticB, "ANONYMIZED_NAME")
# Notice that we have some patients without one of the pairs.
tapply(Patient, Diagnosis, length)
## Healthy Tumor
## 95 90
length(levels(Patient))
## [1] 97
any(tapply(Diagnosis, Patient, length) > 2)
## [1] FALSE
sum(tapply(Diagnosis, Patient, length) < 2)
## [1] 9
# Keep only patients with both healthy and cancer samples
keepPatients = names(which(tapply(Diagnosis, Patient, function(x){length(unique(x))}) == 2))
kosticBpair = subset_samples(kosticB, ANONYMIZED_NAME %in% keepPatients)
Diagnosis = get_variable(kosticBpair, "DIAGNOSIS")
Patient = get_variable(kosticBpair, "ANONYMIZED_NAME")
# With that accounting out of the way, define the design matrix
design = model.matrix(~ Patient + Diagnosis)
Must estimate the dispersions, as always. This is different than in the function shown above.
# Add one to protect against overflow, log(0) issues.
x = as(otu_table(kosticBpair), "matrix") + 1L
taxonomy = data.frame(as(tax_table(kosticBpair), "matrix"))
# Now turn into a DGEList
x = DGEList(counts=x, group=Diagnosis, genes=taxonomy, remove.zeros=TRUE)
# Calculate the normalization factors and estimate dispersion
x = calcNormFactors(x, method="RLE")
x = estimateGLMCommonDisp(x, design)
x = estimateGLMTrendedDisp(x, design)
x = estimateGLMTagwiseDisp(x, design)
As in the edgeR User’s Guide, we proceed to fit a linear model and test for the treatment effect. Note that we can omit the coefficient argument to glmLRT
because the “treatment effect” (in this case the tissue diagnosis) is the last coeffcient in the model.
fit <- glmFit(x, design)
lrt <- glmLRT(fit)
topTags(lrt)
## Coefficient: DiagnosisTumor
## Kingdom Phylum Class Order
## 180285 Bacteria Firmicutes Clostridia Clostridiales
## 72853 Bacteria Firmicutes Clostridia Clostridiales
## 194648 Bacteria Firmicutes Clostridia Clostridiales
## 322235 Bacteria Bacteroidetes Bacteroidia Bacteroidales
## 174809 Bacteria Bacteroidetes Bacteroidia Bacteroidales
## 186029 Bacteria Actinobacteria Actinobacteria (class) Coriobacteriales
## 181056 Bacteria Firmicutes Clostridia Clostridiales
## 157566 Bacteria Bacteroidetes Bacteroidia Bacteroidales
## 248140 Bacteria Bacteroidetes Bacteroidia Bacteroidales
## 261912 Bacteria Firmicutes Clostridia Clostridiales
## Family Genus Species logFC
## 180285 Ruminococcaceae Faecalibacterium <NA> -1.633
## 72853 Ruminococcaceae Faecalibacterium <NA> -1.543
## 194648 <NA> <NA> <NA> -1.165
## 322235 Bacteroidaceae Bacteroides Bacteroides uniformis -1.186
## 174809 Bacteroidaceae Bacteroides <NA> -1.363
## 186029 Coriobacteriaceae Collinsella Collinsella aerofaciens -1.312
## 181056 Ruminococcaceae Faecalibacterium <NA> -1.408
## 157566 Rikenellaceae Alistipes <NA> -1.184
## 248140 Bacteroidaceae Bacteroides Bacteroides caccae -1.205
## 261912 Lachnospiraceae Dorea Dorea formicigenerans -1.019
## logCPM LR PValue FDR
## 180285 15.71 100.56 1.151e-23 2.245e-21
## 72853 13.22 91.30 1.232e-21 1.201e-19
## 194648 11.86 71.30 3.073e-17 1.997e-15
## 322235 13.30 61.57 4.283e-15 2.088e-13
## 174809 15.03 59.30 1.351e-14 5.269e-13
## 186029 11.75 57.26 3.822e-14 1.242e-12
## 181056 14.09 53.90 2.115e-13 5.890e-12
## 157566 13.71 51.04 9.073e-13 2.212e-11
## 248140 14.16 49.28 2.221e-12 4.381e-11
## 261912 12.58 49.26 2.247e-12 4.381e-11
This test detects OTUs that are differentially abundant in the tumor colon mucosa relative to the healthy colon mucosa (control), adjusting for baseline differences between the patients. This test can be viewed as a generalization of a paired t-test.
Re-make the plot.
respair = topTags(lrt, n=nrow(x), adjust.method="BH", sort.by="PValue")
respair = respair@.Data[[1]]
alpha = 0.001
sigtab = respair[(respair$FDR < alpha), ]
sigtab = cbind(as(sigtab, "data.frame"), as(tax_table(kosticB)[rownames(sigtab), ], "matrix"))
dim(sigtab)
## [1] 57 19
head(sigtab)
## Kingdom Phylum Class Order
## 180285 Bacteria Firmicutes Clostridia Clostridiales
## 72853 Bacteria Firmicutes Clostridia Clostridiales
## 194648 Bacteria Firmicutes Clostridia Clostridiales
## 322235 Bacteria Bacteroidetes Bacteroidia Bacteroidales
## 174809 Bacteria Bacteroidetes Bacteroidia Bacteroidales
## 186029 Bacteria Actinobacteria Actinobacteria (class) Coriobacteriales
## Family Genus Species logFC
## 180285 Ruminococcaceae Faecalibacterium <NA> -1.633
## 72853 Ruminococcaceae Faecalibacterium <NA> -1.543
## 194648 <NA> <NA> <NA> -1.165
## 322235 Bacteroidaceae Bacteroides Bacteroides uniformis -1.186
## 174809 Bacteroidaceae Bacteroides <NA> -1.363
## 186029 Coriobacteriaceae Collinsella Collinsella aerofaciens -1.312
## logCPM LR PValue FDR Kingdom Phylum
## 180285 15.71 100.56 1.151e-23 2.245e-21 Bacteria Firmicutes
## 72853 13.22 91.30 1.232e-21 1.201e-19 Bacteria Firmicutes
## 194648 11.86 71.30 3.073e-17 1.997e-15 Bacteria Firmicutes
## 322235 13.30 61.57 4.283e-15 2.088e-13 Bacteria Bacteroidetes
## 174809 15.03 59.30 1.351e-14 5.269e-13 Bacteria Bacteroidetes
## 186029 11.75 57.26 3.822e-14 1.242e-12 Bacteria Actinobacteria
## Class Order Family
## 180285 Clostridia Clostridiales Ruminococcaceae
## 72853 Clostridia Clostridiales Ruminococcaceae
## 194648 Clostridia Clostridiales <NA>
## 322235 Bacteroidia Bacteroidales Bacteroidaceae
## 174809 Bacteroidia Bacteroidales Bacteroidaceae
## 186029 Actinobacteria (class) Coriobacteriales Coriobacteriaceae
## Genus Species
## 180285 Faecalibacterium <NA>
## 72853 Faecalibacterium <NA>
## 194648 <NA> <NA>
## 322235 Bacteroides Bacteroides uniformis
## 174809 Bacteroides <NA>
## 186029 Collinsella Collinsella aerofaciens
sigtabgen = subset(sigtab, !is.na(Genus))
# Phylum order
x = tapply(sigtabgen$logFC, sigtabgen$Phylum, function(x) max(x))
x = sort(x, TRUE)
sigtabgen$Phylum = factor(as.character(sigtabgen$Phylum), levels = names(x))
# Genus order
x = tapply(sigtabgen$logFC, sigtabgen$Genus, function(x) max(x))
x = sort(x, TRUE)
sigtabgen$Genus = factor(as.character(sigtabgen$Genus), levels = names(x))
ggplot(sigtabgen, aes(x = Genus, y = logFC, color = Phylum)) + geom_point(size=6) +
theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0.5)) +
ggtitle("Log Fold Change of Significant OTUs in a Paired Test")
As a side note, there are many other interesting patient-sample covariates available in the sample_data
, which you can access with sample_data()
and get_variable()
.
sample_variables(kostic)
## [1] "X.SampleID" "BarcodeSequence"
## [3] "LinkerPrimerSequence" "NECROSIS_PERCENT"
## [5] "TARGET_SUBFRAGMENT" "ASSIGNED_FROM_GEO"
## [7] "EXPERIMENT_CENTER" "TITLE"
## [9] "RUN_PREFIX" "AGE"
## [11] "NORMAL_EQUIVALENT_PERCENT" "FIBROBLAST_AND_VESSEL_PERCENT"
## [13] "DEPTH" "TREATMENT"
## [15] "AGE_AT_DIAGNOSIS" "COMMON_NAME"
## [17] "HOST_COMMON_NAME" "BODY_SITE"
## [19] "ELEVATION" "REPORTS_RECEIVED"
## [21] "CEA" "PCR_PRIMERS"
## [23] "COLLECTION_DATE" "ALTITUDE"
## [25] "ENV_BIOME" "SEX"
## [27] "PLATFORM" "RACE"
## [29] "BSP_DIAGNOSIS" "STUDY_CENTER"
## [31] "COUNTRY" "CHEMOTHERAPY"
## [33] "YEAR_OF_DEATH" "ETHNICITY"
## [35] "ANONYMIZED_NAME" "TAXON_ID"
## [37] "SAMPLE_CENTER" "SAMP_SIZE"
## [39] "YEAR_OF_BIRTH" "ORIGINAL_DIAGNOSIS"
## [41] "AGE_UNIT" "STUDY_ID"
## [43] "EXPERIMENT_DESIGN_DESCRIPTION" "Description_duplicate"
## [45] "DIAGNOSIS" "BODY_HABITAT"
## [47] "SEQUENCING_METH" "RUN_DATE"
## [49] "HISTOLOGIC_GRADE" "LONGITUDE"
## [51] "ENV_MATTER" "TARGET_GENE"
## [53] "ENV_FEATURE" "KEY_SEQ"
## [55] "BODY_PRODUCT" "TUMOR_PERCENT"
## [57] "LIBRARY_CONSTRUCTION_PROTOCOL" "REGION"
## [59] "RUN_CENTER" "TUMOR_TYPE"
## [61] "BSP_NOTES" "RADIATION_THERAPY"
## [63] "INFLAMMATION_PERCENT" "HOST_SUBJECT_ID"
## [65] "PC3" "LATITUDE"
## [67] "OSH_DIAGNOSIS" "STAGE"
## [69] "PRIMARY_DISEASE" "HOST_TAXID"
## [71] "Description"