edgeR with phyloseq

Define the extension. Do not need to have anything loaded yet to do this.

################################################################################
#' Convert phyloseq OTU count data into DGEList for edgeR package
#' 
#' Further details.
#' 
#' @param physeq (Required).  A \code{\link{phyloseq-class}} or
#'  an \code{\link{otu_table-class}} object. 
#'  The latter is only appropriate if \code{group} argument is also a 
#'  vector or factor with length equal to \code{nsamples(physeq)}.
#'  
#' @param group (Required). A character vector or factor giving the experimental
#'  group/condition for each sample/library. Alternatively, you may provide
#'  the name of a sample variable. This name should be among the output of
#'  \code{sample_variables(physeq)}, in which case
#'  \code{get_variable(physeq, group)} would return either a character vector or factor.
#'  This is passed on to \code{\link[edgeR]{DGEList}},
#'  and you may find further details or examples in its documentation.
#'  
#' @param method (Optional). The label of the edgeR-implemented normalization to use.
#'  See \code{\link[edgeR]{calcNormFactors}} for supported options and details. 
#'  The default option is \code{"RLE"}, which is a scaling factor method 
#'  proposed by Anders and Huber (2010).
#'  At time of writing, the \link[edgeR]{edgeR} package supported 
#'  the following options to the \code{method} argument:
#'  
#'  \code{c("TMM", "RLE", "upperquartile", "none")}.
#'
#' @param ... Additional arguments passed on to \code{\link[edgeR]{DGEList}}
#' 
#' @examples
#' 
phyloseq_to_edgeR = function(physeq, group, method="RLE", ...){
  require("edgeR")
  require("phyloseq")
  # Enforce orientation.
  if( !taxa_are_rows(physeq) ){ physeq <- t(physeq) }
  x = as(otu_table(physeq), "matrix")
  # Add one to protect against overflow, log(0) issues.
  x = x + 1
  # Check `group` argument
  if( identical(all.equal(length(group), 1), TRUE) & nsamples(physeq) > 1 ){
    # Assume that group was a sample variable name (must be categorical)
    group = get_variable(physeq, group)
  }
  # Define gene annotations (`genes`) as tax_table
  taxonomy = tax_table(physeq, errorIfNULL=FALSE)
  if( !is.null(taxonomy) ){
    taxonomy = data.frame(as(taxonomy, "matrix"))
  } 
  # Now turn into a DGEList
  y = DGEList(counts=x, group=group, genes=taxonomy, remove.zeros = TRUE, ...)
  # Calculate the normalization factors
  z = calcNormFactors(y, method=method)
  # Check for division by zero inside `calcNormFactors`
  if( !all(is.finite(z$samples$norm.factors)) ){
    stop("Something wrong with edgeR::calcNormFactors on this data,
         non-finite $norm.factors, consider changing `method` argument")
  }
  # Estimate dispersions
  return(estimateTagwiseDisp(estimateCommonDisp(z)))
}
################################################################################

Citations

If you find this extension or tutorial useful in your work, please cite the following:

Differential Abundance for Microbiome Data

McMurdie and Holmes (2014) Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible PLoS Computational Biology 10(4):e1003531.

phyloseq

McMurdie and Holmes (2013) phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data. PLoS ONE 8(4):e61217

edgeR

Robinson MD, McCarthy DJ, Smyth GK (2009) edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics (Oxford, England) 26: 139–140

Load example data and try it out

date()

## [1] "Thu Jul 17 17:20:03 2014"

library("phyloseq"); packageVersion("phyloseq")

## [1] '1.9.11'

library("edgeR"); packageVersion("edgeR")

## [1] '3.6.7'

I use here the same example dataset as in the DESeq2 vignette, the publicly available data from a study on colorectal cancer:

Genomic analysis identifies association of Fusobacterium with colorectal carcinoma. Kostic, A. D., Gevers, D., Pedamallu, C. S., Michaud, M., Duke, F., Earl, A. M., et al. (2012). Genome research, 22(2), 292-298.

This work was published ahead of print in Genome Research alongside a highly-related article from a separate group of researchers (hooray for reproducible observations!): Fusobacterium nucleatum infection is prevalent in human colorectal carcinoma. In case you are interested. For the purposes of example, however, we will stick to the data from the former study, with data available at the microbio.me/qiime server.

Study ID: 1457

Project Name: Kostic_colorectal_cancer_fusobacterium

Study Abstract: The tumor microenvironment of colorectal carcinoma is a complex community of genomically altered cancer cells, nonneoplastic cells, and a diverse collection of microorganisms. Each of these components may contribute to carcino genesis; however, the role of the microbiota is the least well understood. We have characterized the composition of the microbiota in colorectal carcinoma using whole genome sequences from nine tumor/normal pairs. Fusobacterium sequences were enriched in carcinomas, confirmed by quantitative PCR and 16S rDNA sequence analysis of 95 carcinoma/normal DNA pairs, while the Bacteroidetes and Firmicutes phyla were depleted in tumors. Fusobacteria were also visualized within colorectal tumors using FISH. These findings reveal alterations in the colorectal cancer microbiota; however, the precise role of Fusobacteria in colorectal carcinoma pathogenesis requires further investigation.

filepath = system.file("extdata", "study_1457_split_library_seqs_and_mapping.zip", package = "phyloseq")
kostic = microbio_me_qiime(filepath)

## Warning: NAs introduced by coercion

## Found biom-format file, now parsing it... 
## Done parsing biom... 
## Importing Sample Metdadata from mapping file...
## Merging the imported objects... 
## Successfully merged, phyloseq-class created. 
##  Returning...

# Remove the samples for which the DIAGNOSIS was not included
kosticB = subset_samples(kostic, DIAGNOSIS != "None")

Independent filtering

Independent filtering is done automatically in DESeq2, but not in DESeq. Here we will filter OTUs for which the variance across all samples is very low, and we’ll do this before ever passing the data to DESeq.

kosticB

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 2505 taxa and 185 samples ]
## sample_data() Sample Data:       [ 185 samples by 71 sample variables ]
## tax_table()   Taxonomy Table:    [ 2505 taxa by 7 taxonomic ranks ]

kosticp = transform_sample_counts(kosticB, function(x){x/sum(x)})
hist(log10(apply(otu_table(kosticp), 1, var)),
     xlab="log10(variance)", breaks=50,
     main="A large fraction of OTUs have very low variance")

plot of chunk var-plot-ind-filter

varianceThreshold = 1e-5
keepOTUs = names(which(apply(otu_table(kosticp), 1, var) > varianceThreshold))
kosticB = prune_taxa(keepOTUs, kosticB)
kosticB

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 195 taxa and 185 samples ]
## sample_data() Sample Data:       [ 185 samples by 71 sample variables ]
## tax_table()   Taxonomy Table:    [ 195 taxa by 7 taxonomic ranks ]

Here we’ve used an arbitrary but not-unreasonable variance threshold of 10^-5. It is important to keep in mind that this filtering is independent of our downstream test. The sample classifications were not used.

Now let’s use our newly-defined function to convert the phyloseq data object kosticB into an edgeR “DGE” data object, called dge.

dge = phyloseq_to_edgeR(kosticB, group="DIAGNOSIS")
# Perform binary test
et = exactTest(dge)
# Extract values from test results
tt = topTags(et, n=nrow(dge$table), adjust.method="BH", sort.by="PValue")
res = tt@.Data[[1]]
alpha = 0.001
sigtab = res[(res$FDR < alpha), ]
sigtab = cbind(as(sigtab, "data.frame"), as(tax_table(kosticB)[rownames(sigtab), ], "matrix"))
dim(sigtab)

## [1] 63 18

head(sigtab)

##         Kingdom         Phylum                Class            Order
## 156697 Bacteria   Fusobacteria Fusobacteria (class)  Fusobacteriales
## 114860 Bacteria     Firmicutes              Bacilli  Lactobacillales
## 211205 Bacteria     Firmicutes           Clostridia    Clostridiales
## 2394   Bacteria  Bacteroidetes        Flavobacteria Flavobacteriales
## 38061  Bacteria   Fusobacteria Fusobacteria (class)  Fusobacteriales
## 49616  Bacteria Proteobacteria  Gammaproteobacteria   Pasteurellales
##                   Family          Genus                 Species  logFC
## 156697  Fusobacteriaceae   Leptotrichia Leptotrichia trevisanii  4.370
## 114860  Streptococcaceae  Streptococcus                    <NA>  3.830
## 211205   Lachnospiraceae           <NA>                    <NA> -3.599
## 2394   Flavobacteriaceae Capnocytophaga Capnocytophaga ochracea  3.148
## 38061   Fusobacteriaceae   Leptotrichia                    <NA>  3.554
## 49616    Pasteurellaceae    Haemophilus  Haemophilus influenzae  3.191
##        logCPM    PValue       FDR  Kingdom         Phylum
## 156697  12.89 3.214e-27 6.268e-25 Bacteria   Fusobacteria
## 114860  11.96 2.670e-22 2.604e-20 Bacteria     Firmicutes
## 211205  11.58 1.178e-21 7.659e-20 Bacteria     Firmicutes
## 2394    11.08 1.441e-20 7.023e-19 Bacteria  Bacteroidetes
## 38061   12.06 4.865e-20 1.897e-18 Bacteria   Fusobacteria
## 49616   11.55 9.065e-20 2.946e-18 Bacteria Proteobacteria
##                       Class            Order            Family
## 156697 Fusobacteria (class)  Fusobacteriales  Fusobacteriaceae
## 114860              Bacilli  Lactobacillales  Streptococcaceae
## 211205           Clostridia    Clostridiales   Lachnospiraceae
## 2394          Flavobacteria Flavobacteriales Flavobacteriaceae
## 38061  Fusobacteria (class)  Fusobacteriales  Fusobacteriaceae
## 49616   Gammaproteobacteria   Pasteurellales   Pasteurellaceae
##                 Genus                 Species
## 156697   Leptotrichia Leptotrichia trevisanii
## 114860  Streptococcus                    <NA>
## 211205           <NA>                    <NA>
## 2394   Capnocytophaga Capnocytophaga ochracea
## 38061    Leptotrichia                    <NA>
## 49616     Haemophilus  Haemophilus influenzae

Here is a bar plot showing the log2-fold-change, showing Genus and Phylum. Uses some ggplot2 commands.

library("ggplot2"); packageVersion("ggplot2")

## [1] '1.0.0'

theme_set(theme_bw())
scale_fill_discrete <- function(palname = "Set1", ...) {
    scale_fill_brewer(palette = palname, ...)
}
sigtabgen = subset(sigtab, !is.na(Genus))
# Phylum order
x = tapply(sigtabgen$logFC, sigtabgen$Phylum, function(x) max(x))
x = sort(x, TRUE)
sigtabgen$Phylum = factor(as.character(sigtabgen$Phylum), levels = names(x))
# Genus order
x = tapply(sigtabgen$logFC, sigtabgen$Genus, function(x) max(x))
x = sort(x, TRUE)
sigtabgen$Genus = factor(as.character(sigtabgen$Genus), levels = names(x))
ggplot(sigtabgen, aes(x = Genus, y = logFC, color = Phylum)) + geom_point(size=6) + 
  theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0.5))

plot of chunk unnamed-chunk-4

As expected from the original study abstract and title, Fusobacterium OTUs were most-significantly differentially abundant between the cancerous and healthy samples. If you look closely, two different genera of the Fusobacteria phylum were among the most significantly different, Leptotrichia (the winner) as well as Fusobacterium.

Paired tests

As mentioned above, the design of this experiment is 95 carcinoma/normal pairs, where each pair comes from the same patient. Although the previous tests are valid, they are conservative in that they do not use this extra information regarding the sample-pairs, and in that sense have forfeited extra power. There is support in edgeR for paired tests, and this is officially described in one of the edgeR user guides. It is also demonstrated here in the following.

Diagnosis = get_variable(kosticB, "DIAGNOSIS")
Patient = get_variable(kosticB, "ANONYMIZED_NAME")
# Notice that we have some patients without one of the pairs.
tapply(Patient, Diagnosis, length)

## Healthy   Tumor 
##      95      90

length(levels(Patient))

## [1] 97

any(tapply(Diagnosis, Patient, length) > 2)

## [1] FALSE

sum(tapply(Diagnosis, Patient, length) < 2)

## [1] 9

# Keep only patients with both healthy and cancer samples
keepPatients = names(which(tapply(Diagnosis, Patient, function(x){length(unique(x))}) == 2))
kosticBpair = subset_samples(kosticB, ANONYMIZED_NAME %in% keepPatients)
Diagnosis = get_variable(kosticBpair, "DIAGNOSIS")
Patient = get_variable(kosticBpair, "ANONYMIZED_NAME")
# With that accounting out of the way, define the design matrix
design = model.matrix(~ Patient + Diagnosis)

Must estimate the dispersions, as always. This is different than in the function shown above.

# Add one to protect against overflow, log(0) issues.
x = as(otu_table(kosticBpair), "matrix") + 1L
taxonomy = data.frame(as(tax_table(kosticBpair), "matrix"))
# Now turn into a DGEList
x = DGEList(counts=x, group=Diagnosis, genes=taxonomy, remove.zeros=TRUE)
# Calculate the normalization factors and estimate dispersion
x = calcNormFactors(x, method="RLE")
x = estimateGLMCommonDisp(x, design)
x = estimateGLMTrendedDisp(x, design)
x = estimateGLMTagwiseDisp(x, design)

As in the edgeR User’s Guide, we proceed to fit a linear model and test for the treatment effect. Note that we can omit the coefficient argument to glmLRT because the “treatment effect” (in this case the tissue diagnosis) is the last coeffcient in the model.

fit <- glmFit(x, design)
lrt <- glmLRT(fit)
topTags(lrt)

## Coefficient:  DiagnosisTumor 
##         Kingdom         Phylum                  Class            Order
## 180285 Bacteria     Firmicutes             Clostridia    Clostridiales
## 72853  Bacteria     Firmicutes             Clostridia    Clostridiales
## 194648 Bacteria     Firmicutes             Clostridia    Clostridiales
## 322235 Bacteria  Bacteroidetes            Bacteroidia    Bacteroidales
## 174809 Bacteria  Bacteroidetes            Bacteroidia    Bacteroidales
## 186029 Bacteria Actinobacteria Actinobacteria (class) Coriobacteriales
## 181056 Bacteria     Firmicutes             Clostridia    Clostridiales
## 157566 Bacteria  Bacteroidetes            Bacteroidia    Bacteroidales
## 248140 Bacteria  Bacteroidetes            Bacteroidia    Bacteroidales
## 261912 Bacteria     Firmicutes             Clostridia    Clostridiales
##                   Family            Genus                 Species  logFC
## 180285   Ruminococcaceae Faecalibacterium                    <NA> -1.633
## 72853    Ruminococcaceae Faecalibacterium                    <NA> -1.543
## 194648              <NA>             <NA>                    <NA> -1.165
## 322235    Bacteroidaceae      Bacteroides   Bacteroides uniformis -1.186
## 174809    Bacteroidaceae      Bacteroides                    <NA> -1.363
## 186029 Coriobacteriaceae      Collinsella Collinsella aerofaciens -1.312
## 181056   Ruminococcaceae Faecalibacterium                    <NA> -1.408
## 157566     Rikenellaceae        Alistipes                    <NA> -1.184
## 248140    Bacteroidaceae      Bacteroides      Bacteroides caccae -1.205
## 261912   Lachnospiraceae            Dorea   Dorea formicigenerans -1.019
##        logCPM     LR    PValue       FDR
## 180285  15.71 100.56 1.151e-23 2.245e-21
## 72853   13.22  91.30 1.232e-21 1.201e-19
## 194648  11.86  71.30 3.073e-17 1.997e-15
## 322235  13.30  61.57 4.283e-15 2.088e-13
## 174809  15.03  59.30 1.351e-14 5.269e-13
## 186029  11.75  57.26 3.822e-14 1.242e-12
## 181056  14.09  53.90 2.115e-13 5.890e-12
## 157566  13.71  51.04 9.073e-13 2.212e-11
## 248140  14.16  49.28 2.221e-12 4.381e-11
## 261912  12.58  49.26 2.247e-12 4.381e-11

This test detects OTUs that are differentially abundant in the tumor colon mucosa relative to the healthy colon mucosa (control), adjusting for baseline differences between the patients. This test can be viewed as a generalization of a paired t-test.

Re-make the plot.

respair = topTags(lrt, n=nrow(x), adjust.method="BH", sort.by="PValue")
respair = respair@.Data[[1]]
alpha = 0.001
sigtab = respair[(respair$FDR < alpha), ]
sigtab = cbind(as(sigtab, "data.frame"), as(tax_table(kosticB)[rownames(sigtab), ], "matrix"))
dim(sigtab)

## [1] 57 19

head(sigtab)

##         Kingdom         Phylum                  Class            Order
## 180285 Bacteria     Firmicutes             Clostridia    Clostridiales
## 72853  Bacteria     Firmicutes             Clostridia    Clostridiales
## 194648 Bacteria     Firmicutes             Clostridia    Clostridiales
## 322235 Bacteria  Bacteroidetes            Bacteroidia    Bacteroidales
## 174809 Bacteria  Bacteroidetes            Bacteroidia    Bacteroidales
## 186029 Bacteria Actinobacteria Actinobacteria (class) Coriobacteriales
##                   Family            Genus                 Species  logFC
## 180285   Ruminococcaceae Faecalibacterium                    <NA> -1.633
## 72853    Ruminococcaceae Faecalibacterium                    <NA> -1.543
## 194648              <NA>             <NA>                    <NA> -1.165
## 322235    Bacteroidaceae      Bacteroides   Bacteroides uniformis -1.186
## 174809    Bacteroidaceae      Bacteroides                    <NA> -1.363
## 186029 Coriobacteriaceae      Collinsella Collinsella aerofaciens -1.312
##        logCPM     LR    PValue       FDR  Kingdom         Phylum
## 180285  15.71 100.56 1.151e-23 2.245e-21 Bacteria     Firmicutes
## 72853   13.22  91.30 1.232e-21 1.201e-19 Bacteria     Firmicutes
## 194648  11.86  71.30 3.073e-17 1.997e-15 Bacteria     Firmicutes
## 322235  13.30  61.57 4.283e-15 2.088e-13 Bacteria  Bacteroidetes
## 174809  15.03  59.30 1.351e-14 5.269e-13 Bacteria  Bacteroidetes
## 186029  11.75  57.26 3.822e-14 1.242e-12 Bacteria Actinobacteria
##                         Class            Order            Family
## 180285             Clostridia    Clostridiales   Ruminococcaceae
## 72853              Clostridia    Clostridiales   Ruminococcaceae
## 194648             Clostridia    Clostridiales              <NA>
## 322235            Bacteroidia    Bacteroidales    Bacteroidaceae
## 174809            Bacteroidia    Bacteroidales    Bacteroidaceae
## 186029 Actinobacteria (class) Coriobacteriales Coriobacteriaceae
##                   Genus                 Species
## 180285 Faecalibacterium                    <NA>
## 72853  Faecalibacterium                    <NA>
## 194648             <NA>                    <NA>
## 322235      Bacteroides   Bacteroides uniformis
## 174809      Bacteroides                    <NA>
## 186029      Collinsella Collinsella aerofaciens

sigtabgen = subset(sigtab, !is.na(Genus))
# Phylum order
x = tapply(sigtabgen$logFC, sigtabgen$Phylum, function(x) max(x))
x = sort(x, TRUE)
sigtabgen$Phylum = factor(as.character(sigtabgen$Phylum), levels = names(x))
# Genus order
x = tapply(sigtabgen$logFC, sigtabgen$Genus, function(x) max(x))
x = sort(x, TRUE)
sigtabgen$Genus = factor(as.character(sigtabgen$Genus), levels = names(x))
ggplot(sigtabgen, aes(x = Genus, y = logFC, color = Phylum)) + geom_point(size=6) + 
  theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0.5)) +
  ggtitle("Log Fold Change of Significant OTUs in a Paired Test")

plot of chunk paired-ggplot

Other covariates available

As a side note, there are many other interesting patient-sample covariates available in the sample_data, which you can access with sample_data() and get_variable().

sample_variables(kostic)

##  [1] "X.SampleID"                    "BarcodeSequence"              
##  [3] "LinkerPrimerSequence"          "NECROSIS_PERCENT"             
##  [5] "TARGET_SUBFRAGMENT"            "ASSIGNED_FROM_GEO"            
##  [7] "EXPERIMENT_CENTER"             "TITLE"                        
##  [9] "RUN_PREFIX"                    "AGE"                          
## [11] "NORMAL_EQUIVALENT_PERCENT"     "FIBROBLAST_AND_VESSEL_PERCENT"
## [13] "DEPTH"                         "TREATMENT"                    
## [15] "AGE_AT_DIAGNOSIS"              "COMMON_NAME"                  
## [17] "HOST_COMMON_NAME"              "BODY_SITE"                    
## [19] "ELEVATION"                     "REPORTS_RECEIVED"             
## [21] "CEA"                           "PCR_PRIMERS"                  
## [23] "COLLECTION_DATE"               "ALTITUDE"                     
## [25] "ENV_BIOME"                     "SEX"                          
## [27] "PLATFORM"                      "RACE"                         
## [29] "BSP_DIAGNOSIS"                 "STUDY_CENTER"                 
## [31] "COUNTRY"                       "CHEMOTHERAPY"                 
## [33] "YEAR_OF_DEATH"                 "ETHNICITY"                    
## [35] "ANONYMIZED_NAME"               "TAXON_ID"                     
## [37] "SAMPLE_CENTER"                 "SAMP_SIZE"                    
## [39] "YEAR_OF_BIRTH"                 "ORIGINAL_DIAGNOSIS"           
## [41] "AGE_UNIT"                      "STUDY_ID"                     
## [43] "EXPERIMENT_DESIGN_DESCRIPTION" "Description_duplicate"        
## [45] "DIAGNOSIS"                     "BODY_HABITAT"                 
## [47] "SEQUENCING_METH"               "RUN_DATE"                     
## [49] "HISTOLOGIC_GRADE"              "LONGITUDE"                    
## [51] "ENV_MATTER"                    "TARGET_GENE"                  
## [53] "ENV_FEATURE"                   "KEY_SEQ"                      
## [55] "BODY_PRODUCT"                  "TUMOR_PERCENT"                
## [57] "LIBRARY_CONSTRUCTION_PROTOCOL" "REGION"                       
## [59] "RUN_CENTER"                    "TUMOR_TYPE"                   
## [61] "BSP_NOTES"                     "RADIATION_THERAPY"            
## [63] "INFLAMMATION_PERCENT"          "HOST_SUBJECT_ID"              
## [65] "PC3"                           "LATITUDE"                     
## [67] "OSH_DIAGNOSIS"                 "STAGE"                        
## [69] "PRIMARY_DISEASE"               "HOST_TAXID"                   
## [71] "Description"

edgeR with phyloseq

Citations

Differential Abundance for Microbiome Data

phyloseq

edgeR

Load example data and try it out

Independent filtering

Paired tests

Other covariates available

Other extensions for the phyloseq package:

DESeq

DESeq2

edgeR

extensions-index

index