Demo: phyloseq – A Bioconductor package for handling and analysis of high-throughput phylogenetic sequence data

Paul J. McMurdie and Susan Holmes

Statistics Department, Stanford University,

Stanford, CA 94305, USA

E-mail

mcmurdie@stanford.edu

susan@stat.stanford.edu

Websites

joey711.github.com/phyloseq

http://www-stat.stanford.edu/~susan/

Summary and Other Documentation Resources

This document supports a live demonstration of tools in the phyloseq package, and supplements other documentation resources available for the phyloseq package (e.g. wiki, vignettes, publications, function-level documentation, etc.). It is built automatically from its R-markdown source file with example code-chunks that can be reused if you have phyloseq installed. The R-markdown file and other related materials are publicly available, and hosted on GitHub with an explicit open-access license/ copyright statement:

Demo Materials Available Here

Other Package-Level Documentation

As mentioned, vignettes are included in phyloseq. A quick way to load them from within R is:

vignette("phyloseq_basics")
vignette("phyloseq_analysis")

The phyloseq Wiki

There is also a GitHub-hosted wiki for phyloseq. Some of these pages contain more detailed examples of the topics/tasks covered in this demonstration.

The phyloseq Issue Tracker

There is also a GitHub-hosted issue-tracker for phyloseq, currently describing over 100 feature requests, bug reports, documentation revisions, help requests, and other suggestions. Only a small fraction of these issues are still outstanding, but the descriptions remain available for anyone to view and add comments/code.

Installation

The phyloseq package is under active development. Users are encouraged to consistently update their version from the phyloseq development website on GitHub. The most stable releases and development versions of phyloseq are hosted by Bioconductor. For installing from Bioconductor, and alternatives to "the bleeding edge", as well as the most updated installation news and instructions, please see the installation wiki page.

Load phyloseq, and Import Data.

Of course we need to start this tutorial by loading the phyloseq package. This assumes you have already installed phyloseq.

library("phyloseq")

There is package-level documentation available. Note the following difference:

help("phyloseq-package")
help("phyloseq")

The latter loads instead the documentation for the constructor function named phyloseq()

Basic Data Import

Importing the Output from QIIME

otufile <- system.file("extdata", "GP_otu_table_rand_short.txt.gz", package = "phyloseq")
mapfile <- system.file("extdata", "master_map.txt", package = "phyloseq")
trefile <- system.file("extdata", "GP_tree_rand_short.newick.gz", package = "phyloseq")
qiimex <- import_qiime(otufile, mapfile, trefile, showProgress = FALSE)
print(qiimex)
## phyloseq-class experiment-level object
## OTU Table:          [500 taxa and 26 samples]
##                      taxa are rows
## Sample Data:         [26 samples by 7 sample variables]:
## Taxonomy Table:     [500 taxa by 7 taxonomic ranks]:
## Phylogenetic Tree:  [500 tips and 499 internal nodes]
##                      rooted

Importing the Output from mothur

Example: Manually re-import the "esophagus dataset", which is already inclduded in the phyloseq package. Note that the esophagus dataset is a simple dataset consisting of just 3 samples and a relatively small richness, described with just a tree and OTU table. It is most useful for showing quick examples, such as computing the UniFrac distance, but for not much else. If interested, further details can be found by entering ?esophagus.

mothlist <- system.file("extdata", "esophagus.fn.list.gz", package = "phyloseq")
mothgroup <- system.file("extdata", "esophagus.good.groups.gz", package = "phyloseq")
mothtree <- system.file("extdata", "esophagus.tree.gz", package = "phyloseq")
cutoff <- "0.10"
esophman <- import_mothur(mothlist, mothgroup, mothtree, cutoff)
print(esophman)
## phyloseq-class experiment-level object
## OTU Table:          [58 taxa and 3 samples]
##                      taxa are rows
## Phylogenetic Tree:  [58 tips and 57 internal nodes]
##                      rooted
ntaxa(esophman)
## [1] 58

Let's test if they are identical as expected

data(esophagus)
identical(esophagus, esophman)
## [1] TRUE

Importing biom-format files

The biom-format is intended to be a complete representation of the OTU-clustering results, and so import can be performed with just one connection/file path.

rich_sparse_biom <- system.file("extdata", "rich_sparse_otu_table.biom", package = "phyloseq")
rich_sparse <- import_biom(rich_sparse_biom, taxaPrefix = "greengenes")
print(rich_sparse)
## phyloseq-class experiment-level object
## OTU Table:          [5 taxa and 6 samples]
##                      taxa are rows
## Sample Data:         [6 samples by 4 sample variables]:
## Taxonomy Table:     [5 taxa by 7 taxonomic ranks]:

Note: the current biom-format definition lacks an phylogenetic tree. I am working with the biom-format team on including a phylogenetic tree in the next version of the format, and have contributed a preliminary R package for biom-format I/O support as a pull-request to the biom-format development page on GitHub.

The biom-format definition allows for both sparse and dense representations of the abundance data, and is also flexible enough to allow a "minimal" (abundance table onle) and "rich" forms (includes sample and taxonomy data). All of these forms are supported and automatically recognized/interpreted in phyloseq through the import_biom function.

# Define file path to all four format combinations
rich_dense_biom <- system.file("extdata", "rich_dense_otu_table.biom", package = "phyloseq")
rich_sparse_biom <- system.file("extdata", "rich_sparse_otu_table.biom", package = "phyloseq")
min_dense_biom <- system.file("extdata", "min_dense_otu_table.biom", package = "phyloseq")
min_sparse_biom <- system.file("extdata", "min_sparse_otu_table.biom", package = "phyloseq")
# Each import is only one line, we will import four different example
# 'datasets'
rich_dense <- import_biom(rich_dense_biom, taxaPrefix = "greengenes")
rich_sparse <- import_biom(rich_sparse_biom, taxaPrefix = "greengenes")
min_dense <- import_biom(min_dense_biom, taxaPrefix = "greengenes")
min_sparse <- import_biom(min_sparse_biom, taxaPrefix = "greengenes")
# print summary if phyloseq, the component class otherwise.
biom_ex_print <- function(i) {
    if (class(i) != "phyloseq") {
        class(i)
    } else {
        i
    }
}
sapply(list(rich_dense, rich_sparse, min_dense, min_sparse), biom_ex_print)
## [[1]]
## phyloseq-class experiment-level object
## OTU Table:          [5 taxa and 6 samples]
##                      taxa are rows
## Sample Data:         [6 samples by 4 sample variables]:
## Taxonomy Table:     [5 taxa by 7 taxonomic ranks]:
## 
## [[2]]
## phyloseq-class experiment-level object
## OTU Table:          [5 taxa and 6 samples]
##                      taxa are rows
## Sample Data:         [6 samples by 4 sample variables]:
## Taxonomy Table:     [5 taxa by 7 taxonomic ranks]:
## 
## [[3]]
## [1] "otu_table"
## 
## [[4]]
## [1] "otu_table"
## 

A More Complicated Import Example

One of the example datasets included in the phyloseq package is derived from the study first describing human microbiome "Enterotypes", and that dataset is called simply enterotype. It will be called in later examples using the data command.

A more recent study investigating human microbiome "Enterotypes" is titled Linking Long-Term Dietary Patterns with Gut Microbial Enterotypes by Wu et al., Science, 334 (6052), 105–108. One of the three corresponding authors has the last name "Bushman", which also happens to be the title of the QIIME-processed version of this dataset at the microbio.me/qiime database.

We will import this data to illustrate a more complicated situation in which we need to import 3 different data components from two different file types (one is biom-format, the other is sample data contained in a tab-delimited "Mapping File" format produced by QIIME.

For convenience and stability, these "Bushman" data files were saved locally and imported from their local system location. An even more complicated "direct import" example is provided in the last section ("Extra Example") below, but produces the same result and is not run by the embedded code.

Import Bushman data, already downloaded.

uzdir <- "/Volumes/media/Research/study_1011_split_library_seqs_and_mapping/"
biom_file <- paste(uzdir, "study_1011_closed_reference_otu_table.biom", sep = "")
map_file <- paste(uzdir, "study_1011_mapping_file.txt", sep = "")
# Now import the .biom-formatted otu_table-tax_table file.
biom_otu_tax <- import_biom(biom_file, "greengenes")
# Add sample data to the dataset using merge
bmsd <- import_qiime_sample_data(map_file)
class(bmsd)
## [1] "sample_data"
## attr(,"package")
## [1] "phyloseq"
dim(bmsd)
## [1] 102 225
biom_otu_tax
## phyloseq-class experiment-level object
## OTU Table:          [1873 taxa and 100 samples]
##                      taxa are rows
## Taxonomy Table:     [1873 taxa by 7 taxonomic ranks]:

Merging datasets or components

We need to merge these two separate Bushman dataset objects into one "phyloseq" object. Presently, the two data objects contain the otu_table, taxonomyTable, and sample_data components, respectively. If we had three objects that were all components (think single tables, or a tree), then we would use the constructor function, phyloseq. However, because the .biom file contained two tables (including an otu_table), the import_biom function returned a valid "phyloseq-class" instance instead that contained both components. Whenever you need to add or merge data componentes from one (or more) phyloseq-class objects, the merging function, merge_phyloseq, is recommended, rather than the constructor (phyloseq).

Bushman <- merge_phyloseq(biom_otu_tax, bmsd)

Extra Example: the Human Microbiome Project

Import the HMP-v35 Dataset

This is an example importing into phyloseq the files produced by Qiime after being run on the Human Microbiome Project's v35 dataset, which is avilable from HMP-DACC.

This takes about 35 minutes on a laptop, and we are providing the resulting phyloseq-formatted result as an .RData file, so that you do not have to repeat the process. See the wiki page devoted to importing the HMPv35 dataset into phyloseq

Extra Example: Direct ftp Download, Unzip, and Import

The .biom and sample data files are also provided online (ftp), and a useful way to download and import into phyloseq directly from the ftp address in the following example code. This is an example in which we download a zip file with both biom- and qiime-formatted data, unzip it in a temporary directory from with in R, import the relavant files using phyloseq importers, and then delete the temporary files. This code should be platform independent, but occasionally there are finicky Windows issues that arise.

(Note: this is not actually run in this demo. Would be redundant, and occasionally Windows issues might crash it, based on experience.)

# zipftp <-
# 'ftp://thebeast.colorado.edu/pub/QIIME_DB_Public_Studies/study_1011_split_library_seqs_and_mapping.zip'
# # First create a temporary directory in which to store the unpacked
# file(s) from the .zip tmpdir <- tempdir() # Second create a temp file
# where you will put the .zip-file itself temp <- tempfile() # Now
# download the file and unzip to tmpdir directory download.file(zipftp,
# temp) unzip(temp, exdir = tmpdir ) # Define the biom file-path biom_file
# <- file.path(tmpdir, list.files(tmpdir, pattern='.biom')) # Define the
# mapping file-path map_file <- file.path(tmpdir, list.files(tmpdir,
# pattern='mapping')) # Now import the .biom-formatted
# otu_table/taxonomyTable file.  biom_otu_tax <- import_biom(biom_file,
# 'greengenes') # Add sample data to the dataset using merge bmsd <-
# import_qiime_sample_data(map_file) # Remove the temperorary file and
# directory where you unpacked the zip files unlink(temp) unlink(tmpdir)

Basic Interaction with phyloseq Data

Let's look at some basic print and accessor functions/methods provided in phyloseq.

Bushman
## phyloseq-class experiment-level object
## OTU Table:          [1873 taxa and 100 samples]
##                      taxa are rows
## Sample Data:         [100 samples by 225 sample variables]:
## Taxonomy Table:     [1873 taxa by 7 taxonomic ranks]:

Convenience accessors

ntaxa(Bushman)
## [1] 1873
nsamples(Bushman)
## [1] 100
sample_names(Bushman)[1:10]
##  [1] "C.3075.01.S1.405156" "C.3068.01.S1.405137" "C.3003.01.P1.405142"
##  [4] "C.4007.01.P1.405187" "C.4001.01.P1.405188" "C.3067.01.S1.405183"
##  [7] "C.3043.01.S1.405217" "C.3006.01.P1.405164" "C.3078.01.S1.405181"
## [10] "C.3086.01.S1.405130"
taxa_names(Bushman)[1:10]
##  [1] "248563" "110059" "223351" "358030" "367581" "16076"  "313844"
##  [8] "49837"  "517282" "296166"

Interacting with the sample variables

This is useful later in plotting

sample_variables(Bushman)[1:10]
##  [1] "X.SampleID"                  "BarcodeSequence"            
##  [3] "LinkerPrimerSequence"        "ASSIGNED_FROM_GEO"          
##  [5] "ASPARTAME_MG_AVE"            "TOT_CONJUGLINOLEICA_G_AVE"  
##  [7] "AGE"                         "VITC_ASCORBIC_ACID_MG_AVE"  
##  [9] "OXALIC_ACID_MG_AVE"          "PUFA_EICOSAPENTAENOIC_G_AVE"
length(sample_variables(Bushman))
## [1] 225

How can we look at the values for a particular variable? (Here I've arbitrarily chose the name of the 5th sample variable)

get_variable(Bushman, sample_variables(Bushman)[5])
##   [1] 267.759   0.000   0.000   0.000   0.000   0.000   0.000   0.000
##   [9]  67.961   1.167 253.080   0.000   0.000  10.263   0.000 472.708
##  [17]  60.739   0.000 178.793 282.437   0.000  94.628   0.000   0.000
##  [25]  62.989   0.000   0.000  16.990   0.000   0.000   0.000   0.000
##  [33] 113.269  60.739   0.000  33.981   0.000   0.000   0.000   0.000
##  [41]   0.000 107.208   0.000   0.000   0.000   0.000   0.000  58.875
##  [49]   0.000   0.000   0.000 257.124   0.000   1.771   0.000   0.000
##  [57]   0.000   0.000   0.000   0.000   0.000   0.000 217.392   0.000
##  [65]   0.000  31.671   0.000 161.971   0.000   0.000   0.000   0.000
##  [73]  33.981   0.000  60.739   0.000   0.000   0.000 176.812  51.820
##  [81]   0.000 110.667 157.867   0.000  34.100   0.000   0.000   0.000
##  [89]   1.167   0.000 830.102   0.000  60.739   0.000   0.000   0.000
##  [97]   0.000   0.000   0.000  70.496

Interacting with the taxonomic ranks

This is useful later in plotting

rank_names(Bushman)
## [1] "Kingdom" "Phylum"  "Class"   "Order"   "Family"  "Genus"   "Species"
get_taxa_unique(Bushman, "Phylum")
##  [1] "Firmicutes"      "Bacteroidetes"   "Tenericutes"    
##  [4] "Proteobacteria"  "Cyanobacteria"   "Actinobacteria" 
##  [7] "Lentisphaerae"   "Fusobacteria"    "TM7"            
## [10] "Verrucomicrobia" "Synergistetes"  

What if we wanted to change the rank_names returned in the previous chunk? Since we know (in this case) that the "greengenes" taxonomy tools were used for the assignment we actually then know the names of the seven taxonomic ranks used. Just assign these to the colnames of the Bushman taxonomy table.

colnames(tax_table(Bushman)) <- c(k = "Kingdom", p = "Phylum", c = "Class", 
    o = "Order", f = "Family", g = "Genus", s = "Species")

Abundance Accessors

The purposes of the sample_sums and taxa_sums function are pretty straightforward, but the get_sample and get_taxa functions can bet a bit confusing.

sample_sums(Bushman)[1:10]
## C.3075.01.S1.405156 C.3068.01.S1.405137 C.3003.01.P1.405142 
##               13919               10488                6776 
## C.4007.01.P1.405187 C.4001.01.P1.405188 C.3067.01.S1.405183 
##                2098                3911               11690 
## C.3043.01.S1.405217 C.3006.01.P1.405164 C.3078.01.S1.405181 
##               11825                6748               14934 
## C.3086.01.S1.405130 
##               13554 
taxa_sums(Bushman)[1:10]
## 248563 110059 223351 358030 367581  16076 313844  49837 517282 296166 
##     53     71    692     90  15559   1219      3     18      4      1 
get_taxa(Bushman, sample_names(Bushman)[5])[1:10]
## 248563 110059 223351 358030 367581  16076 313844  49837 517282 296166 
##      0      0      0      0      0      0      0      0      0      0 
get_sample(Bushman, taxa_names(Bushman)[5])[1:10]
## C.3075.01.S1.405156 C.3068.01.S1.405137 C.3003.01.P1.405142 
##                 100                 337                 104 
## C.4007.01.P1.405187 C.4001.01.P1.405188 C.3067.01.S1.405183 
##                  64                   0                 374 
## C.3043.01.S1.405217 C.3006.01.P1.405164 C.3078.01.S1.405181 
##                   0                   0                 123 
## C.3086.01.S1.405130 
##                   9 

Note how a sample name is required by get_taxa, and vice versa. This might seem confusing at first, but get_taxa is returning all the OTU abundances from one sample, while get_sample is returning the abundances from all samples for one OTU.

Simple Summary Graphics

Load additional graphics-related packages

library("ggplot2")
library("scales")
library("grid")

Some examples for plotting richness estimates from un-trimmed data.

plot_richness_estimates(Bushman)
plot of chunk plot-richness-1

plot of chunk plot-richness-1

(p <- plot_richness_estimates(Bushman, x = "SEX"))
plot of chunk plot-richness-1

plot of chunk plot-richness-1

p + geom_boxplot(data = p$data, aes(x = SEX, y = value, color = NULL), alpha = 0.1)
plot of chunk plot-richness-1

plot of chunk plot-richness-1

Some others you might try (not run in demo)

plot_richness_estimates(Bushman, x = "INSOLUBLE_DIETARY_FIBER_G_AVE")
plot_richness_estimates(Bushman, x = "AGE_IN_YEARS")
plot_richness_estimates(Bushman, x = "VEGETABLE_PROTEIN_G_AVE")

Plotting an Annotated Phylogenetic Tree

A useful display on a phylogenetic tree is to add points next to tips/leaves/OTUs to represent samples in which the OTU was observed. This is facilitated in the phyloseq package through the plot_tree function, which produces a ggplot-based phylogenetic tree, and also allows several options for mapping color, shape, and size of these sample points to variables in the dataset. These point aesthetics can be mapped to sample data or to taxonomic data, depending on needs and which information needs to be reinforced in your graphic.

Caution: Trying to plot too many taxa (tree tips) at once obscures meaning. Let's look at just the Chlamydiae phylum in the incldued GlobalPatterns dataset. Note that this also requires subsetting the GlobalPatterns dataset using the subset_taxa function, part of the "preprocessing" tools described in the following section.

data(GlobalPatterns)
GlobalPatterns
## phyloseq-class experiment-level object
## OTU Table:          [19216 taxa and 26 samples]
##                      taxa are rows
## Sample Data:         [26 samples by 7 sample variables]:
## Taxonomy Table:     [19216 taxa by 7 taxonomic ranks]:
## Phylogenetic Tree:  [19216 tips and 19215 internal nodes]
##                      rooted
GP.chl <- subset_taxa(GlobalPatterns, Phylum == "Chlamydiae")

Map the sample environment ("SampleType") to point color, and taxonomic family to point shape. Additionally, label the tips with the Genus name and scale the point size by abundance

plot_tree(GP.chl, color = "SampleType", shape = "Family", label.tips = "Genus", 
    size = "abundance")
plot of chunk plot_tree-chlam

plot of chunk plot_tree-chlam

This figure contains the graphic produced by the plot_tree function in phyloseq. In this case the data is a subset of the GlobalPatterns dataset in which only OTUs from the phylum Chlamydiae are included. Additionally, the tree has been annotated with genus labels at each tree tip. The points next to tips represent samples in which the OTU was observed, and are shaped according to taxonomic rank of Family, and shaded according to the sample type (sample source environment).

plot_tree(GP.chl, "treeonly")
plot of chunk unnamed-chunk-11

plot of chunk unnamed-chunk-11

This figure is the result of plotting the "bare" tree with no options directing the mapping of a variable to the tree, and "treeonly" as the argument to method. Not as informative as the previous tree.

Abundance bar plots

For direct quantitative observation/comparison of abundances.

data(enterotype)
TopNOTUs <- names(sort(taxa_sums(enterotype), TRUE)[1:10])
ent10 <- prune_taxa(TopNOTUs, enterotype)
plot_taxa_bar(ent10, "Genus", x = "SeqTech", fill = "TaxaGroup")
plot of chunk plot_taxa_bar-ex1

plot of chunk plot_taxa_bar-ex1

plot_taxa_bar(ent10, "Genus", x = "SeqTech", fill = "TaxaGroup") + facet_wrap(~Enterotype)
plot of chunk unnamed-chunk-12

plot of chunk unnamed-chunk-12

This one takes a little while to calculate (not run).

plot_taxa_bar(Bushman, "Phylum", NULL, 0.9, "SEX", "INSOLUBLE_DIETARY_FIBER_G_AVE")

Preprocessing Abundance Data

This section includes examples preprocessing (filtering, trimming, subsetting, etc) phyloseq data. Let's start by resetting the GlobalPatterns example data and adding a human category.

data(GlobalPatterns)
GlobalPatterns
## phyloseq-class experiment-level object
## OTU Table:          [19216 taxa and 26 samples]
##                      taxa are rows
## Sample Data:         [26 samples by 7 sample variables]:
## Taxonomy Table:     [19216 taxa by 7 taxonomic ranks]:
## Phylogenetic Tree:  [19216 tips and 19215 internal nodes]
##                      rooted
# prune OTUs that are not present in at least one sample
GP <- prune_taxa(taxa_sums(GlobalPatterns) > 0, GlobalPatterns)
# Define a human-associated versus non-human categorical variable:
sample_data(GP)$human <- factor(get_variable(GP, "SampleType") %in% c("Feces", 
    "Mock", "Skin", "Tongue"))

"Rarefy" Abundances to Even Depth

Although of perhaps dubious necessity, it is common for OTU abundance tables to be randomly subsampled to even sequencing depth prior various analyses, especially UniFrac. Here is an example comparing the UniFrac-PCoA results with and without "rarefying" the abundances (Requires phyloseq v1.1.28+)

Test with esophagus dataset

data(esophagus)
eso <- rarefy_even_depth(esophagus)
# plot(as(otu_table(eso), 'vector'), as(otu_table(esophagus), 'vector'))
UniFrac(eso)
##        B      C
## C 0.5562       
## D 0.5177 0.6149
UniFrac(esophagus)
##        B      C
## C 0.5176       
## D 0.5182 0.5422

Test with GlobalPatterns dataset

data(GlobalPatterns)
GP.chl <- subset_taxa(GlobalPatterns, Phylum == "Chlamydiae")
# remove the samples that have less than 20 total reads from Chlamydiae
GP.chl <- prune_samples(names(which(sample_sums(GP.chl) >= 20)), GP.chl)
# (p <- plot_tree(GP.chl, color='SampleType', shape='Family',
# label.tips='Genus', size='abundance'))
GP.chl.r <- rarefy_even_depth(GP.chl)
# plot(as(otu_table(GP.chl.r), 'vector'), as(otu_table(GP.chl), 'vector'))

To compare MDS of unweighted-UniFrac for GP.chl and GP.chl.r (default distance is unweighted UniFrac)

plot_ordination(GP.chl, ordinate(GP.chl, "MDS"), color = "SampleType") + geom_point(size = 5)
plot of chunk unnamed-chunk-18

plot of chunk unnamed-chunk-18

plot_ordination(GP.chl.r, ordinate(GP.chl.r, "MDS"), color = "SampleType") + 
    geom_point(size = 5)
plot of chunk unnamed-chunk-19

plot of chunk unnamed-chunk-19

How does rarefying affect the larger untrimmed dataset? (not run)

GP.r <- rarefy_even_depth(GP)
plot_ordination(GP, ordinate(GP), color = "SampleType") + geom_point(size = 5)
plot_ordination(GP.r, ordinate(GP.r), color = "SampleType") + geom_point(size = 5)

prune_taxa() vs. subset_taxa()

These are two very different methods for subsetting OTUs in a dataset.

topN <- 20
most_abundant_taxa <- sort(taxa_sums(GP), TRUE)[1:topN]
print(most_abundant_taxa)
##  549656  331820  279599  360229  317182   94166  158660  329744  550960 
## 2481214 1001492  927850  556441  528629  444142  443938  436539  384205 
##  189047  326977  317658  244304  171551  263681   98605   12812  536311 
##  312161  291577  234024  222365  215470  212089  203583  196550  196355 
##  192573  298875 
##  179312  177879 
GP20 <- prune_taxa(names(most_abundant_taxa), GP)
# Exploratory tree #1 (Note, too ma) plot_tree(ex2, color='SampleType',
# label.tips='Family', size='abundance')
ntaxa(GP20)
## [1] 20
length(get_taxa_unique(GP20, "Phylum"))
## [1] 5

That was prune_taxa, applicable when we know (or can provide) the OTU IDs of the OTUs we want to retain in the dataset, or a logical vector with the same length as ntaxa reseulting from a test (useful for genefilter or`genefilter_sample results (see next section)).

Alternatively, you can subset based on taxonomy expressions, using subset_taxa.

GP.chl <- subset_taxa(GP, Phylum == "Chlamydiae")
# Exploratory tree #2 plot_tree(GP.chl, color='SampleType',
# shape='Family', label.tips='Genus', size='abundance')
ntaxa(GP.chl)
## [1] 21
length(get_taxa_unique(GP.chl, "Phylum"))
## [1] 1

filterfun_sample() and genefilter_sample()

This tool takes after the genefilter function from the genefilter package, but emphasizes within-microbiome conditions. The following code illustrates . The function topp is a filter function that returns the most abundant p fraction of taxa. The filterfun_sample function takes one or or more functions like topp and binds them in order to define a filtering protocol function, in this case called: f1. This function, f1, is then passed to genefilter_sample along with the dataset that is going to be pruned as well as a value for A, the number of samples in which an OTU must pass the filtering conditions.

topp(0.1)
## function (x) 
## {
##     if (na.rm) {
##         x = x[!is.na(x)]
##     }
##     x >= sort(x, decreasing = TRUE)[ceiling(length(x) * p)]
## }
## <environment: 0x109ab9d20>
f1 <- filterfun_sample(topp(0.1))
print(f1)
## function (x) 
## {
##     fun = flist[[1]]
##     fval = fun(x)
##     for (fun in flist[-1]) {
##         fval = fval & fun(x)
##     }
##     return(fval)
## }
## <environment: 0x109250808>
## attr(,"class")
## [1] "filterfun"
wh1 <- genefilter_sample(GP, f1, A = round(0.5 * nsamples(GP)))
sum(wh1)
## [1] 793
ex2 <- prune_taxa(wh1, GP)
print(GP)
## phyloseq-class experiment-level object
## OTU Table:          [18988 taxa and 26 samples]
##                      taxa are rows
## Sample Data:         [26 samples by 8 sample variables]:
## Taxonomy Table:     [18988 taxa by 7 taxonomic ranks]:
## Phylogenetic Tree:  [18988 tips and 18987 internal nodes]
##                      rooted
print(ex2)
## phyloseq-class experiment-level object
## OTU Table:          [793 taxa and 26 samples]
##                      taxa are rows
## Sample Data:         [26 samples by 8 sample variables]:
## Taxonomy Table:     [793 taxa by 7 taxonomic ranks]:
## Phylogenetic Tree:  [793 tips and 792 internal nodes]
##                      rooted

Filtering low-variance OTUs

Suppose we wanted to use the variance of OTUs across samples as a condition for filtering. For example, to remove OTUs that do not change much across all (or most) samples. Note that the value of the variance is highly-dependent on the sequencing effort of each sample (the total number of reads sequenced from a particular sample). Thus, we must first transform the sample counts to relative abundance, which is shown in more detail in the next section. The following code will create a version of the GP dataset in which the abundance values have been transformed to relative abundance within each sample, and then OTUs have been filtered to keep only those with variance greater than 0.00001 (assuming we wanted to pick an arbitrary threshold in this way).

GPr = transform_sample_counts(GP, function(x) x/sum(x))
GPf = filter_taxa(GPr, function(x) var(x) > 1e-05, TRUE)

Transformations

Useful for: Standardization / Normalization / Smoothing / Shrinking. Second-order function: transform_sample_counts. Also the threshrank and threshrankfun functions.

For more advanced normalization features (like shrinkage, etc.), also consider features in the edgeR package, the DESeq package, and for standardization the decostand function in the vegan-package; as well as probably many others that could be useful in this context.

Graphics for Inference and Exploration

In the following section(s) we will illustrate using graphical tools provided by the phyloseq package. These are meant to be flexible ways to explore and summarize the data.

For further details, there is a collection of "show and tell" wiki-pages describing the available graphics options supported by the phyloseq-package. These include example code for reproducing the figures shown. Many of the default settings are modifiable within the function arguments directly, and virtually everything about these plots can be further modified via the layers interface of ggplot2.

For quick reference (even though some have been described already), the key graphics-producing functions in phyloseq are:

plot_heatmap

plot_tree

plot_ordination

plot_network

plot_richness_estimates

plot_taxa_bar

Let's (re)load the Global Patterns dataset, prune the empty taxa, and add a custom sample variable called "human", a logical indicating whether or not the samples are human-associated.

data(GlobalPatterns)  # Reload GlobalPatterns
GP <- prune_taxa(taxa_sums(GlobalPatterns) > 0, GlobalPatterns)
sample_data(GP)$human <- factor(get_variable(GP, "SampleType") %in% c("Feces", 
    "Mock", "Skin", "Tongue"))

We are going to show some examples that would take a lot of time to calculate and render on a dataset as large as GlobalPatterns. For the sake of time in re-running these examples, let's subset the data further to the most abundant 5 phyla.

top5ph <- sort(tapply(taxa_sums(GP), tax_table(GP)[, "Phylum"], sum), TRUE)[1:5]
GP1 <- subset_taxa(GP, Phylum %in% names(top5ph))
## Error: object 'top5ph' not found

Distance Functions

Before we get to network plots, let's discuss distances. Many tools use distances to perform their calculations. In phyloseq, ordinations, heatmaps, and network plots all use the distance function for calculating OTU or Sample distance matrices (actually represented as a "dist" object) when needed, particularly when PCoA/MDS or NMDS is involved.

help("distance")
`?`(distance)

A relatively recent, popular distance method that relies heavily on the phylogenetic tree is UniFrac. It has been implemented in phyloseq as a fast parallel function, also wrapped by the distance function. See the phyloseq wiki page describing Fast Parallel UniFrac for further details, citations, and performance results.

data(esophagus)
distance(esophagus)  # unweighted UniFrac
##        B      C
## C 0.5176       
## D 0.5182 0.5422
distance(esophagus, weighted = TRUE)  # weighted UniFrac
##        B      C
## C 0.2035       
## D 0.2603 0.2477

Here are some other examples. There are some 45 or so methods.

distance(esophagus, "jaccard")  # vegdist jaccard
distance(esophagus, "bray")  # vegdist bray-curtis
distance(esophagus, "gower")
distance(esophagus, "g")
distance(esophagus, "minkowski")  # invokes a method from the base dist() function.
distance(esophagus, "(A+B-2*J)/(A+B)")  # designdist custom distance
distance("help")
distance("list")

The plot_network function

The following code illustrates using the make_network and plot_network functions in phyloseq. In this context, we are using networks to graphically represent thresholded distances between samples or OTUs. The euclidean distances between points on the plot are essentially arbitrary, only the "edges" (lines) between "nodes" (OTUs/samples) are derived directly from the data. For further examples, it is recommended that you take a look at the plot_network wiki page

The GP variable is an only-slightly-modified version of the GlobalPatterns dataset. The threshold was determined empirically to show something interesting for demonstration. In practice, this value has a huge effect on the resulting network, and its usefulness, and it is highly recommended that you investigate the results from multiple values.

ig <- make_network(GP, type = "samples", distance = "bray", max.dist = 0.85)
plot_network(ig, GP, color = "SampleType", shape = "human", line_weight = 0.4, 
    label = NULL)
plot of chunk plotnetwork-GPsamples

plot of chunk plotnetwork-GPsamples

A similar network representation of samples from the "Enterotypes" dataset.

data(enterotype)
ig <- make_network(enterotype, max.dist = 0.3)
plot_network(ig, enterotype, color = "SeqTech", shape = "Enterotype", line_weight = 0.4, 
    label = NULL)
## Warning: Removed 5 rows containing missing values (geom_point).
plot of chunk plotnetwork-entsamples

plot of chunk plotnetwork-entsamples

An example showing a network representation of OTUs, representing communities of bacteria that occurr in similar profiles of samples.

data(GlobalPatterns)
# prune to just the top 100 most abundant OTUs across all samples (crude).
GP100 <- prune_taxa(names(sort(taxa_sums(GlobalPatterns), TRUE))[1:100], GlobalPatterns)
jg <- make_network(GP100, "taxa", "jaccard", 0.3)
plot_network(jg, GP100, "taxa", color = "Phylum", line_weight = 0.4, label = NULL)
plot of chunk plotnetwork-otus

plot of chunk plotnetwork-otus

Ordination Methods

"Ordination methods" in this context refers to methods for dimensional reduction of data -- usually the OTU abundance data, which is probably a large sparse matrix not so amenable to graphical display on its own. Graphically investigating the (usually) information-dense first few axes of an ordination result can be very useful for exploring and summarizing phylogenetic sequencing data. One of the resons for this is that many ordination methods are non-parametric, so they do not depend upon a prior hypothesis or model. This is essential for many microbiome investigations in which a model is only vaguely described or not available.

A good quick summary of ordination methods is provided in the introductory vignette for the vegan package:

vegan introductory vignette

The following R task views are also useful for understanding ordination tools in R:

Analysis of Ecological and Environmental Data

Multivariate Statistics

The ade4 package also provides a large number of ordination methods, and may be useful in your analysis.

The ordinate function

This function wraps several commonly-used ordination methods for OTU abundance tables (as well as related tables, in some cases). The type of ordination performed depends upon the argument to method. Try ordinate("help") or ordinate("list") for the currently supported method options.

The output of this function will be an ordination class. The specific class depends upon the ordination method used, as well as the underlying function/package that is called internally by phyloseq to perform it. As a general rule, any of the ordination classes returned by this function, ordinate, will be recognized by downstream tools in the phyloseq package -- for example the ordination plotting function, plot_ordination (See next section for plot examples).

Using GP100 from the previous section, let's calculate the unweighted-UniFrac distance for each sample pair in the dataset, and then perform Multidimensional Scaling (aka Principle Coordinates Analysis) on the resulting distance. For details about calculating the UniFrac distance on larger datasets using parallel-computing options in supported by phyloseq, see the wiki page on Fast Parallel UniFrac in phyloseq

GP.MDS <- ordinate(GP100, method = "MDS", distance = "unifrac")

Here are just a few examples of other supported combinations.

# GP.NMDS <- ordinate(GP, 'NMDS', 'gower') GP.NMDS <- ordinate(GP, 'NMDS',
# 'bray') # perform NMDS on bray-curtis distance GP.NMDS.UF.ord <-
# ordinate(GP, 'NMDS') # UniFrac. Takes a while. GP.NMDS.wUF.ord <-
# ordinate(GP, 'NMDS', 'unifrac', weighted=TRUE) # weighted-UniFrac
# GP.NMDS.gower <- ordinate(GP, 'NMDS', 'gower')

The plot_ordination function

The plot_ordination function has many options, and supports many combinations of ordinations, including the mapping of sample and/or OTU variables to color and shape aesthetics. Many additional examples (with results) are included on the plot_ordination wiki page. For quicker reference, some example "1-liners" are also included at bottom of this section.

This combination of MDS/PCoA ordination of the UniFrac distance is recently very popular in microbiome analyses.

require("ggplot2")
ptitle <- "GP PCoA of UniFrac distance, GP most abundant 100 OTUs only"
p <- plot_ordination(GP100, GP.MDS, type = "samples", color = "SampleType", 
    title = ptitle)
p + geom_point(size = 5) + geom_polygon(aes(fill = SampleType))
plot of chunk GP100-UniFrac-PCoA

plot of chunk GP100-UniFrac-PCoA

Get the names of the most-abundant phyla, and use for subsetting.

top.phyla <- sort(tapply(taxa_sums(GP), tax_table(GP)[, "Phylum"], sum), TRUE)
## Error: could not find function "tax_table"
top.phyla <- top.phyla[1:5]
## Error: object 'top.phyla' not found
# Prune to just the most-abundant 5 phyla
GP2 <- subset_taxa(GP, Phylum %in% names(top.phyla))
## Error: could not find function "subset_taxa"
get_taxa_unique(GP2, "Phylum")
## Error: could not find function "get_taxa_unique"
# Prune further, to top 200 most abundant taxa of top 5 phyla
GP2 <- prune_taxa(names(sort(taxa_sums(GP2), TRUE)[1:200]), GP2)
## Error: could not find function "prune_taxa"

Let's try Correspondence Analysis with "one-liner" syntax in which we include the ordinate call within the plot_ordination command.

require("ggplot2")
p2 <- plot_ordination(GP2, ordinate(GP2, "CCA"), type = "samples", color = "SampleType")
## Error: object 'GP2' not found
p2 + geom_point(size = 5) + geom_polygon(aes(fill = SampleType))
## Error: object 'p2' not found
plot_ordination(GP2, ordinate(GP2, "CCA"), type = "taxa", color = "Phylum") + 
    geom_point(size = 4)
plot_ordination(GP2, ordinate(GP2, "CCA"), type = "split")
plot_ordination(GP2, ordinate(GP2, "CCA"), type = "split", color = "SampleType")
plot_ordination(GP2, ordinate(GP2, "CCA"), type = "biplot", shape = "Phylum")
plot_ordination(GP2, ordinate(GP2, "CCA"), type = "split", color = "Phylum", 
    label = "SampleType")
plot_ordination(GP2, ordinate(GP2, "CCA"), type = "split", color = "SampleType", 
    shape = "Phylum", label = "SampleType")

Mapping Continuous Variables to Color

(Note: can't map continuous variables to shape with plot_ordination function)

p4title <- "Bushman dataset, PCoA/MDS ordination on Bray-Curtis distance"
Bushman.ord <- ordinate(Bushman, method = "MDS", distance = "bray")
plot_ordination(Bushman, Bushman.ord, "samples", color = "OMEGA3_FATTY_ACIDS_G_AVE", 
    title = p4title)
plot of chunk Bushman-MDS-bray

plot of chunk Bushman-MDS-bray

The plot_heatmap() function

In a 2010 article in BMC Genomics, Rajaram and Oono describe an approach to creating a heatmap using ordination methods (namely, NMDS and PCA) to organize the rows and columns instead of (hierarchical) cluster analysis. In many cases the ordination-based ordering does a much better job than h-clustering at providing an order of elements that is easily interpretable. The authors provided an immediately useful example of their approach as the NeatMap package for R. The NeatMap package can be used directly on the abundance table ("otu_table"-class) of phylogenetic-sequencing data, but the NMDS or PCA ordination options that it supports are not based on ecological distances. To fill this void, and because phyloseq already provides support for a large number of ecological distances and ordination methods, phyloseq now includes the plot_heatmap() function: an ecology-oriented variant of the NeatMap approach to organizing a heatmap and build it using ggplot graphics tools. The distance and method arguments are the same as for the plot_ordination function, and support large number of distances and ordination methods, respectively, with a strong leaning toward ecology. This function also provides the options to re-label the OTU and sample axis-ticks with a taxonomic name and/or sample variable, respectively, in the hope that this might hasten your interpretation of the patterns (See the documentation for the sample.label and species.label arguments, and the examples below). Note that this function makes no attempt to overlay dendrograms from hierarchical clustering next to the axes, as hierarchical clustering is not used to organize these plots. Also note that each re-ordered axis repeats at the edge, and so apparent clusters at the far right/left or top/bottom of the heat-map may actually be the same. For now, the placement of this edge can be considered arbitrary, so beware of this artifact of the graphic and visually check if there are two "mergeable" clusters at the edges of a particular axis. If you benefit from this phyloseq-specific implementation of the NeatMap approach, please cite the NeatMap article, as well as phyloseq.

Further examples are provided at the plot_heatmap wiki page

plot_heatmap(GP2, "NMDS", "bray", "SampleType", "Family")
## Error: could not find function "plot_heatmap"

Some alternative plot_heatmap transformations.

# plot_heatmap(GP2, 'NMDS', 'bray', 'SampleType', 'Family',
# trans=log_trans(10)) plot_heatmap(GP2, 'NMDS', 'bray', 'SampleType',
# 'Family', trans=identity_trans()) plot_heatmap(GP2, 'NMDS', 'bray',
# 'SampleType', 'Family', trans=boxcox_trans(0.15))

Validation

In this section, we will look at examples for using R to validate/test hypotheses we may have generated through some of the previous exploration.

Multiple Testing

In this example we will perform testing on fractional abundances to remove effect of differences in total sequencing across samples for same taxa. We first filter the low-variance taxa, avoiding the noise and "penalty" we pay for testing taxa that don't vary at much across samples. In practice, these OTU abundances probably need additional preprocessing prior to testing, and many good methods in microarray analysis and differential expression sequencing could probably apply to this data (but are not directly implemented/supported in phyloseq, yet). Otherwise, beware that the old motto "garbage-in, garbage-out" can definitely apply to your data if you are not careful.

GPr = transform_sample_counts(GP, function(x) x/sum(x))
GP3f = filter_taxa(GPr, function(x) var(x) > 1e-05, TRUE)

We are going to use the multtest wrapper included in phyloseq, mt. Try ?mt for help on this function. The following code uses this wrapper to calculate the multiple-inference-adjusted P-values, using the "human" sample variable.

GP.fwer.table <- mt(GP3f, "human")
## B=10000
## b=100    b=200   b=300   b=400   b=500   b=600   b=700   b=800   b=900   b=1000  
## b=1100   b=1200  b=1300  b=1400  b=1500  b=1600  b=1700  b=1800  b=1900  b=2000  
## b=2100   b=2200  b=2300  b=2400  b=2500  b=2600  b=2700  b=2800  b=2900  b=3000  
## b=3100   b=3200  b=3300  b=3400  b=3500  b=3600  b=3700  b=3800  b=3900  b=4000  
## b=4100   b=4200  b=4300  b=4400  b=4500  b=4600  b=4700  b=4800  b=4900  b=5000  
## b=5100   b=5200  b=5300  b=5400  b=5500  b=5600  b=5700  b=5800  b=5900  b=6000  
## b=6100   b=6200  b=6300  b=6400  b=6500  b=6600  b=6700  b=6800  b=6900  b=7000  
## b=7100   b=7200  b=7300  b=7400  b=7500  b=7600  b=7700  b=7800  b=7900  b=8000  
## b=8100   b=8200  b=8300  b=8400  b=8500  b=8600  b=8700  b=8800  b=8900  b=9000  
## b=9100   b=9200  b=9300  b=9400  b=9500  b=9600  b=9700  b=9800  b=9900  b=10000 
## r=1  r=2 r=3 r=4 r=5 r=6 r=7 r=8 r=9 r=10    
## r=11 r=12    r=13    r=14    r=15    r=16    r=17    r=18    r=19    r=20    
## r=21 r=22    r=23    r=24    r=25    r=26    r=27    r=28    r=29    r=30    
## r=31 r=32    r=33    r=34    r=35    r=36    r=37    r=38    r=39    r=40    
## r=41 r=42    r=43    r=44    r=45    r=46    r=47    r=48    r=49    r=50    
## r=51 r=52    r=53    r=54    r=55    r=56    r=57    r=58    r=59    r=60    
## r=61 r=62    r=63    r=64    r=65    r=66    r=67    r=68    r=69    r=70    
## r=71 r=72    r=73    r=74    r=75    r=76    r=77    r=78    r=79    r=80    
## r=81 r=82    r=83    r=84    r=85    r=86    r=87    r=88    r=89    r=90    
## r=91 r=92    r=93    r=94    r=95    r=96    r=97    r=98    r=99    r=100   
## r=101    r=102   r=103   r=104   r=105   r=106   r=107   r=108   r=109   r=110   
## r=111    r=112   r=113   r=114   r=115   r=116   r=117   r=118   r=119   r=120   
## r=121    r=122   r=123   r=124   r=125   r=126   r=127   r=128   r=129   r=130   
## r=131    r=132   r=133   r=134   r=135   r=136   r=137   r=138   r=139   r=140   
## r=141    r=142   r=143   r=144   r=145   r=146   r=147   r=148   r=149   r=150   
## r=151    r=152   r=153   r=154   r=155   r=156   r=157   r=158   r=159   r=160   
## r=161    r=162   r=163   r=164   r=165   r=166   r=167   r=168   r=169   r=170   
## r=171    r=172   r=173   r=174   r=175   r=176   r=177   

No add some taxonomic columns to the result for interpretation (rows are OTUs)

jranks <- c("Phylum", "Family", "Genus")
GP.fwer.table <- data.frame(GP.fwer.table, tax_table(GP3f)[rownames(GP.fwer.table), 
    jranks])
subset(GP.fwer.table, adjp < 0.05)
##        index teststat  rawp   adjp plower        Phylum           Family
## 108747   173    2.574 4e-04 0.0300 0.0233    Firmicutes Streptococcaceae
## 348374   122    1.263 4e-04 0.0300 0.0233 Bacteroidetes   Bacteroidaceae
## 158660   110    2.313 5e-04 0.0356 0.0296 Bacteroidetes   Bacteroidaceae
##                Genus
## 108747 Streptococcus
## 348374   Bacteroides
## 158660   Bacteroides

What if we want FDR instead of FWER?

Or to use other tools in multtest-package?

library("multtest")
mtm <- mt(GP3f, "human")
## B=10000
## b=100    b=200   b=300   
## b=400    b=500   b=600   b=700   b=800   b=900   b=1000  b=1100  b=1200  b=1300  
## b=1400   b=1500  b=1600  b=1700  b=1800  b=1900  b=2000  b=2100  b=2200  b=2300  
## b=2400   b=2500  b=2600  b=2700  b=2800  b=2900  b=3000  b=3100  b=3200  b=3300  
## b=3400   b=3500  b=3600  b=3700  b=3800  b=3900  b=4000  b=4100  b=4200  b=4300  
## b=4400   b=4500  b=4600  b=4700  b=4800  b=4900  b=5000  b=5100  b=5200  b=5300  
## b=5400   b=5500  b=5600  b=5700  b=5800  b=5900  b=6000  b=6100  b=6200  b=6300  
## b=6400   b=6500  b=6600  b=6700  b=6800  b=6900  b=7000  b=7100  b=7200  b=7300  
## b=7400   b=7500  b=7600  b=7700  b=7800  b=7900  b=8000  b=8100  b=8200  b=8300  
## b=8400   b=8500  b=8600  b=8700  b=8800  b=8900  b=9000  b=9100  b=9200  b=9300  
## b=9400   b=9500  b=9600  b=9700  b=9800  b=9900  b=10000 r=1 r=2 r=3 
## r=4  r=5 r=6 r=7 r=8 r=9 r=10    r=11    r=12    r=13    
## r=14 r=15    r=16    r=17    r=18    r=19    r=20    r=21    r=22    r=23    
## r=24 r=25    r=26    r=27    r=28    r=29    r=30    r=31    r=32    r=33    
## r=34 r=35    r=36    r=37    r=38    r=39    r=40    r=41    r=42    r=43    
## r=44 r=45    r=46    r=47    r=48    r=49    r=50    r=51    r=52    r=53    
## r=54 r=55    r=56    r=57    r=58    r=59    r=60    r=61    r=62    r=63    
## r=64 r=65    r=66    r=67    r=68    r=69    r=70    r=71    r=72    r=73    
## r=74 r=75    r=76    r=77    r=78    r=79    r=80    r=81    r=82    r=83    
## r=84 r=85    r=86    r=87    r=88    r=89    r=90    r=91    r=92    r=93    
## r=94 r=95    r=96    r=97    r=98    r=99    r=100   r=101   r=102   r=103   
## r=104    r=105   r=106   r=107   r=108   r=109   r=110   r=111   r=112   r=113   
## r=114    r=115   r=116   r=117   r=118   r=119   r=120   r=121   r=122   r=123   
## r=124    r=125   r=126   r=127   r=128   r=129   r=130   r=131   r=132   r=133   
## r=134    r=135   r=136   r=137   r=138   r=139   r=140   r=141   r=142   r=143   
## r=144    r=145   r=146   r=147   r=148   r=149   r=150   r=151   r=152   r=153   
## r=154    r=155   r=156   r=157   r=158   r=159   r=160   r=161   r=162   r=163   
## r=164    r=165   r=166   r=167   r=168   r=169   r=170   r=171   r=172   r=173   
## r=174    r=175   r=176   r=177   

Re-order to original, and use raw p-values for adjustment via mt.rawp2adjp()

procedure <- c("Bonferroni", "Hochberg", "BH")
p.mtm <- mt.rawp2adjp(mtm[order(mtm[, "index"]), "rawp"], procedure)
# Re-order so that you can return original table (ordered p-value table)
p.adjp.ord <- p.mtm$adjp[order(p.mtm$index), ]
# Give it the original row names from m
rownames(p.adjp.ord) <- taxa_names(GP3f)
# Return the table of adjusted p-values for each hypothesis.
GP3f.mt.table <- data.frame(p.adjp.ord, tax_table(GP3f)[rownames(p.adjp.ord), 
    jranks])
# Re-rorder based on BH
GP3f.mt.table <- GP3f.mt.table[order(GP3f.mt.table[, "BH"]), ]
subset(GP3f.mt.table, BH < 0.05)
##          rawp Bonferroni Hochberg      BH        Phylum             Family
## 158660 0.0005     0.0885   0.0875 0.02950 Bacteroidetes     Bacteroidaceae
## 348374 0.0004     0.0708   0.0704 0.02950 Bacteroidetes     Bacteroidaceae
## 108747 0.0004     0.0708   0.0704 0.02950    Firmicutes   Streptococcaceae
## 561077 0.0019     0.3363   0.3249 0.04646 Cyanobacteria               <NA>
## 322235 0.0017     0.3009   0.2941 0.04646 Bacteroidetes     Bacteroidaceae
## 331820 0.0021     0.3717   0.3570 0.04646 Bacteroidetes     Bacteroidaceae
## 291090 0.0018     0.3186   0.3096 0.04646 Bacteroidetes Porphyromonadaceae
## 259569 0.0017     0.3009   0.2941 0.04646 Bacteroidetes      Rikenellaceae
##                  Genus
## 158660     Bacteroides
## 348374     Bacteroides
## 108747   Streptococcus
## 561077            <NA>
## 322235     Bacteroides
## 331820     Bacteroides
## 291090 Parabacteroides
## 259569       Alistipes

Some alternative packages for multiple inference correction: the qvalue package, the multcomp package

Getting phyloseq Data into Other R Tools

A common question from many users related to how they can easily get phyloseq-formatted data into other R tools. The following examples are meant to illustrate doing that with some commonly-requested tasks.

Porting Data to vegan Functions

The vegan package is a popular and well-maintained R package (hosted in CRAN) "for community and vegetation ecologists". It provides ordination methods, diversity analysis and other functions. Many of vegan's distance and ordination functions are wrapped by functions in phyloseq. Of course, not everything in vegan is wrapped by phyloseq, and it may turn out that you need to use some function that is not wrapped by phyloseq. What do you do? The following subsection provides example code for running just such a function by accessing and coercing the necessary data components from a phyloseq data object.

For OTU abundance tables, vegan expects samples as rows, and OTUs/species/taxa as columns (so does the picante package). The following is an example function, called veganotu, for extracting the OTU table from a phyloseq data object, and converting it to a properly oriented standard matrix recognized by the vegan package.

We will use the bioenv function from vegan to test for sample variables that correlate well with the microbial community distances.

veganotu <- function(physeq) {
    require("vegan")
    OTU <- otu_table(physeq)
    if (taxa_are_rows(OTU)) {
        OTU <- t(OTU)
    }
    return(as(OTU, "matrix"))
}

Now we can use this function for data input to vegan functions. Let's try this with the Bushman dataset, since it has ample continuous variables with which to correlate microbiom distances. First, let's coerce the Bushman sample data component into a vanilla "data.frame" class that we will call bushsd.

keepvariables <- which(sapply(sample_data(Bushman), is.numeric))
bushsd <- data.frame(sample_data(Bushman))[keepvariables]

Now let's make a call to bioenv to see what happens... (not actually run in example, takes too long, be prepared to stop the run if you try it)

# bioenv(veganotu(Bushman), bushsd)

There are so many sample variables in the Bushman data that we would have to calculate 1.225996e+55 possible subsets in an exhaustive search calculation. That could take a long time! Note from the bioenv documentation that we could have expected that issue:

"There are 2^p-1 subsets of p variables, and an exhaustive search may take a very, very, very long time (parameter upto offers a partial relief)."

So one option is to use the upto parameter. Another is to specify a model formula when specifying the primary data argument to this function (see ?"~" for some details), which allows us to specify variables in the correlation search. We could also simply trim the number of columns in bushsd to just the few variables that we really care about. Let's try the second option here, the model formula approach, and focus on a few that we are interested in comparing. We'll look at the variable names again to remind us what is available

names(bushsd)[1:10]
##  [1] "ASPARTAME_MG_AVE"            "TOT_CONJUGLINOLEICA_G_AVE"  
##  [3] "AGE"                         "VITC_ASCORBIC_ACID_MG_AVE"  
##  [5] "OXALIC_ACID_MG_AVE"          "PUFA_EICOSAPENTAENOIC_G_AVE"
##  [7] "PHOSPHORUS_G_AVE"            "SOLUBLE_DIETARY_FIBER_G_AVE"
##  [9] "SFA_MARGARIC_ACID_G_AVE"     "CLA_TRANS10_CIS12_G_AVE"    

I arbitrarily chose a mixture of variables from the full list previewed above.

bioenv(veganotu(Bushman) ~ DEPTH + AGE + TOTAL_FAT_G_AVE + INSOLUBLE_DIETARY_FIBER_G_AVE, 
    bushsd)
## 
## Call:
## bioenv(formula = veganotu(Bushman) ~ DEPTH + AGE + TOTAL_FAT_G_AVE +      INSOLUBLE_DIETARY_FIBER_G_AVE, data = bushsd) 
## 
## Subset of environmental variables with best correlation to community data.
## 
## Correlations:      spearman 
## Dissimilarities:   bray 
## 
## Best model has 2 parameters (max. 4 allowed):
## AGE INSOLUBLE_DIETARY_FIBER_G_AVE
## with correlation  0.1537 
## 

Now that you know how to get phyloseq data into vegan, all of vegan's tools are now available to you to use as well. The following list is not exhaustive and focuses on the most popular tools, taken from the vegan website front page:

Ordination Example on the Gap Statistic

Gap Statistic: How many clusters are there?

From the clusGap documentation: The clusGap function from the cluster package calculates a goodness of clustering measure, called the “gap” statistic. For each number of clusters k, it compares (W(k)) with E^*[(W(k))] where the latter is defined via bootstrapping, i.e. simulating from a reference distribution.

The following is an example performing the gap statistic on ordination results calculated using phyloseq tools, followed by an example of how a ggplot-based wrapper for this example might be included in the phyloseq package.

First perform the ordination using correspondence analysis

library("cluster")
# Load data
data(enterotype)
# ordination
ent.ca <- ordinate(enterotype, method = "CCA", distance = NULL)

Now the Gap Statistic code

pam1 <- function(x, k) list(cluster = pam(x, k, cluster.only = TRUE))
x <- scores(ent.ca, display = "sites")
# gskmn <- clusGap(x[, 1:2], FUN=kmeans, nstart=20, K.max = 6, B = 500)
gskmn <- clusGap(x[, 1:2], FUN = pam1, K.max = 6, B = 50)
gskmn
## Clustering Gap statistic ["clusGap"].
## B=50 simulated reference sets, k = 1..6
##  --> Number of clusters (method 'firstSEmax', SE.factor=1): 3
##       logW E.logW   gap  SE.sim
## [1,] 4.544  5.746 1.201 0.02188
## [2,] 3.720  5.190 1.470 0.02070
## [3,] 3.428  4.928 1.500 0.02031
## [4,] 3.301  4.777 1.476 0.02066
## [5,] 3.100  4.682 1.582 0.02140
## [6,] 2.957  4.598 1.641 0.01962

That's nice. Just in case it is useful, let's look at what the wrapper-function might look like.

gap_statistic_ordination <- function(ord, FUNcluster, type = "sites", K.max = 6, 
    axes = c(1:2), B = 500, verbose = interactive(), ...) {
    require("cluster")
    # If 'pam1' was chosen, use this internally defined call to pam
    if (FUNcluster == "pam1") {
        FUNcluster <- function(x, k) list(cluster = pam(x, k, cluster.only = TRUE))
    }
    # Use the scores function to get the ordination coordinates
    x <- scores(ord, display = type)
    # If axes not explicitly defined (NULL), then use all of them
    if (is.null(axes)) {
        axes <- 1:ncol(x)
    }
    # Finally, perform, and return, the gap statistic calculation using
    # cluster::clusGap
    clusGap(x[, axes], FUN = FUNcluster, K.max = K.max, B = B, verbose = verbose, 
        ...)
}
# Define a plot method for results...
plot_clusgap <- function(clusgap, title = "Gap Statistic calculation results") {
    require("ggplot2")
    gstab <- data.frame(gs$Tab, k = 1:nrow(gs$Tab))
    p <- ggplot(gstab, aes(k, gap)) + geom_line() + geom_point(size = 5)
    p <- p + geom_errorbar(aes(ymax = gap + SE.sim, ymin = gap - SE.sim))
    p <- p + opts(title = title)
    return(p)
}

Now try out this function. Should work on ordination classes recognized by scores function, and provide a ggplot graphic instead of a base graphic. (Special Note: the phyloseq-defined scores extensions are not exported as regular functions to avoid conflict, so phyloseq-defined scores extensions can only be accessed with the phyloseq::: namespace prefix in front.)

gs <- gap_statistic_ordination(ent.ca, "pam1", B = 50, verbose = FALSE)
print(gs, method = "Tibs2001SEmax")
## Clustering Gap statistic ["clusGap"].
## B=50 simulated reference sets, k = 1..6
##  --> Number of clusters (method 'Tibs2001SEmax', SE.factor=1): 3
##       logW E.logW   gap  SE.sim
## [1,] 4.544  5.747 1.202 0.02210
## [2,] 3.720  5.188 1.468 0.02352
## [3,] 3.428  4.928 1.500 0.01737
## [4,] 3.301  4.780 1.479 0.02058
## [5,] 3.100  4.684 1.584 0.02061
## [6,] 2.957  4.598 1.641 0.01946
plot_clusgap(gs)
## Error: object 'gs' not found

Base graphics plotting, for comparison.

plot(gs, main = "Gap statistic for the 'Enterotypes' data")
mtext("k = 2 is best ... but  k = 3  pretty close")
plot of chunk unnamed-chunk-38

plot of chunk unnamed-chunk-38

Future Directions for phyloseq

Suggestions from Attendees/Developers

Don't be afraid to post feedback / needs on the phyloseq issues tracker:

https://github.com/joey711/phyloseq/issues

Big(ger) Data

The firehose of new-gen sequencing data is making possible "big" datasets in this realm. We are considering some of the best approaches to help a tool like phyloseq address computational issues that are arising from dealing with this data, without compromising some of the other features (interactivity, reproducibility, connection with existing R tools). Some promising tools already available for R that might help include:

Sparse matrix classes

Like in the Matrix package. This mainly applies to in-ram computations, especially when the full matrix is actually needed. In principle, this might only apply be required to represent the preprocessed data, which could still be sparse.

Store full dataset in a database

Some suggested packages at this conference so far, obviousl not-yet implemented in phyloseq, are:

Use BioConductor tools for representing data