Demo: phyloseq – An R package for microbiome census data

Paul J. McMurdie and Susan Holmes

Statistics Department, Stanford University,

Stanford, CA 94305, USA

E-mail

mcmurdie@stanford.edu

susan@stat.stanford.edu

Websites

joey711.github.com/phyloseq

http://www-stat.stanford.edu/~susan/


Summary and Other Documentation Resources

Download all demo materials

The phyloseq main page

This link is the official starting point for phyloseq-related documentation, including links to the key tutorials for phyloseq functionality, installation, and extension.

The phyloseq Issue Tracker

There is a GitHub-hosted issue-tracker for phyloseq, currently describing over 100 feature requests, bug reports, documentation revisions, help requests, and other openly-documented development communication. Only a small fraction of these issues are still outstanding, but the descriptions remain available for anyone to view and add comments/code. If you have a problem with phyloseq that isn't already mentioned on the issue tracker, please contribute to phyloseq by posting the issue! These issues can be about anything related to using the phyloseq package, and help encourage development and generally solve problems. Solutions are often posted with code or the version number of phyloseq that includes a fix (if it is a bug), and these “threads” sit around for others to benefit from as well.


Installation

The phyloseq package is under active development. Users are encouraged to update their version to the latest phyloseq development release on GitHub for the access to the latest fixes/features. The most stable releases and development versions of phyloseq are hosted by Bioconductor. Please see the official installation tutorial for further details.


Load phyloseq

Of course we need to start this tutorial by loading the phyloseq package. This assumes you have already installed phyloseq.

library("phyloseq")
packageVersion("phyloseq")
## [1] '1.7.24'
library("ggplot2")
packageVersion("ggplot2")
## [1] '0.9.3.1'
library("scales")
packageVersion("scales")
## [1] '0.2.3'
library("grid")
packageVersion("grid")
## [1] '3.0.2'

ggplot2 package theme set. See the ggplot2 online documentation for further help.

theme_set(theme_bw())

In-package documentation

The phyloseq package includes extensive documentation of all functions, data, and classes; usually with example code that you can run yourself. For example, here are two different R commands for accessing documentation (manual) pages for phyloseq

`?`("phyloseq-package")
`?`(phyloseq)

The latter loads instead the documentation for the constructor function named phyloseq()

In-package vignettes

vignettes are included in phyloseq. A quick way to load them from within R is:

vignette("phyloseq_basics")
vignette("phyloseq_analysis")

Basic Data Import

Importing the Output from QIIME

otufile = system.file("extdata", "GP_otu_table_rand_short.txt.gz", package = "phyloseq")
mapfile = system.file("extdata", "master_map.txt", package = "phyloseq")
trefile = system.file("extdata", "GP_tree_rand_short.newick.gz", package = "phyloseq")
qiimex = import_qiime(otufile, mapfile, trefile, showProgress = FALSE)
## Processing map file...
## Processing otu/tax file...
## Reading file into memory prior to parsing...
## Detecting first header line...
## Header is on line 2  
## Converting input file to a table...
## Defining OTU table... 
## Parsing taxonomy table...
## Processing phylogenetic tree...
##  /Library/Frameworks/R.framework/Versions/3.0/Resources/library/phyloseq/extdata/GP_tree_rand_short.newick.gz ...
## Warning: treefilename failed import. It will not be included.
qiimex
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 500 taxa and 26 samples ]
## sample_data() Sample Data:       [ 26 samples by 7 sample variables ]
## tax_table()   Taxonomy Table:    [ 500 taxa by 7 taxonomic ranks ]

Importing the Output from mothur

Example: Manually re-import the “esophagus dataset”, which is already inclduded in the phyloseq package. Note that the esophagus dataset is a simple dataset consisting of just 3 samples and a relatively small richness, described with just a tree and OTU table. It is most useful for showing quick examples, such as computing the UniFrac distance, but for not much else. If interested, further details can be found by entering ?esophagus.

mothlist = system.file("extdata", "esophagus.fn.list.gz", package = "phyloseq")
mothgroup = system.file("extdata", "esophagus.good.groups.gz", package = "phyloseq")
mothtree = system.file("extdata", "esophagus.tree.gz", package = "phyloseq")
cutoff = "0.10"
esophman = import_mothur(mothlist, mothgroup, mothtree, cutoff)
print(esophman)
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 591 taxa and 3 samples ]
## phy_tree()    Phylogenetic Tree: [ 591 tips and 590 internal nodes ]
ntaxa(esophman)
## [1] 591

Let's test if they are identical as expected

data(esophagus)
identical(esophagus, esophman)
## [1] FALSE

Importing biom-format files

The biom-format is intended to be a complete representation of the OTU-clustering results, and so import can be performed with just one connection/file path.

rich_sparse_biom = system.file("extdata", "rich_sparse_otu_table.biom", package = "phyloseq")
rich_sparse = import_biom(rich_sparse_biom, parseFunction = parse_taxonomy_greengenes)
print(rich_sparse)
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 5 taxa and 6 samples ]
## sample_data() Sample Data:       [ 6 samples by 4 sample variables ]
## tax_table()   Taxonomy Table:    [ 5 taxa by 7 taxonomic ranks ]
rank_names(rich_sparse)
## [1] "Kingdom" "Phylum"  "Class"   "Order"   "Family"  "Genus"   "Species"

Note: the current biom-format definition lacks a phylogenetic tree. I am working with the biom-format team on including a phylogenetic tree in the next version of the format, and have contributed a preliminary R package for biom-format I/O support as a pull-request to the biom-format development page on GitHub.

The biom-format definition allows for both sparse and dense representations of the abundance data, and is also flexible enough to allow a “minimal” (abundance table onle) and “rich” forms (includes sample and taxonomy data). All of these forms are supported and automatically recognized/interpreted in phyloseq through the import_biom function.

Define file path to all four format combinations

rich_dense_biom = system.file("extdata", "rich_dense_otu_table.biom", package = "phyloseq")
rich_sparse_biom = system.file("extdata", "rich_sparse_otu_table.biom", package = "phyloseq")
min_dense_biom = system.file("extdata", "min_dense_otu_table.biom", package = "phyloseq")
min_sparse_biom = system.file("extdata", "min_sparse_otu_table.biom", package = "phyloseq")

Each import is only one line, we will import four different example “datasets”

rich_dense = import_biom(rich_dense_biom, parseFunction = parse_taxonomy_greengenes)
rich_sparse = import_biom(rich_sparse_biom, parseFunction = parse_taxonomy_greengenes)
min_dense = import_biom(min_dense_biom, parseFunction = parse_taxonomy_greengenes)
min_sparse = import_biom(min_sparse_biom, parseFunction = parse_taxonomy_greengenes)

Print summary if phyloseq, the component class otherwise.

biom_ex_print = function(i) {
    if (class(i) != "phyloseq") {
        class(i)
    } else {
        i
    }
}
sapply(list(rich_dense, rich_sparse, min_dense, min_sparse), biom_ex_print)
## [[1]]
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 5 taxa and 6 samples ]
## sample_data() Sample Data:       [ 6 samples by 4 sample variables ]
## tax_table()   Taxonomy Table:    [ 5 taxa by 7 taxonomic ranks ]
## 
## [[2]]
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 5 taxa and 6 samples ]
## sample_data() Sample Data:       [ 6 samples by 4 sample variables ]
## tax_table()   Taxonomy Table:    [ 5 taxa by 7 taxonomic ranks ]
## 
## [[3]]
## [1] "otu_table"
## 
## [[4]]
## [1] "otu_table"

A More Complicated Import Example

One of the example datasets included in the phyloseq package is derived from the study first describing human microbiome “Enterotypes”, and that dataset is called simply enterotype. It will be called in later examples using the data command.

A more recent study investigating human microbiome “Enterotypes” is titled Linking Long-Term Dietary Patterns with Gut Microbial Enterotypes by Wu et al., Science, 334 (6052), 105–108. One of the three corresponding authors has the last name “Bushman”, which also happens to be the title of the QIIME-processed version of this dataset at the microbio.me/qiime database. Last accessed 2013-April-22, the data from this study had the name Bushman_enterotypes_COMBO, with study_1011 as an alternative index. Here is the ftp address for a compressed copy of this data. The microbio.me servers might require that you log in before you can download from this link.

Since I don't have any direct relationship with that service, I cannot guarantee that the previous link will remain valid. However, for convenience and stability, I already downloaded these “Bushman” data files, and saved the uncompressed, unmodified files locally within the phyloseq-demo repository – in which this demo is being run. Thus, the examples below are showing how to import the data using phyloseq and when the files are in the current working directory. An even more complicated “direct import” example is provided in the last section (“Extra Example”) below, but produces the same result and is not run by the embedded code.

Import Bushman data, already downloaded, main file unzipped.

biom_file = "study_1011_closed_reference_otu_table.biom"
map_file = "study_1011_mapping_file.txt"
biomot = import_biom(biom_file, parseFunction = parse_taxonomy_greengenes)
bmsd = import_qiime_sample_data(map_file)
class(bmsd)
## [1] "sample_data"
## attr(,"package")
## [1] "phyloseq"
dim(bmsd)
## [1] 102 225
biomot
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1873 taxa and 100 samples ]
## tax_table()   Taxonomy Table:    [ 1873 taxa by 7 taxonomic ranks ]

Even though the biom-format definition is intended to also store sample meta data as one of its required fields, in this case the sample metadata was stored as the tab-delimited text file "study_1011_mapping_fiel.txt". It is easy to combine the two data objects that resulted from the import of these two files, bmsd and biomot using the merge_phyloseq function, explained in the next section.

Merging datasets or components

We need to merge these two separate Bushman dataset objects into one “phyloseq” object. Presently, the two data objects contain the otu_table/tax_table, and sample_data components, respectively. If tree and/or representative sequence data were also available, we could combine them as well. If we had three objects that were all components (think single tables, or a tree), then we would use the constructor function, phyloseq. However, because the .biom file contains two tables (including an otu_table), the import_biom function returned a valid "phyloseq-class" instance instead that contains both components. Whenever you need to add or merge data componentes from one (or more) phyloseq-class objects, the merging function, merge_phyloseq, is recommended, rather than the constructor (phyloseq).

Bushman = merge_phyloseq(biomot, bmsd)
Bushman
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1873 taxa and 100 samples ]
## sample_data() Sample Data:       [ 100 samples by 225 sample variables ]
## tax_table()   Taxonomy Table:    [ 1873 taxa by 7 taxonomic ranks ]
plot_richness(Bushman, x = "SEX", color = "SOLUBLE_DIETARY_FIBER_G_AVE") + geom_boxplot()

plot of chunk merge-bushmann-comps

Extra Example: the Human Microbiome Project

See the demo page devoted to importing the HMP dataset: Import the HMP-v35 Dataset. It is an example importing with phyloseq the files produced by Qiime after being run on the Human Microbiome Project's v35 dataset, which is avilable from HMP-DACC.

The code takes about 35 minutes on a laptop, and we are providing the resulting phyloseq-formatted result as an .RData file, so that you do not have to repeat the import process if you do not want to. See the HMP demo for more details.

Extra Example: Direct ftp Download, Unzip, and Import

The .biom and sample data files are also provided online (ftp), and a useful way to download and import into phyloseq directly from the ftp address in the following example code. This is an example in which we download a zip file with both biom- and qiime-formatted data, unzip it in a temporary directory from with in R, import the relavant files using phyloseq importers, and then delete the temporary files. This code should be platform independent, but occasionally there are finicky Windows issues that arise.

(Note: this is not actually run in this demo. Would be redundant, and occasionally Windows issues might crash it, based on experience.)

zipftp = "ftp://thebeast.colorado.edu/pub/QIIME_DB_Public_Studies/study_1011_split_library_seqs_and_mapping.zip"
# First create a temporary directory in which to store the unpacked file(s)
# from the .zip
tmpdir = tempdir()
# Second create a temp file where you will put the .zip-file itself
temp = tempfile()
# Now download the file and unzip to tmpdir directory
download.file(zipftp, temp)
unzip(temp, exdir = tmpdir)
# Define the biom file-path
biom_file = file.path(tmpdir, list.files(tmpdir, pattern = ".biom"))
# Define the mapping file-path
map_file = file.path(tmpdir, list.files(tmpdir, pattern = "mapping"))
# Now import the .biom-formatted otu_table/taxonomyTable file.
biom_otu_tax = import_biom(biom_file, "greengenes")
# Add sample data to the dataset using merge
bmsd = import_qiime_sample_data(map_file)
# Remove the temperorary file and directory where you unpacked the zip files
unlink(temp)
unlink(tmpdir)

Basic Interaction with phyloseq Data

Now that we've gone through some examples of importing data, let's look at some basic commands for interacting with the data/

Print method

Bushman
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1873 taxa and 100 samples ]
## sample_data() Sample Data:       [ 100 samples by 225 sample variables ]
## tax_table()   Taxonomy Table:    [ 1873 taxa by 7 taxonomic ranks ]

Convenience accessors

ntaxa(Bushman)
## [1] 1873
nsamples(Bushman)
## [1] 100
sample_names(Bushman)[1:10]
##  [1] "C.4000.01.P1.405214" "C.3016.01.P1.405202" "C.3019.01.P1.405160"
##  [4] "C.4011.01.P1.405200" "C.4004.01.P1.405170" "C.3007.01.P1.405153"
##  [7] "C.3015.01.P1.405218" "C.3009.01.P1.405180" "C.3004.01.P1.405192"
## [10] "C.4014.01.P1.405150"
taxa_names(Bushman)[1:10]
##  [1] "248563" "110059" "223351" "358030" "367581" "16076"  "313844"
##  [8] "49837"  "517282" "296166"

Interacting with the sample variables

This is useful later in plotting

sample_variables(Bushman)[1:10]
##  [1] "X.SampleID"                  "BarcodeSequence"            
##  [3] "LinkerPrimerSequence"        "ASSIGNED_FROM_GEO"          
##  [5] "ASPARTAME_MG_AVE"            "TOT_CONJUGLINOLEICA_G_AVE"  
##  [7] "AGE"                         "VITC_ASCORBIC_ACID_MG_AVE"  
##  [9] "OXALIC_ACID_MG_AVE"          "PUFA_EICOSAPENTAENOIC_G_AVE"
length(sample_variables(Bushman))
## [1] 225

There are 225 sample variables in the Bushman sample metadata. How can we look at the values for a particular variable? (here I've arbitrarily chosen the 5th sample variable)

get_variable(Bushman, sample_variables(Bushman)[5])[1:10]
##  [1]   0.00  70.50   0.00   0.00  10.26   0.00   0.00  94.63   0.00 157.87

Interacting with the taxonomic ranks

Knowing the available taxonomic ranks and, for instance, how many different phyla are represented can be useful in understanding the data and organizing informative graphics. I will show some of those later.

rank_names(Bushman)
## [1] "Kingdom" "Phylum"  "Class"   "Order"   "Family"  "Genus"   "Species"
get_taxa_unique(Bushman, "Phylum")
##  [1] "Firmicutes"      "Bacteroidetes"   "Tenericutes"    
##  [4] "Proteobacteria"  "Cyanobacteria"   "Actinobacteria" 
##  [7] "Lentisphaerae"   "Fusobacteria"    "TM7"            
## [10] "Verrucomicrobia" "Synergistetes"

What if we wanted to change the rank_names returned in the previous chunk? We knew (in this case) that the “greengenes” taxonomy tools were used for the assignment, and so we provided an appropriate parsing function for labeling the taxonomic ranks. However, in some cases we might have less information, and arbitrary/“dummy” rank names that we want to rename later. The following code is not run, but would rename the rank names by assigning them to the colnames (column names) of the Bushman taxonomy table.

colnames(tax_table(Bushman)) <- c(k = "Kingdom", p = "Phylum", c = "Class", 
    o = "Order", f = "Family", g = "Genus", s = "Species")

Abundance Accessors

The purposes of the sample_sums and taxa_sums function are pretty straightforward, but the get_sample and get_taxa functions can bet a bit confusing.

sample_sums(Bushman)[1:10]
## C.4000.01.P1.405214 C.3016.01.P1.405202 C.3019.01.P1.405160 
##                6004                5642                7870 
## C.4011.01.P1.405200 C.4004.01.P1.405170 C.3007.01.P1.405153 
##                4758                3944               10111 
## C.3015.01.P1.405218 C.3009.01.P1.405180 C.3004.01.P1.405192 
##                6448                4825                7290 
## C.4014.01.P1.405150 
##                3275
taxa_sums(Bushman)[1:10]
## 248563 110059 223351 358030 367581  16076 313844  49837 517282 296166 
##     53     71    692     90  15559   1219      3     18      4      1
get_taxa(Bushman, sample_names(Bushman)[5])[1:10]
## 248563 110059 223351 358030 367581  16076 313844  49837 517282 296166 
##      0     46      0      0      0      0      0      0      0      0
get_sample(Bushman, taxa_names(Bushman)[5])[1:10]
## C.4000.01.P1.405214 C.3016.01.P1.405202 C.3019.01.P1.405160 
##                 255                   0                 114 
## C.4011.01.P1.405200 C.4004.01.P1.405170 C.3007.01.P1.405153 
##                 329                   0                   0 
## C.3015.01.P1.405218 C.3009.01.P1.405180 C.3004.01.P1.405192 
##                  58                 351                  77 
## C.4014.01.P1.405150 
##                   8

Note how a sample name is required by get_taxa, and vice versa. This might seem confusing at first, but get_taxa is returning all the OTU abundances from one sample, while get_sample is returning the abundances from all samples for one OTU.


Simple Summary Graphics

Note that we loaded some graphics related packages at the beginning of this demo. They will be used in some of these examples.

Some examples for plotting richness estimates from un-trimmed data.

plot_richness(Bushman)  #, 'sample_names', 'SampleType')

plot of chunk plot-richness-1

(p = plot_richness(Bushman, x = "SEX"))

plot of chunk plot-richness-1

p + geom_boxplot(data = p$data, aes(x = SEX, y = value, color = NULL), alpha = 0.1)

plot of chunk plot-richness-1

Some others you might try (not run in demo)

plot_richness(Bushman, x = "INSOLUBLE_DIETARY_FIBER_G_AVE")
plot_richness(Bushman, x = "AGE_IN_YEARS")
plot_richness(Bushman, x = "VEGETABLE_PROTEIN_G_AVE")

Plotting an Annotated Phylogenetic Tree

A useful display on a phylogenetic tree is to add points next to tips/leaves/OTUs to represent samples in which the OTU was observed. This is facilitated in the phyloseq package through the plot_tree function, which produces a ggplot-based phylogenetic tree, and also allows several options for mapping color, shape, and size of these sample points to variables in the dataset. These point aesthetics can be mapped to sample data or to taxonomic data, depending on needs and which information needs to be reinforced in your graphic.

Caution: Trying to plot too many taxa (tree tips) at once obscures meaning. Let's look at just the Chlamydiae phylum in the incldued GlobalPatterns dataset. Note that this also requires subsetting the GlobalPatterns dataset using the subset_taxa function, part of the “preprocessing” tools described in the following section.

data(GlobalPatterns)
GlobalPatterns
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 19216 taxa and 26 samples ]
## sample_data() Sample Data:       [ 26 samples by 7 sample variables ]
## tax_table()   Taxonomy Table:    [ 19216 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 19216 tips and 19215 internal nodes ]
GP.chl = subset_taxa(GlobalPatterns, Phylum == "Chlamydiae")

Map the sample environment ("SampleType") to point color, and taxonomic family to point shape. Additionally, label the tips with the Genus name and scale the point size by abundance

plot_tree(GP.chl, color = "SampleType", shape = "Family", label.tips = "Genus", 
    size = "abundance", plot.margin = 0.5, ladderize = TRUE)

plot of chunk plot_tree-chlam

This figure contains the graphic produced by the plot_tree function in phyloseq. In this case the data is a subset of the GlobalPatterns dataset in which only OTUs from the phylum Chlamydiae are included. Additionally, the tree has been annotated with genus labels at each tree tip. The points next to tips represent samples in which the OTU was observed, and are shaped according to taxonomic rank of Family, and shaded according to the sample type (sample source environment).

plot_tree(GP.chl, "treeonly", nodeplotblank)

plot of chunk unnamed-chunk-9

This figure is the result of plotting the “bare” tree with no options directing the mapping of a variable to the tree, and "treeonly" as the argument to method. Not as informative as the previous tree.

Abundance bar plots

For direct quantitative observation/comparison of abundances.

data(enterotype)
TopNOTUs = names(sort(taxa_sums(enterotype), TRUE)[1:10])
ent10 = prune_taxa(TopNOTUs, enterotype)
plot_bar(ent10, "SeqTech", fill = "Enterotype", facet_grid = ~Genus)

plot of chunk plot_bar-ex1

plot_bar(ent10, "Genus", fill = "Genus", facet_grid = SeqTech ~ Enterotype)

plot of chunk plot_bar-ex2

This one takes a little while to render (not run).

plot_bar(Bushman, "Phylum", NULL, 0.9, "SEX", "INSOLUBLE_DIETARY_FIBER_G_AVE")

Preprocessing Abundance Data

This section includes examples of preprocessing (filtering, trimming, subsetting, etc) phyloseq data. There is a whole tutorial page devoted to preprocessing, and more exmaples and discussion could always be added, it seems. For starters, we will reset the GlobalPatterns example data, add a human category, and show a few options for preprocessing here.

data(GlobalPatterns)
GlobalPatterns
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 19216 taxa and 26 samples ]
## sample_data() Sample Data:       [ 26 samples by 7 sample variables ]
## tax_table()   Taxonomy Table:    [ 19216 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 19216 tips and 19215 internal nodes ]
# prune OTUs that are not present in at least one sample
GP = prune_taxa(taxa_sums(GlobalPatterns) > 0, GlobalPatterns)
# Define a human-associated versus non-human categorical variable:
sample_data(GP)$human = factor(get_variable(GP, "SampleType") %in% c("Feces", 
    "Mock", "Skin", "Tongue"))

“Rarefy” Abundances to Even Depth

Although of perhaps dubious necessity, it is common for OTU abundance tables to be randomly subsampled to even sequencing depth prior various analyses, especially UniFrac. Here is an example comparing the UniFrac-PCoA results with and without “rarefying” the abundances (Requires phyloseq v1.1.28+)

Test with esophagus dataset

data(esophagus)
eso = rarefy_even_depth(esophagus)
## You set `rngseed` to FALSE. Make sure you've set & recorded
##  the random seed of your session for reproducibility.
## See `?set.seed`
## 
## ...
## 14OTUs were removed because they are no longer 
## present in any sample after random subsampling
## 
## ...
# plot(as(otu_table(eso), 'vector'), as(otu_table(esophagus), 'vector'))
UniFrac(eso)
##        B      C
## C 0.4981       
## D 0.4802 0.5288
UniFrac(esophagus)
##        B      C
## C 0.5176       
## D 0.5182 0.5422

Test with GlobalPatterns dataset

data(GlobalPatterns)
GP.chl = subset_taxa(GlobalPatterns, Phylum == "Chlamydiae")
# remove the samples that have less than 20 total reads from Chlamydiae
GP.chl = prune_samples(names(which(sample_sums(GP.chl) >= 20)), GP.chl)
# (p = plot_tree(GP.chl, color='SampleType', shape='Family',
# label.tips='Genus', size='abundance'))
GP.chl.r = rarefy_even_depth(GP.chl)
## You set `rngseed` to FALSE. Make sure you've set & recorded
##  the random seed of your session for reproducibility.
## See `?set.seed`
## 
## ...
## 1OTUs were removed because they are no longer 
## present in any sample after random subsampling
## 
## ...
# plot(as(otu_table(GP.chl.r), 'vector'), as(otu_table(GP.chl), 'vector'))

To compare MDS of unweighted-UniFrac for GP.chl and GP.chl.r (default distance is unweighted UniFrac)

plot_ordination(GP.chl, ordinate(GP.chl, "MDS"), color = "SampleType") + geom_point(size = 5)

plot of chunk unnamed-chunk-15

plot_ordination(GP.chl.r, ordinate(GP.chl.r, "MDS"), color = "SampleType") + 
    geom_point(size = 5)

plot of chunk unnamed-chunk-16

How does rarefying affect the larger untrimmed dataset? (not run)

GP.r = rarefy_even_depth(GP)
plot_ordination(GP, ordinate(GP), color = "SampleType") + geom_point(size = 5)
plot_ordination(GP.r, ordinate(GP.r), color = "SampleType") + geom_point(size = 5)

prune_taxa() vs. subset_taxa()

These are two very different methods for subsetting OTUs in a dataset.

topN = 20
most_abundant_taxa = sort(taxa_sums(GP), TRUE)[1:topN]
print(most_abundant_taxa)
##  549656  331820  279599  360229  317182   94166  158660  329744  550960 
## 2481214 1001492  927850  556441  528629  444142  443938  436539  384205 
##  189047  326977  317658  244304  171551  263681   98605   12812  536311 
##  312161  291577  234024  222365  215470  212089  203583  196550  196355 
##  192573  298875 
##  179312  177879
GP20 = prune_taxa(names(most_abundant_taxa), GP)
# Exploratory tree #1 (Note, too ma) plot_tree(ex2, color='SampleType',
# label.tips='Family', size='abundance')
ntaxa(GP20)
## [1] 20
length(get_taxa_unique(GP20, "Phylum"))
## [1] 5

That was prune_taxa, applicable when we know (or can provide) the OTU IDs of the OTUs we want to retain in the dataset, or a logical vector with the same length as ntaxa reseulting from a test (useful for genefilter orgenefilter_sample results (see next section)).

Alternatively, you can subset based on taxonomy expressions, using subset_taxa.

GP.chl = subset_taxa(GP, Phylum == "Chlamydiae")
# Exploratory tree #2 plot_tree(GP.chl, color='SampleType', shape='Family',
# label.tips='Genus', size='abundance')
ntaxa(GP.chl)
## [1] 21
length(get_taxa_unique(GP.chl, "Phylum"))
## [1] 1

filterfun_sample() and genefilter_sample()

This tool takes after the genefilter function from the genefilter package, but emphasizes within-microbiome conditions. The following code illustrates . The function topp is a filter function that returns the most abundant p fraction of taxa. The filterfun_sample function takes one or or more functions like topp and binds them in order to define a filtering protocol function, in this case called: f1. This function, f1, is then passed to genefilter_sample along with the dataset that is going to be pruned as well as a value for A, the number of samples in which an OTU must pass the filtering conditions.

topp(0.1)
## function (x) 
## {
##     if (na.rm) {
##         x = x[!is.na(x)]
##     }
##     x >= sort(x, decreasing = TRUE)[ceiling(length(x) * p)]
## }
## <environment: 0x12a067d40>
f1 = filterfun_sample(topp(0.1))
print(f1)
## function (x) 
## {
##     fun = flist[[1]]
##     fval = fun(x)
##     for (fun in flist[-1]) {
##         fval = fval & fun(x)
##     }
##     return(fval)
## }
## <environment: 0x124ec25b8>
## attr(,"class")
## [1] "filterfun"
wh1 = genefilter_sample(GP, f1, A = round(0.5 * nsamples(GP)))
sum(wh1)
## [1] 793
ex2 = prune_taxa(wh1, GP)
print(GP)
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 18988 taxa and 26 samples ]
## sample_data() Sample Data:       [ 26 samples by 8 sample variables ]
## tax_table()   Taxonomy Table:    [ 18988 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 18988 tips and 18987 internal nodes ]
print(ex2)
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 793 taxa and 26 samples ]
## sample_data() Sample Data:       [ 26 samples by 8 sample variables ]
## tax_table()   Taxonomy Table:    [ 793 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 793 tips and 792 internal nodes ]

Filtering low-variance OTUs

Suppose we wanted to use the variance of OTUs across samples as a condition for filtering. For example, to remove OTUs that do not change much across all (or most) samples. Note that the value of the variance is highly-dependent on the sequencing effort of each sample (the total number of reads sequenced from a particular sample). Thus, we must first transform the sample counts to relative abundance, which is shown in more detail in the next section. The following code will create a version of the GP dataset in which the abundance values have been transformed to relative abundance within each sample, and then OTUs have been filtered to keep only those with variance greater than 0.00001 (assuming we wanted to pick an arbitrary threshold in this way).

GPr = transform_sample_counts(GP, function(x) x/sum(x))
GPf = filter_taxa(GPr, function(x) var(x) > 1e-05, TRUE)

Transformations

Useful for: Standardization / Normalization / Smoothing / Shrinking. Second-order function: transform_sample_counts. Also the threshrank and threshrankfun functions.

For more advanced normalization features (like shrinkage, etc.), also consider features in the edgeR package, the DESeq package, and for standardization the decostand function in the vegan-package; as well as probably many others that could be useful in this context.

Graphics for Inference and Exploration

In the following section(s) we will illustrate using graphical tools provided by the phyloseq package. These are meant to be flexible ways to explore and summarize the data.

For further details, there is a collection of “show and tell” tutorial-pages describing the available graphics options supported by the phyloseq-package. These include example code for reproducing the figures shown. Many of the default settings are modifiable within the function arguments directly, and virtually everything about these plots can be further modified via the layers interface of ggplot2.

For quick reference (even though some have been described already), the key graphics-producing functions in phyloseq are:

plot_heatmap

plot_tree

plot_ordination

plot_network

plot_richness

plot_bar

Let's (re)load the Global Patterns dataset, prune the empty taxa, and add a custom sample variable called “human”, a logical indicating whether or not the samples are human-associated.

data(GlobalPatterns)  # Reload GlobalPatterns
GP = prune_taxa(taxa_sums(GlobalPatterns) > 0, GlobalPatterns)
sample_data(GP)$human = factor(get_variable(GP, "SampleType") %in% c("Feces", 
    "Mock", "Skin", "Tongue"))

We are going to show some examples that would take a lot of time to calculate and render on a dataset as large as GlobalPatterns. For the sake of time in re-running these examples, let's subset the data further to the most abundant 5 phyla.

top5ph = sort(tapply(taxa_sums(GP), tax_table(GP)[, "Phylum"], sum), TRUE)[1:5]
GP1 = subset_taxa(GP, Phylum %in% names(top5ph))

Distance Functions

Before we get to network plots, let's discuss distances. Many tools use distances to perform their calculations. In phyloseq, ordinations, heatmaps, and network plots all use the distance function for calculating OTU or Sample distance matrices (actually represented as a “dist” object) when needed, particularly when PCoA/MDS or NMDS is involved.

help("distance")  # Same as '?distance'
`?`(distance)

A relatively recent, popular distance method that relies heavily on the phylogenetic tree is UniFrac. It has been implemented in phyloseq as a fast parallel function, also wrapped by the distance function. See the phyloseq demo page specifically devoted to Fast Parallel UniFrac for further details, citations, and performance results.

data(esophagus)
distance(esophagus)  # unweighted UniFrac
##        B      C
## C 0.5176       
## D 0.5182 0.5422
distance(esophagus, weighted = TRUE)  # weighted UniFrac
##        B      C
## C 0.2035       
## D 0.2603 0.2477

Here are some other examples. There are some 45 or so methods.

distance(esophagus, "jaccard")  # vegdist jaccard
distance(esophagus, "bray")  # vegdist bray-curtis
distance(esophagus, "gower")  # vegdist option 'gower'
distance(esophagus, "g")  # designdist method option 'g'
distance(esophagus, "minkowski")  # invokes a method from the base dist() function.
distance(esophagus, "(A+B-2*J)/(A+B)")  # designdist custom distance
distance("help")
distance("list")

The plot_network function

The following code illustrates using the make_network and plot_network functions in phyloseq. In this context, we are using networks to graphically represent thresholded distances between samples or OTUs. The euclidean distances between points on the plot are essentially arbitrary, only the “edges” (lines) between “nodes” (OTUs/samples) are derived directly from the data. For further examples, it is recommended that you take a look at the plot_network tutorial

The GP variable is an only-slightly-modified version of the GlobalPatterns dataset. The threshold was determined empirically to show something interesting for demonstration. In practice, this value has a huge effect on the resulting network, and its usefulness, and it is highly recommended that you investigate the results from multiple values.

ig = make_network(GP, type = "samples", distance = "bray", max.dist = 0.85)
plot_network(ig, GP, color = "SampleType", shape = "human", line_weight = 0.4, 
    label = NULL)

plot of chunk plotnetwork-GPsamples

A similar network representation of samples from the “Enterotypes” dataset.

data(enterotype)
ig = make_network(enterotype, max.dist = 0.3)
plot_network(ig, enterotype, color = "SeqTech", shape = "Enterotype", line_weight = 0.4, 
    label = NULL)

plot of chunk plotnetwork-entsamples

An example showing a network representation of OTUs, representing communities of bacteria that occurr in similar profiles of samples.

data(GlobalPatterns)
# prune to just the top 100 most abundant OTUs across all samples (crude).
GP100 = prune_taxa(names(sort(taxa_sums(GlobalPatterns), TRUE))[1:100], GlobalPatterns)
jg = make_network(GP100, "taxa", "jaccard", 0.3)
plot_network(jg, GP100, "taxa", color = "Phylum", line_weight = 0.4, label = NULL)