DIY: public restroom bacteria

Goals of this demonstration

Highlight an interesting use of R (analyzing bacteria in restrooms)
Demonstrate for R users how to access publicly-available microbiome census data. In this case counts of bacteria from a survey of public restroom sufaces in 2011.
Scholarly reproduction / reanalysis of the findings/techniques in the original article, noting places where reproduction is difficult or impossible and where there may be weaknesses in the statistical approach.
Demonstrate the use of phyloseq and R for conducting reproducible statistical analysis of microbiome census data.
Demonstrate some of the powerful features of the ggplot2 package in creating graphics about this data.

The original article by Flores et al. was published in 2011 under an open access license by the journal PLoS ONE. As of the time of this writing, it has been viewed almost 22,000 times. Flores GE, Bates ST, Knights D, Lauber CL, Stombaugh J, et al. (2011) Microbial Biogeography of Public Restroom Surfaces. PLoS ONE 6(11): e28132.

The data is hosted at microbio.me/qiime, with Study ID 1335, Project Name Flores_restroom_surface_biogeography. The zipfile of the data is exposed at this FTP address – note that you may have to create an account and login to access the .zip file. Otherwise I would include downloading and unzipping as part of the R source code in this demo.

The time that this demo was built: Sun May 5 14:51:33 2013

Study design and findings

Here is an illustration of the study design (their Figure 3), namely, where in public restrooms they sampled; as well as a few overall observations that they included in their caption.

Cartoon illustrations of the relative abundance of discriminating taxa on public restroom surfaces. Light blue indicates low abundance while dark blue indicates high abundance of taxa. (Panel A) Although skin-associated taxa (Propionibacteriaceae, Corynebacteriaceae, Staphylococcaceae and Streptococcaceae) were abundant on all surfaces, they were relatively more abundant on surfaces routinely touched with hands. (Panel B) Gut-associated taxa (Clostridiales, Clostridiales group XI, Ruminococcaceae, Lachnospiraceae, Prevotellaceae and Bacteroidaceae) were most abundant on toilet surfaces. (Panel C) Although soil-associated taxa (Rhodobacteraceae, Rhizobiales, Microbacteriaceae and Nocardioidaceae) were in low abundance on all restroom surfaces, they were relatively more abundant on the floor of the restrooms we surveyed. Figure not drawn to scale.

You can also check out this image at the original article.

Load phyloseq and other packages

library("phyloseq")
packageVersion("phyloseq")

## [1] '1.5.8'

I also want to make some nice graphics using the ggplot2 package, so I will also load that and adjust its default theme

library("ggplot2")
packageVersion("ggplot2")

## [1] '0.9.3.1'

theme_set(theme_bw())

Import data

The microbio.me/qiime site has provided a zip file with two key data files. Namely, a file with the abundance data and some metadata, and a tab delimited text file with sample meta data. This following code chunk imports both files, and then combines them in one integrated phyloseq object that we will use in the remaining (re)analysis.

First you need to download the file directly from the link provided above, or use the zipfile we have mirrored in our phyloseq-demo GitHub repository.

Here is how you would download it from the FTP address to a random temporary file defined by your operating system. We will call this file zipfile.

zipftp = "ftp://thebeast.colorado.edu/pub/QIIME_DB_Public_Studies/study_1335_split_library_seqs_and_mapping.zip"
zipfile = tempfile("RestroomBiogeography")
download.file(zipftp, zipfile)

Since I have reposted the zipped data file on the phyloseq-demo repository on GitHub – in the same directory as the original source file for this post – the zipfile in this example can also be defined locally, as shown here:

zipfile = "study_1335_split_library_seqs_and_mapping.zip"

In either case, it is now possible with R to unzip to a randomly labeled temporary diretory created specific to your operating system, which we will name with the symbol import_dir. My understanding is that eventually temporary files and directories are deleted at system shutdown or startup. Probably system dependent exactly when they get deleted.

Here is how to create the temporary directory for the unpacked file(s) from the .zip file.

import_dir <- tempdir()
unzip(zipfile, exdir = import_dir)

Import from the .biom file.

biomfile = paste0(import_dir, "/study_1335_closed_reference_otu_table.biom")
biom = import_biom(biomfile, parseFunction = parse_taxonomy_greengenes)

Import from the sample data file.

sdfile = paste0(import_dir, "/study_1335_mapping_file.txt")
sample_metadata = import_qiime_sample_data(sdfile)

Combine (merge) the two data objects.

restroom = merge_phyloseq(biom, sample_metadata)

Now we have available a new combined data object, called restroom, that contains all the data we should need for this tutorial. We can query different features of restroom using the phyloseq API in the form of accessor functions/methods. We can also just print it to standard out and we get some summary information.

restroom

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 4467 taxa and 56 samples ]
## sample_data() Sample Data:       [ 56 samples by 42 sample variables ]
## tax_table()   Taxonomy Table:    [ 4467 taxa by 7 taxonomic ranks ]

rank_names(restroom)

## [1] "Kingdom" "Phylum"  "Class"   "Order"   "Family"  "Genus"   "Species"

sample_variables(restroom)

##  [1] "X.SampleID"                    "BarcodeSequence"              
##  [3] "LinkerPrimerSequence"          "TARGET_SUBFRAGMENT"           
##  [5] "ASSIGNED_FROM_GEO"             "EXPERIMENT_CENTER"            
##  [7] "TITLE"                         "RUN_PREFIX"                   
##  [9] "TAXON_ID"                      "DEPTH"                        
## [11] "COMMON_NAME"                   "ELEVATION"                    
## [13] "RUN_DATE"                      "COLLECTION_DATE"              
## [15] "GENDER"                        "ALTITUDE"                     
## [17] "LEVEL"                         "ENV_BIOME"                    
## [19] "PLATFORM"                      "COUNTRY"                      
## [21] "SAMPLE_CENTER"                 "SAMP_SIZE"                    
## [23] "FLOOR"                         "LONGITUDE"                    
## [25] "STUDY_ID"                      "EXPERIMENT_DESIGN_DESCRIPTION"
## [27] "Description_duplicate"         "SEQUENCING_METH"              
## [29] "ENV_MATTER"                    "TARGET_GENE"                  
## [31] "BUILDING"                      "ENV_FEATURE"                  
## [33] "KEY_SEQ"                       "AGE_IN_YEARS"                 
## [35] "RUN_CENTER"                    "SURFACE"                      
## [37] "PCR_PRIMERS"                   "LIBRARY_CONSTRUCTION_PROTOCOL"
## [39] "LATITUDE"                      "REGION"                       
## [41] "STUDY_CENTER"                  "Description"

levels(sample_data(restroom)$BUILDING)

## [1] "Ekeley" "Porter"

levels(sample_data(restroom)$SURFACE)

##  [1] "Door in"             "Door out"            "Faucet handles"     
##  [4] "Sink floor"          "Soap dispenser"      "Stall in"           
##  [7] "Stall out"           "Toilet Floor"        "Toilet flush handle"
## [10] "Toilet seat"         "Water"

Okay, let's move on to preprocessing the data as needed, and reproducing the figures in the article.

Preprocessing

The phyloseq package includes a tutorial on preprocessing microiome census data with several broad and flexible methods. Here we want to use one of those preprocessing approaches as well as the one used by Flores et al. in this article. Unfortunately, they used random subsampling of their data with on reported seed or random number generation method, so it is impossible to exactly recapitulate their randomly subsampled data. However, we can do our best using a function included in phyloseq for exactly this purpose, rarefy_even_depth.

Are there any OTUs included in this dataset that have no counted reads at all prior to preprocessing? If so, how many?

any(taxa_sums(restroom) == 0)

## [1] TRUE

sum(taxa_sums(restroom) == 0)

## [1] 1040

So of the 4467 OTUs in the dataset, there were 1040 OTUs that were apparently not observed even once in any of the samples (20%). This sounds very odd, but more likely this is evidence of some data “massaging” that removed the abundance values of those taxa, but their entries in the .biom file are inexplicably included.

Here I remove those “unobserved” OTUs. Note how I first save the originally-imported data as restroom0 in case I want to go back to it after modifying/preprocessing restroom.

restroom0 = restroom
restroom = prune_taxa(taxa_sums(restroom) > 0, restroom)

Any empty samples? Apparently not, though the code for pruning “empty” samples is also shown. And procedes quickly since there is nothing in restroom to modify.

any(sample_sums(restroom) == 0)

## [1] FALSE

restroom = prune_samples(sample_sums(restroom) > 0, restroom)

What about the total reads per sample, and what does the distribution look like?

readsumsdf = data.frame(nreads = sort(taxa_sums(restroom), TRUE), sorted = 1:ntaxa(restroom), 
    type = "OTUs")
readsumsdf = rbind(readsumsdf, data.frame(nreads = sort(sample_sums(restroom), 
    TRUE), sorted = 1:nsamples(restroom), type = "Samples"))
title = "Total number of reads"
p = ggplot(readsumsdf, aes(x = sorted, y = nreads)) + geom_bar(stat = "identity")
p + ggtitle(title) + scale_y_log10() + facet_wrap(~type, 1, scales = "free")

plot of chunk unnamed-chunk-1

This looks pretty typical for the distribution of reads from an amplicon-based microbiome census, if not even surprisingly evenly distributed across most samples… I've seen much, much worse. In any case, one of the main preprocessing steps we will do is to transform the abundances of each sample in a way that will make the represented “sequencing effort”, AKA the sum of the counts, the same across all samples. As mentioned earlier, I will redo a random subsampling (“rarefying”) to 500 reads as the original authors did; and, separately, I will also simply transform the abundance vectors of each sample by multiplying the OTU proportion within each sample by the arbitrary value, 500 used in the random subsampling. We can later compare the how the two datasets perform.

First, I will remove the samples with a total number of reads below the threshold set by the original uauthors of 500.

restroom = prune_samples(sample_sums(restroom) > 500, restroom)

This resulted in the removal of 1 sample(s) from the dataset.

Transform to equal sample sum

Rarefy We will call this restroomR. The way to accomplish this in phyloseq is to first define a random number seed so that others can reproduce your subsetted data, and then use the rarefy_even_depth function. Here I use the “ePub” number assigned to this article by PLoS ONE as the seed. Seemed appropriate, but totally arbitrary.

set.seed(28132)
restroomR = rarefy_even_depth(restroom, sample.size = 500)

And now an alternative to random subsampling, a simple proportional transformation, calling it restroomP.

restroomP = transform_sample_counts(restroom, function(x) 500 * x/sum(x))

For sanity-check, let's replot the sample sums of each of these new data objects, to convince ourselves that all of the samples now sum to 500.

par(mfrow = c(1, 2))
title = "Sum of reads for each sample, restroomR"
plot(sort(sample_sums(restroomR), TRUE), type = "h", main = title, ylab = "reads", 
    ylim = c(0, 1000))
title = "Sum of reads for each sample, restroomP"
plot(sort(sample_sums(restroomP), TRUE), type = "h", main = title, ylab = "reads", 
    ylim = c(0, 1000))

plot of chunk sample-sums-transformed

Figure 1

Original Figure

Original Figure caption, quoted here: “Figure 1. Taxonomic composition of bacterial communities associated with public restroom surfaces.(A) Average composition of bacterial communities associated with restroom surfaces and potential source environments. (B) Taxonomic differences were observed between some surfaces in male and female restrooms. Only the 19 most abundant taxa are shown. For a more detailed taxonomic breakdown by gender including some of the variation see Supplemental Table S2.”

Figure 1 - Reproduced

I will use the phyloseq plotting function plot_bar to reproduce Figure 1. I have taken liberties to improve the representation of the data to make certain features more comparable.

First, they have mixed taxonomic ranks in this plot between family and order, but most are family so I will use that. Furthermore, they showed only the most-abundant 19 OTUs, so we will add a variant taxonomic rank to the data, called Family19, that will only have non-NA values if the OTU is among those top 19.

top19otus = names(sort(taxa_sums(restroomR), TRUE)[1:19])
taxtab19 = cbind(tax_table(restroomR), family19 = NA)
taxtab19[top19otus, "family19"] <- as(tax_table(restroomR)[top19otus, "Family"], 
    "character")
tax_table(restroomR) <- tax_table(taxtab19)

Now create the abundance bar plot of restroomR using the plot_bar function.

title = "Figure 1 Part A (remake), attempt 1"
plot_bar(restroomR, "SURFACE", fill = "family19", title = title) + coord_flip()

plot of chunk Figure1-redo-01-A

Hey that was easy! But the abundance values get over 500. How come? Oh, each of these categories includes more than one sample, so we expect them to exceed 500. In fact, you can see that the totals are multiples of 500. This is easy to fix by re-doing our earlier transformation on restroomR after mergeing by category. The phyloseq package includes a function specifically for doing this kind of categorical merge, called merge_samples.

restroomRm = merge_samples(restroomR, "SURFACE")

Repair the merged values associated with each surface after merge.

sample_data(restroomRm)$SURFACE <- levels(sample_data(restroomR)$SURFACE)

Transform to percentages of total available.

restroomRm = transform_sample_counts(restroomRm, function(x) 100 * x/sum(x))

Now we can plot Figure 1A with each category on equivalent footing.

title = "Figure 1 Part A (remake), attempt 2"
plot_bar(restroomRm, "SURFACE", fill = "family19", title = title) + coord_flip() + 
    ylab("Percentage of Sequences")

plot of chunk Figure1-redo-01-A-2

To better match Figure 1 Panel A from the original article, I can remove the gray rectangles that represent OTUs that were not among the most abundant 19.

restroomRm19 = prune_taxa(top19otus, restroomRm)
title = "Figure 1 Part A (remake), attempt 3"
plot_bar(restroomRm19, "SURFACE", fill = "family19", title = title) + coord_flip() + 
    ylab("Percentage of Sequences") + ylim(0, 100)

plot of chunk prune-non-19

Re-plot of Figure 1A from original article using phyloseq. Each rectangle is a different OTU. Since these are multi-sample categories, rather than simply samples, the length of each bar represents the mean fraction abundance of that OTU among the normalized samples in that category. Multiple rectangles with the same color and category represent relative abundances of different OTUs in the same taxonomic family. Gray boxes are OTUs that were among the top 19 most abundant, but did not have a taxonomic family classification in the taxonomy table.

Figure 1, Panel B – Gender

What about separating by gender?

First, we will merge samples by a dummy variable representing both GENDER and SURFACES, repair flattened factors, and transform to proportional abundances.

sample_data(restroomR)$gensurf <- paste0(sample_data(restroomR)$GENDER, sample_data(restroomR)$SURFACE)
restroomRgsm = merge_samples(restroomR, "gensurf")

## Warning: NAs introduced by coercion

# repair factors
sample_data(restroomRgsm)$SURFACE <- levels(sample_data(restroomR)$SURFACE)[get_variable(restroomRgsm, 
    "SURFACE")]
sample_data(restroomRgsm)$GENDER <- levels(sample_data(restroomR)$GENDER)[get_variable(restroomRgsm, 
    "GENDER")]
# transform to proportions
restroomRgsmp = transform_sample_counts(restroomRgsm, function(x) 100 * x/sum(x))

Now plot with panelling by surface for an easier gender-wise comparison than in the original figure.

restroomRgsm19 = prune_taxa(top19otus, restroomRgsmp)
title = "Figure 1 Part B, Restroom Surface and Gender"
p = plot_bar(restroomRgsm19, "GENDER", fill = "family19", title = title) + coord_flip() + 
    labs(colour = "family")
p + facet_wrap(~SURFACE)

plot of chunk Figure-01-partB-v01

Figure 2 - Dimensional Reduction

The original authors use Multidimensional Scaling (also called Principle Coordinates Analysis) to decompose a pairwise distance matrix between all the microbial samples. The original distances are calculated by what microbial ecologists call the “unweighted UniFrac” distance. This approach is useful for graphically displaying high-level trends across all the samples in the dataset, usually by overlaying additional information about each sample, for instance, where it came from.

Unfortunately, calculating the UniFrac distance between each sample requires having an evolutionary (phylogenetic) tree of the bacterial species in the dataset, and this wasn't provided. We will instead attempt to create an equally-illuminating scatterplot using a different distance matrix.

Figure 2 - Original Article

Figure 2. Relationship between bacterial communities associated with ten public restroom surfaces. Communities were clustered using PCoA of the unweighted UniFrac distance matrix. Each point represents a single sample. Note that the floor (triangles) and toilet (asterisks) surfaces form clusters distinct from surfaces touched with hands.

Figure 2 - Reproduced

The multidimensional scaling of a UniFrac distance matrix is one instance of a larger category of what we call ordination methods. These methods attempt to decompose the variability of higher-dimensional data into a smaller number of orthogonal (perpendicular) axes that turn out to be pretty useful for plotting and certain clustering methods.

title = "Figure 2 remake, PCoA of Bray-Curtis distance"
restroom_ord = ordinate(restroomR, "PCoA", "bray")
p = plot_ordination(restroomR, restroom_ord, color = "SURFACE", shape = "GENDER")
p = p + geom_point(size = 6, alpha = 0.7) + ggtitle(title)
p

plot of chunk phyloseq-ordinate-figure02

p + facet_wrap(~SURFACE)

plot of chunk phyloseq-ordinate-figure02

p + facet_wrap(~GENDER)  #, nrow=3)

Table 1

Table 1 original article. ANOSIM test. Results of pairwise comparisons for unweighted UniFrac distances of bacterial communities associated with various surfaces of public restrooms on the University of Colorado campus using the ANOSIM test in Primer v6.

In addition to Primer v6, the ANOSIM method is implemented in the vegan package in R, and described nicely in the anosim documentation. It is designed to operate directly on a dissimilarity (distance) matrix, so it is unclear why the authors first decomposed their unweighted UniFrac distance matrix with Principle Coordinate Analysis, and then provided those coordinates (presumably from just the first two axes?) to the ANOSIM implementation in Primer v6.

Important note about multiple testing The anosim function performs a non-parametric test of the significance of the sample-grouping you provide against a permutation-based null distribution, generated by randomly permuting the sample labels many times (999 permutations is the default, used here). The table shown in the original article appears to be the result of the 45 two-surface (pairwise) anosim tests, excluding water. The precise details are not explained, and we cannot repeat them here because (1) we cannot exactly reproduce the randomly subsampled (“rarefied”) abundance data from the somewhat raw abundance matrix that is hosted on microbio.me, and more importantly (2) the dataset does not include the phylogenetic tree that is necessary for calculating the unweighted UniFrac distance matrix. Critically, the authors made no mention of any correction for having conducted 45 simultaneous tests. They did take a slightly stricter-than-usual threshold for significance of P <= 0.01, but we don't know without a formal correction if the tests with P close to 0.01 are actually significant (even at the more common 0.05 level), and the P-values themselves were not reported, just the statistic with and without asterisk.

Here is how to re-perform anosim using R and the vegan package:

library("vegan")
packageVersion("vegan")

## [1] '2.0.7'

First look at results for surface types.

surface_group = get_variable(restroomR, "SURFACE")
surface_ano = anosim(distance(restroomR, "bray"), surface_group)
surface_ano$signif

## [1] 0.001

surface_ano$statistic

## [1] 0.5514

Now test bathroom genders. First remove samples for which there was no gender to make this a pairwise test.

restroomRg = subset_samples(restroomR, !GENDER == "None")
gender_group = get_variable(restroomRg, "GENDER")
gender_ano = anosim(distance(restroomRg, "bray"), gender_group)
gender_ano$signif

## [1] 0.118

gender_ano$statistic

## [1] 0.04061

Here we settled for a Bray-Curtis distance matrix for our own version of just two anosim calculations, one of which would not be considered significant by most standards (gender).

Add a demonstration of a multiple-hypothesis correction?

I considered adding a demonstration of how to add a multiple correction for the approach the authors took, but I'm not convinced (yet) that this was the optimal statistical approach to arrive at the knowledge that they were seeking. For one thing, it quickly loses power as the number of categories increases because of the larger number of corrections, not to mention it becomes difficult to calculate. Practicality aside, even the documentation for anosim in R suggests users use adonis (a permutation MANOVA, also in the vegan package) as “a more robust alternative”.

df = as(sample_data(restroomR), "data.frame")
d = distance(restroomR, "bray")
restroomadonis = adonis(d ~ SURFACE + GENDER + BUILDING + FLOOR, df)
restroomadonis

## 
## Call:
## adonis(formula = d ~ SURFACE + GENDER + BUILDING + FLOOR, data = df) 
## 
## Terms added sequentially (first to last)
## 
##           Df SumsOfSqs MeanSqs F.Model    R2 Pr(>F)    
## SURFACE   10      8.19   0.819    4.12 0.470  0.001 ***
## GENDER     1      0.50   0.504    2.53 0.029  0.007 ** 
## BUILDING   1      0.31   0.310    1.56 0.018  0.081 .  
## FLOOR      2      0.47   0.236    1.19 0.027  0.212    
## Residuals 40      7.96   0.199         0.456           
## Total     54     17.44                 1.000           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

plot(restroomadonis$aov.tab)

plot of chunk adonis

Figure 4 – Microbial Source Tracking

Figure 4 – Original Figure

Soil source

The soil source data was taken from this 2009 article by Lauber et al. in AEM

SourceTracker

The authors describe using an algorithm called “SourceTracker”, previously described in Knights et al. (2011) Nature Methods, and provided as a commented R script published at sourceforge/sourcetracker.

In perhaps another post, or a continuation of this one, I will include their code and apply it to the available data here. One additional challenge is the availability of the refernce data, which was not repackaged or republished with this dataset. Hopefully it is all still available.

Feedback welcome!

Please feel free to post comments, suggestions. Also, ideas for other demos are strongly encouraged. A good place for that is probably the issue tracker for the phyloseq-demo repository where this document is versioned and where many other phyloseq-related demos are hosted.

Thanks to original authors

Thanks to original authors for an interesting study and for making (most of) their data so publicly available. I just subjected it to some scholarly scrutiny, which is a good thing (and I hope they agree); and only possible because of what they made available. If you like the idea of being able to re-analyze this and other studies, then clamor for better reproducible research, which should hopefully the original and intermediate data as well as analysis source code … and strive to do the same in your own publications. We will all do better, faster work because of it :)