Originally hosted here
The original article by Flores et al. was published in 2011 under an open access license by the journal PLoS ONE. As of the time of this writing, it has been viewed almost 22,000 times. Flores GE, Bates ST, Knights D, Lauber CL, Stombaugh J, et al. (2011) Microbial Biogeography of Public Restroom Surfaces. PLoS ONE 6(11): e28132.
The data is hosted at microbio.me/qiime, with Study ID 1335, Project Name
Flores_restroom_surface_biogeography. The zipfile of the data is exposed at this FTP address – note that you may have to create an account and login to access the
.zip file. Otherwise I would include downloading and unzipping as part of the R source code in this demo.
The time that this demo was built: Sun May 5 14:51:33 2013
Here is an illustration of the study design (their Figure 3), namely, where in public restrooms they sampled; as well as a few overall observations that they included in their caption.
You can also check out this image at the original article.
##  '1.5.8'
I also want to make some nice graphics using the ggplot2 package, so I will also load that and adjust its default theme
##  '0.9.3.1'
The microbio.me/qiime site has provided a zip file with two key data files. Namely, a file with the abundance data and some metadata, and a tab delimited text file with sample meta data. This following code chunk imports both files, and then combines them in one integrated phyloseq object that we will use in the remaining (re)analysis.
First you need to download the file directly from the link provided above, or use the zipfile we have mirrored in our phyloseq-demo GitHub repository.
Here is how you would download it from the FTP address to a random temporary file defined by your operating system. We will call this file
zipftp = "ftp://thebeast.colorado.edu/pub/QIIME_DB_Public_Studies/study_1335_split_library_seqs_and_mapping.zip" zipfile = tempfile("RestroomBiogeography") download.file(zipftp, zipfile)
Since I have reposted the zipped data file on the phyloseq-demo repository on GitHub – in the same directory as the original source file for this post – the
zipfile in this example can also be defined locally, as shown here:
zipfile = "study_1335_split_library_seqs_and_mapping.zip"
In either case, it is now possible with R to unzip to a randomly labeled temporary diretory created specific to your operating system, which we will name with the symbol
import_dir. My understanding is that eventually temporary files and directories are deleted at system shutdown or startup. Probably system dependent exactly when they get deleted.
Here is how to create the temporary directory for the unpacked file(s) from the
import_dir <- tempdir() unzip(zipfile, exdir = import_dir)
Import from the
biomfile = paste0(import_dir, "/study_1335_closed_reference_otu_table.biom") biom = import_biom(biomfile, parseFunction = parse_taxonomy_greengenes)
Import from the sample data file.
sdfile = paste0(import_dir, "/study_1335_mapping_file.txt") sample_metadata = import_qiime_sample_data(sdfile)
Combine (merge) the two data objects.
restroom = merge_phyloseq(biom, sample_metadata)
Now we have available a new combined data object, called
restroom, that contains all the data we should need for this tutorial. We can query different features of
restroom using the phyloseq API in the form of accessor functions/methods. We can also just print it to standard out and we get some summary information.
## phyloseq-class experiment-level object ## otu_table() OTU Table: [ 4467 taxa and 56 samples ] ## sample_data() Sample Data: [ 56 samples by 42 sample variables ] ## tax_table() Taxonomy Table: [ 4467 taxa by 7 taxonomic ranks ]
##  "Kingdom" "Phylum" "Class" "Order" "Family" "Genus" "Species"
##  "X.SampleID" "BarcodeSequence" ##  "LinkerPrimerSequence" "TARGET_SUBFRAGMENT" ##  "ASSIGNED_FROM_GEO" "EXPERIMENT_CENTER" ##  "TITLE" "RUN_PREFIX" ##  "TAXON_ID" "DEPTH" ##  "COMMON_NAME" "ELEVATION" ##  "RUN_DATE" "COLLECTION_DATE" ##  "GENDER" "ALTITUDE" ##  "LEVEL" "ENV_BIOME" ##  "PLATFORM" "COUNTRY" ##  "SAMPLE_CENTER" "SAMP_SIZE" ##  "FLOOR" "LONGITUDE" ##  "STUDY_ID" "EXPERIMENT_DESIGN_DESCRIPTION" ##  "Description_duplicate" "SEQUENCING_METH" ##  "ENV_MATTER" "TARGET_GENE" ##  "BUILDING" "ENV_FEATURE" ##  "KEY_SEQ" "AGE_IN_YEARS" ##  "RUN_CENTER" "SURFACE" ##  "PCR_PRIMERS" "LIBRARY_CONSTRUCTION_PROTOCOL" ##  "LATITUDE" "REGION" ##  "STUDY_CENTER" "Description"
##  "Ekeley" "Porter"
##  "Door in" "Door out" "Faucet handles" ##  "Sink floor" "Soap dispenser" "Stall in" ##  "Stall out" "Toilet Floor" "Toilet flush handle" ##  "Toilet seat" "Water"
Okay, let's move on to preprocessing the data as needed, and reproducing the figures in the article.
The phyloseq package includes a tutorial on preprocessing microiome census data with several broad and flexible methods. Here we want to use one of those preprocessing approaches as well as the one used by Flores et al. in this article. Unfortunately, they used random subsampling of their data with on reported seed or random number generation method, so it is impossible to exactly recapitulate their randomly subsampled data. However, we can do our best using a function included in phyloseq for exactly this purpose,
Are there any OTUs included in this dataset that have no counted reads at all prior to preprocessing? If so, how many?
any(taxa_sums(restroom) == 0)
##  TRUE
sum(taxa_sums(restroom) == 0)
##  1040
So of the 4467 OTUs in the dataset, there were 1040 OTUs that were apparently not observed even once in any of the samples (20%). This sounds very odd, but more likely this is evidence of some data “massaging” that removed the abundance values of those taxa, but their entries in the
.biom file are inexplicably included.
Here I remove those “unobserved” OTUs. Note how I first save the originally-imported data as
restroom0 in case I want to go back to it after modifying/preprocessing
restroom0 = restroom restroom = prune_taxa(taxa_sums(restroom) > 0, restroom)
Any empty samples? Apparently not, though the code for pruning “empty” samples is also shown. And procedes quickly since there is nothing in
restroom to modify.
any(sample_sums(restroom) == 0)
##  FALSE
restroom = prune_samples(sample_sums(restroom) > 0, restroom)
What about the total reads per sample, and what does the distribution look like?
readsumsdf = data.frame(nreads = sort(taxa_sums(restroom), TRUE), sorted = 1:ntaxa(restroom), type = "OTUs") readsumsdf = rbind(readsumsdf, data.frame(nreads = sort(sample_sums(restroom), TRUE), sorted = 1:nsamples(restroom), type = "Samples")) title = "Total number of reads" p = ggplot(readsumsdf, aes(x = sorted, y = nreads)) + geom_bar(stat = "identity") p + ggtitle(title) + scale_y_log10() + facet_wrap(~type, 1, scales = "free")
This looks pretty typical for the distribution of reads from an amplicon-based microbiome census, if not even surprisingly evenly distributed across most samples… I've seen much, much worse. In any case, one of the main preprocessing steps we will do is to transform the abundances of each sample in a way that will make the represented “sequencing effort”, AKA the sum of the counts, the same across all samples. As mentioned earlier, I will redo a random subsampling (“rarefying”) to 500 reads as the original authors did; and, separately, I will also simply transform the abundance vectors of each sample by multiplying the OTU proportion within each sample by the arbitrary value,
500 used in the random subsampling. We can later compare the how the two datasets perform.
First, I will remove the samples with a total number of reads below the threshold set by the original uauthors of
restroom = prune_samples(sample_sums(restroom) > 500, restroom)
This resulted in the removal of 1 sample(s) from the dataset.
We will call this
restroomR. The way to accomplish this in phyloseq is to first define a random number seed so that others can reproduce your subsetted data, and then use the
rarefy_even_depth function. Here I use the “ePub” number assigned to this article by PLoS ONE as the seed. Seemed appropriate, but totally arbitrary.
set.seed(28132) restroomR = rarefy_even_depth(restroom, sample.size = 500)
And now an alternative to random subsampling, a simple proportional transformation, calling it
restroomP = transform_sample_counts(restroom, function(x) 500 * x/sum(x))
For sanity-check, let's replot the sample sums of each of these new data objects, to convince ourselves that all of the samples now sum to 500.
par(mfrow = c(1, 2)) title = "Sum of reads for each sample, restroomR" plot(sort(sample_sums(restroomR), TRUE), type = "h", main = title, ylab = "reads", ylim = c(0, 1000)) title = "Sum of reads for each sample, restroomP" plot(sort(sample_sums(restroomP), TRUE), type = "h", main = title, ylab = "reads", ylim = c(0, 1000))
I will use the phyloseq plotting function
plot_bar to reproduce Figure 1. I have taken liberties to improve the representation of the data to make certain features more comparable.
First, they have mixed taxonomic ranks in this plot between family and order, but most are family so I will use that. Furthermore, they showed only the most-abundant 19 OTUs, so we will add a variant taxonomic rank to the data, called
Family19, that will only have non-NA values if the OTU is among those top 19.
top19otus = names(sort(taxa_sums(restroomR), TRUE)[1:19]) taxtab19 = cbind(tax_table(restroomR), family19 = NA) taxtab19[top19otus, "family19"] <- as(tax_table(restroomR)[top19otus, "Family"], "character") tax_table(restroomR) <- tax_table(taxtab19)
Now create the abundance bar plot of
restroomR using the
title = "Figure 1 Part A (remake), attempt 1" plot_bar(restroomR, "SURFACE", fill = "family19", title = title) + coord_flip()
Hey that was easy! But the abundance values get over 500. How come? Oh, each of these categories includes more than one sample, so we expect them to exceed 500. In fact, you can see that the totals are multiples of 500. This is easy to fix by re-doing our earlier transformation on
restroomR after mergeing by category. The phyloseq package includes a function specifically for doing this kind of categorical merge, called
restroomRm = merge_samples(restroomR, "SURFACE")
Repair the merged values associated with each surface after merge.
sample_data(restroomRm)$SURFACE <- levels(sample_data(restroomR)$SURFACE)
Transform to percentages of total available.
restroomRm = transform_sample_counts(restroomRm, function(x) 100 * x/sum(x))
Now we can plot Figure 1A with each category on equivalent footing.
title = "Figure 1 Part A (remake), attempt 2" plot_bar(restroomRm, "SURFACE", fill = "family19", title = title) + coord_flip() + ylab("Percentage of Sequences")
To better match Figure 1 Panel A from the original article, I can remove the gray rectangles that represent OTUs that were not among the most abundant 19.
restroomRm19 = prune_taxa(top19otus, restroomRm) title = "Figure 1 Part A (remake), attempt 3" plot_bar(restroomRm19, "SURFACE", fill = "family19", title = title) + coord_flip() + ylab("Percentage of Sequences") + ylim(0, 100)
What about separating by gender?
First, we will merge samples by a dummy variable representing both
SURFACES, repair flattened factors, and transform to proportional abundances.
sample_data(restroomR)$gensurf <- paste0(sample_data(restroomR)$GENDER, sample_data(restroomR)$SURFACE) restroomRgsm = merge_samples(restroomR, "gensurf")
## Warning: NAs introduced by coercion
# repair factors sample_data(restroomRgsm)$SURFACE <- levels(sample_data(restroomR)$SURFACE)[get_variable(restroomRgsm, "SURFACE")] sample_data(restroomRgsm)$GENDER <- levels(sample_data(restroomR)$GENDER)[get_variable(restroomRgsm, "GENDER")] # transform to proportions restroomRgsmp = transform_sample_counts(restroomRgsm, function(x) 100 * x/sum(x))
Now plot with panelling by surface for an easier gender-wise comparison than in the original figure.
restroomRgsm19 = prune_taxa(top19otus, restroomRgsmp) title = "Figure 1 Part B, Restroom Surface and Gender" p = plot_bar(restroomRgsm19, "GENDER", fill = "family19", title = title) + coord_flip() + labs(colour = "family") p + facet_wrap(~SURFACE)
The original authors use Multidimensional Scaling (also called Principle Coordinates Analysis) to decompose a pairwise distance matrix between all the microbial samples. The original distances are calculated by what microbial ecologists call the “unweighted UniFrac” distance. This approach is useful for graphically displaying high-level trends across all the samples in the dataset, usually by overlaying additional information about each sample, for instance, where it came from.
Unfortunately, calculating the UniFrac distance between each sample requires having an evolutionary (phylogenetic) tree of the bacterial species in the dataset, and this wasn't provided. We will instead attempt to create an equally-illuminating scatterplot using a different distance matrix.
Figure 2. Relationship between bacterial communities associated with ten public restroom surfaces. Communities were clustered using PCoA of the unweighted UniFrac distance matrix. Each point represents a single sample. Note that the floor (triangles) and toilet (asterisks) surfaces form clusters distinct from surfaces touched with hands.
The multidimensional scaling of a UniFrac distance matrix is one instance of a larger category of what we call ordination methods. These methods attempt to decompose the variability of higher-dimensional data into a smaller number of orthogonal (perpendicular) axes that turn out to be pretty useful for plotting and certain clustering methods.
title = "Figure 2 remake, PCoA of Bray-Curtis distance" restroom_ord = ordinate(restroomR, "PCoA", "bray") p = plot_ordination(restroomR, restroom_ord, color = "SURFACE", shape = "GENDER") p = p + geom_point(size = 6, alpha = 0.7) + ggtitle(title) p
p + facet_wrap(~SURFACE)
p + facet_wrap(~GENDER) #, nrow=3)
In addition to
Primer v6, the ANOSIM method is implemented in the vegan package in R, and described nicely in the anosim documentation. It is designed to operate directly on a dissimilarity (distance) matrix, so it is unclear why the authors first decomposed their unweighted UniFrac distance matrix with Principle Coordinate Analysis, and then provided those coordinates (presumably from just the first two axes?) to the ANOSIM implementation in Primer v6.
Important note about multiple testing
anosim function performs a non-parametric test of the significance of the sample-grouping you provide against a permutation-based null distribution, generated by randomly permuting the sample labels many times (999 permutations is the default, used here). The table shown in the original article appears to be the result of the 45 two-surface (pairwise)
anosim tests, excluding water. The precise details are not explained, and we cannot repeat them here because (1) we cannot exactly reproduce the randomly subsampled (“rarefied”) abundance data from the somewhat raw abundance matrix that is hosted on
microbio.me, and more importantly (2) the dataset does not include the phylogenetic tree that is necessary for calculating the unweighted UniFrac distance matrix. Critically, the authors made no mention of any correction for having conducted 45 simultaneous tests. They did take a slightly stricter-than-usual threshold for significance of
P <= 0.01, but we don't know without a formal correction if the tests with
P close to
0.01 are actually significant (even at the more common
0.05 level), and the P-values themselves were not reported, just the statistic with and without asterisk.
Here is how to re-perform
anosim using R and the vegan package:
##  '2.0.7'
First look at results for surface types.
surface_group = get_variable(restroomR, "SURFACE") surface_ano = anosim(distance(restroomR, "bray"), surface_group) surface_ano$signif
##  0.001
##  0.5514
Now test bathroom genders. First remove samples for which there was no gender to make this a pairwise test.
restroomRg = subset_samples(restroomR, !GENDER == "None") gender_group = get_variable(restroomRg, "GENDER") gender_ano = anosim(distance(restroomRg, "bray"), gender_group) gender_ano$signif
##  0.118
##  0.04061
Here we settled for a Bray-Curtis distance matrix for our own version of just two
anosim calculations, one of which would not be considered significant by most standards (gender).
I considered adding a demonstration of how to add a multiple correction for the approach the authors took, but I'm not convinced (yet) that this was the optimal statistical approach to arrive at the knowledge that they were seeking. For one thing, it quickly loses power as the number of categories increases because of the larger number of corrections, not to mention it becomes difficult to calculate. Practicality aside, even the documentation for
anosim in R suggests users use
adonis (a permutation MANOVA, also in the vegan package) as “a more robust alternative”.
df = as(sample_data(restroomR), "data.frame") d = distance(restroomR, "bray") restroomadonis = adonis(d ~ SURFACE + GENDER + BUILDING + FLOOR, df) restroomadonis
## ## Call: ## adonis(formula = d ~ SURFACE + GENDER + BUILDING + FLOOR, data = df) ## ## Terms added sequentially (first to last) ## ## Df SumsOfSqs MeanSqs F.Model R2 Pr(>F) ## SURFACE 10 8.19 0.819 4.12 0.470 0.001 *** ## GENDER 1 0.50 0.504 2.53 0.029 0.007 ** ## BUILDING 1 0.31 0.310 1.56 0.018 0.081 . ## FLOOR 2 0.47 0.236 1.19 0.027 0.212 ## Residuals 40 7.96 0.199 0.456 ## Total 54 17.44 1.000 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The soil source data was taken from this 2009 article by Lauber et al. in AEM
The authors describe using an algorithm called “SourceTracker”, previously described in Knights et al. (2011) Nature Methods, and provided as a commented R script published at sourceforge/sourcetracker.
In perhaps another post, or a continuation of this one, I will include their code and apply it to the available data here. One additional challenge is the availability of the refernce data, which was not repackaged or republished with this dataset. Hopefully it is all still available.
Please feel free to post comments, suggestions. Also, ideas for other demos are strongly encouraged. A good place for that is probably the issue tracker for the phyloseq-demo repository where this document is versioned and where many other phyloseq-related demos are hosted.
Thanks to original authors for an interesting study and for making (most of) their data so publicly available. I just subjected it to some scholarly scrutiny, which is a good thing (and I hope they agree); and only possible because of what they made available. If you like the idea of being able to re-analyze this and other studies, then clamor for better reproducible research, which should hopefully the original and intermediate data as well as analysis source code … and strive to do the same in your own publications. We will all do better, faster work because of it :)