Example importing into phyloseq the files produced by Qiime run on the HMPv35 dataset, which is avilable from HMP-DACC. Even more specifically this data is from the HMP QIIME Community Profiling data page. This is important to note because there are also “parallel versions” of this data generated by mothur which could also be imported by phyloseq; and there are also a growing number of data links and displays added to the main HMP-DACC page that might otherwise obfuscate where this data came from. However, the precise links are shown in the code below. In this way we are demonstrating both the import_qiime
function in phyloseq, as well as the import of web-facing data in general using R.
Note that this is a document produced by the markdown package from an R Markdown file. Markdown is a simple formatting syntax for authoring web pages, and in this case the structure includes reproducible code.
library("phyloseq")
packageVersion("phyloseq")
## [1] '1.7.21'
Because this demo includes an example of manually importing a Fasta-formatted DNA sequence file using a function from the Biostrings package, we will need to load that as well (it is already a phyloseq dependency, so nothing exotic here)
library("Biostrings")
packageVersion("Biostrings")
## [1] '2.30.1'
As data sources could be updated (or their links removed), and the phyloseq package is itself under active development, I am posting here the date/time stamp that this demo was built, as extra compatibility metadata to supplement the package version numbers shown above
date()
## [1] "Mon Mar 3 20:09:03 2014"
Note that R's file handling is sophisticated enough that you do not need to unpack/unzip the .gz
- or .bz2
-compressed files. They are understood automatically.
Create a temporary local file where there large otu-tax file will be downloaded and read-from.
temp_otutax = tempfile()
otu_url = "http://downloads.hmpdacc.org/data/HMQCP/otu_table_psn_v35.txt.gz"
download.file(otu_url, temp_otutax)
temp_map = tempfile()
map_url = "http://downloads.hmpdacc.org/data/HMQCP/v35_map_uniquebyPSN.txt.bz2"
download.file(map_url, temp_map)
temp_map = bzfile(temp_map)
temp_tree = tempfile()
tree_url = "http://downloads.hmpdacc.org/data/HMQCP/rep_set_v35.tre.gz"
download.file(tree_url, temp_tree)
It turns out that the tree has weird quotes added around tip names, for example. The following will read the .gz
tree directly (no extra uncompression step needed).
tree = read_tree(temp_tree)
head(taxa_names(tree))
## [1] "'OTU_97.15099'" "'OTU_97.13686'" "'OTU_97.30326'" "'OTU_97.26112'"
## [5] "'OTU_97.34719'" "'OTU_97.12776'"
These extra quotes ( "'"
) must be removed/replaced. This can be accomplished using the following gsub
command. Even for 45364 OTUs this is very fast.
tree$tip.label <- gsub("'", "", tree$tip.label, fixed = TRUE)
head(tree$tip.label)
## [1] "OTU_97.15099" "OTU_97.13686" "OTU_97.30326" "OTU_97.26112"
## [5] "OTU_97.34719" "OTU_97.12776"
Unfortunately, the readDNAStringSet
function for importing the representative sequences – which we will call refseq
(for “reference sequences”) – does not accept standard R file connections for reading files, possibly because it is performing the parsing at a very low level? In any case, unlike the other files we are importing, this must be unzipped first to an uncompressed .fna
file, before we read and parse.
zipped_refseq = tempfile()
refseq_url = "http://downloads.hmpdacc.org/data/HMQCP/rep_set_v35.fna.gz"
download.file(refseq_url, zipped_refseq)
I use the gzfile
function to open an R connection to the .gz
compressed representative sequence file that we just downloaded. Then, use readLines
and writeLines
to essentially “pipe” this uncompressed to our second temporary file, called unzipped_rsfile
.
unzipped_rsfile = tempfile()
refseqgzcon = gzfile(zipped_refseq)
writeLines(readLines(refseqgzcon), unzipped_rsfile)
Now that I have an uncompressed version of the file saved in as temporary (but known) file, I can use the readDNAStringSet
function without error and store the initial import result as the object, refseq
.
refseqs = readDNAStringSet(unzipped_rsfile)
refseqs
## A DNAStringSet instance of length 45411
## width seq names
## [1] 529 TTCAACCCTGCGGTCGTAC...TCGGTCGAGGTTGCCCCC OTU_97.1 SRS01635...
## [2] 523 TTTAATTCTTGCGACCGTA...GACACGCGGCATGGCTGG OTU_97.10 SRS0174...
## [3] 525 TTCACCGTTGCCGGCGTAC...TGGTTCAGACTCTCGTCC OTU_97.100 SRS020...
## [4] 530 TTTAATCTTGCGACCGTAC...ATCAGGGTTCCCCCCTAC OTU_97.1000 SRS01...
## [5] 552 TTCACACTTGCGTGCGTAC...ACTTCCCCTACTCGTCCG OTU_97.10000 SRS0...
## ... ... ...
## [45407] 500 TTCAGCCTTGCGGCCGTAC...ACACAGAATTTGCTGGAC OTU_97.9995 SRS01...
## [45408] 512 TTCACCGTTGCCGGCGTAC...CGCTACTTGGCTGGTTAC OTU_97.9996 SRS01...
## [45409] 518 TTCACCGTTGCCGGCGTAC...TACTTGGCTGGTCAGTAC OTU_97.9997 SRS02...
## [45410] 513 TTCACCGTTGCCGGCGTAC...GCTACTTGGCTGGTTCAG OTU_97.9998 SRS01...
## [45411] 521 TTTAGTCTTGCGACCGTAC...TTCTACACCACGCGGTAC OTU_97.9999 SRS01...
head(names(refseqs))
## [1] "OTU_97.1 SRS016352.SRX019682_1573637534"
## [2] "OTU_97.10 SRS017406.SRX019635_1429133723"
## [3] "OTU_97.100 SRS020032.SRX019636_433516183"
## [4] "OTU_97.1000 SRS016352.SRX019682_1573635514"
## [5] "OTU_97.10000 SRS013510.SRX019689_872338221"
## [6] "OTU_97.10001 SRS024422.SRX019635_1668339487"
We need to parse the “names” of each sequence entry in refseq
such that they match the format of the OTU names in the other files. In this case, that means removing everything from the first space on. There are multiple ways to do this using R, and in the following code we use regular-expression pattern matching and substitution available in the gsub
function.
names(refseqs) = gsub("(^[[:print:]]{7,13})([[:space:]]{1,})([[:print:]]{0,}$)",
"\\1", names(refseqs))
head(names(refseqs))
## [1] "OTU_97.1" "OTU_97.10" "OTU_97.100" "OTU_97.1000"
## [5] "OTU_97.10000" "OTU_97.10001"
I have included a timing function call so that you can see how long this large file took to import using my laptop. Also, the number of “chunks” is shown by the number of “dots” printed after the function call. Note that the phylogenetic tree was already imported, slightly modified in the previous code chunk, and we want this to provide this tree
as the tree argument, rather than the raw tree file, which won't exactly match the taxa_names
in the OTU-Tax file.
We have actually already imported the tree and reference sequences as tree
and refseqs
, respectively; and we are merely passing them along to be included in the object that is output by import_qiime
. The hard work and the bulk of the processing time for import_qiime
will be parsing the OTU-abundance/taxonomy text file in RAM-manageable chunks, prior to combining with the other data components and storing as the phyloseq object, HMPv35
.
Note that you will multiple instances of the following warning:
## Warning: No greengenes prefixes were found. Consider using
## parse_taxonomy_default() instead if true for all OTUs. Dummy ranks may be
## included among taxonomic ranks now.
These arise from OTUs that have only been assigned a “root” taxonomic classification – which is meaningless – and no other. In this dataset, the “root” has no greengenes prefix but all the other taxonomic classifications do. When the default greengenes parsing function (parse_taxonomy_default
) finds these OTUs, it complains that there was not actually a greengenes prefix in any of the taxonomy elements. That's okay, your data (and this data) is fine. It is just a warning, and all the meaningless “root” classification elements are assigned to an equally meaningless dummy taxonomic rank called “Rank1”. See for yourself below in the output of rank_names
.
system.time(HMPv35 <- import_qiime(temp_otutax, temp_map, tree, refseqs))
## Processing map file...
## Processing otu/tax file...
## Reading file into memory prior to parsing...
## Detecting first header line...
## Header is on line 1
## Converting input file to a table...
## Defining OTU table...
## Parsing taxonomy table...
## Processing phylogenetic tree...
## Processing Reference Sequences...
## user system elapsed
## 124.33 42.47 176.87
For more advanced tools for exploring the data, see additional phyloseq documentation
ntaxa(HMPv35)
## [1] 45336
nsamples(HMPv35)
## [1] 4743
sample_variables(HMPv35)
## [1] "X.SampleID" "RSID" "visitno" "sex"
## [5] "RUNCENTER" "HMPbodysubsite" "Mislabeled" "Contaminated"
## [9] "Description"
rank_names(HMPv35)
## [1] "Rank1" "Phylum" "Class" "Order" "Family" "Genus"
head(tax_table(HMPv35))
## Taxonomy Table: [6 taxa by 6 taxonomic ranks]:
## Rank1 Phylum Class Order
## OTU_97.15099 "Root" "Firmicutes" "Bacilli" "Lactobacillales"
## OTU_97.13686 "Root" "Firmicutes" "Bacilli" "Lactobacillales"
## OTU_97.30326 "Root" "Firmicutes" "Bacilli" "Lactobacillales"
## OTU_97.26112 "Root" "Firmicutes" "Bacilli" "Lactobacillales"
## OTU_97.34719 "Root" "Firmicutes" "Bacilli" "Lactobacillales"
## OTU_97.12776 "Root" "Firmicutes" "Bacilli" "Lactobacillales"
## Family Genus
## OTU_97.15099 "Streptococcaceae" "Streptococcus"
## OTU_97.13686 "Streptococcaceae" "Streptococcus"
## OTU_97.30326 "Streptococcaceae" "Streptococcus"
## OTU_97.26112 "Streptococcaceae" "Streptococcus"
## OTU_97.34719 "Streptococcaceae" "Streptococcus"
## OTU_97.12776 "Streptococcaceae" "Streptococcus"
Save HMPv35
to an ".RData"
“ file so that you don't have to run this import ever again. In the following example, I'm saving it to the current directory of my process. Your preferred file path for saving R data may (and probably should) vary.
save(HMPv35, file = "HMPv35.RData", compress = "bzip2")
We re-run this import code each time we update/rebuild these demo vignettes. This already-imported HMPv35
dataset is available by cloning or downloading the phyloseq-demo repository. Typically this RData
binary file will load into R much, much faster than importing it again from the public HMP data files.