Future Development of phyloseq

Issues Tracker

Development of phyloseq is ongoing, and the details are documented extensively on the phyloseq issues tracker.

High level goals

In broader strokes, some near-term plans include:

The “compartmentalization” of the data infrastructure portion of the phyloseq package into a separate Bioconductor package, a la this issue. This may reduce the number of dependencies pulled in when reusing the data tools of phyloseq in other Bioconductor/R packages.
The development and calibration of additional tools in phyloseq for formal preprocessing, filtering, normalization, shrinkage and variance stabilization (e.g. Allison 2006).
Additional structured wrappers for the ade4 and vegan packages, with the most commonly used tools given specific wrapping functions in phyloseq or phyloseqBase.
Structures for additional data components, e.g. mass spec and expression data.
Animated ordinations for time-series (and analogous) data. See the beta-version support package, animate.phyloseq.
The compilation of a Bioconductor data package (“phyloseqData”) that includes many key published datasets already imported as separate “phyloseq” instances and available through R’s data interface.
Big(ger) Data: The firehose of next-generation sequencing data is making “big” datasets possible in this realm. For example, the demo on importing the Human Microbiome Project data into R takes a considerable amount of time to run on a typical desktop/laptop and may push some less powerful machines to their limit. And that’s just data import. We are considering how best to help a tool like phyloseq address the computational issues that arise with data of this size, without compromising its other features (interactivity, reproducibility, connection with existing R tools). Some promising tools already available for R that might help include:

Sparse Representation

This is simply a back-end consideration for very large datasets that challenge available memory, which is not the case for most users or datasets.

For instance, the Matrix package provides a large family of matrix classes, including sparse ones, that could address this.
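To make the memory argument concrete, here is a minimal sketch in R of holding an abundance table in a sparse class from the Matrix package. The table is simulated (the dimensions, the 5% fill rate, and the names `dense_counts` and `sparse_counts` are assumptions for illustration only), and this is not how phyloseq currently stores its OTU table.

```r
library(Matrix)

set.seed(1)
n_taxa <- 2000     # illustrative sizes, not taken from any real dataset
n_samples <- 50

# Simulate a mostly-zero count table: about 5% of cells hold a non-zero count.
n_nonzero <- round(0.05 * n_taxa * n_samples)
idx <- sample(n_taxa * n_samples, n_nonzero)   # linear indices of non-zero cells
dense_counts <- matrix(0, nrow = n_taxa, ncol = n_samples,
                       dimnames = list(paste0("OTU", seq_len(n_taxa)),
                                       paste0("S", seq_len(n_samples))))
dense_counts[idx] <- rpois(n_nonzero, lambda = 20) + 1

# Convert to a compressed sparse column matrix (class "dgCMatrix").
sparse_counts <- Matrix(dense_counts, sparse = TRUE)

# Compare the in-memory footprint of the two representations.
dense_bytes <- as.numeric(object.size(dense_counts))
sparse_bytes <- as.numeric(object.size(sparse_counts))

# Common summaries work directly on the sparse object.
sample_depth <- colSums(sparse_counts)
```

Only the non-zero entries and their positions are stored, so the saving grows with the proportion of zeros. For a table that is mostly non-zero, the sparse form can be larger than the dense one.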

Improved Interoperability with biomformat

One option is to store the full dataset in a comprehensive, cross-language data structure. The biom-format is a good candidate, especially now that it has an HDF5 definition for very large datasets, where on-disk storage might make the most sense.

Some candidate packages for storage are:

HDF5, via the rhdf5 and biomformat packages
The ff package
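As a rough illustration of the on-disk idea, the sketch below writes a simulated count matrix to an HDF5 file with rhdf5 and then reads back only a small block of it. It assumes the Bioconductor package rhdf5 is installed and is skipped otherwise. The dataset name `otu_counts` and the temporary file are hypothetical, and the layout is a plain HDF5 dataset, not the biom-format HDF5 schema.

```r
# Runs only if rhdf5 is available; nothing here is part of phyloseq itself.
if (requireNamespace("rhdf5", quietly = TRUE)) {
  set.seed(1)
  # Simulated counts: 1000 taxa (rows) by 20 samples (columns).
  counts <- matrix(rpois(1000 * 20, lambda = 5), nrow = 1000, ncol = 20)

  h5_path <- tempfile(fileext = ".h5")            # illustrative file location
  rhdf5::h5createFile(h5_path)
  rhdf5::h5write(counts, h5_path, "otu_counts")   # illustrative dataset name

  # Read only taxa 1-10 for samples 1-5; the rest of the table stays on disk.
  block <- rhdf5::h5read(h5_path, "otu_counts", index = list(1:10, 1:5))

  rhdf5::h5closeAll()
}
```

The point of the partial read is that memory use depends on the block requested rather than on the size of the full table, which is the property that matters for very large datasets.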

Mon Mar 12 15:05:44 2018