How many clusters are there?

From the clusGap documentation: The clusGap function from the cluster package calculates a goodness of clustering measure, called the “gap” statistic. For each number of clusters k, it compares (W(k)) with E^*[(W(k))] where the latter is defined via bootstrapping, i.e. simulating from a reference distribution.

The following is an example performing the gap statistic on ordination results calculated using phyloseq tools, followed by an example of how a ggplot-based wrapper for this example might be included in the phyloseq package.

First perform an ordination

In this case, MDS on the Bray-Curtis distance.

library("phyloseq"); packageVersion("phyloseq")
## [1] '1.16.2'
library("cluster"); packageVersion("cluster")
## [1] '2.0.4'
library("ggplot2"); packageVersion("ggplot2")
## [1] '2.1.0'
theme_set(theme_bw())
# Load data
data(enterotype)
# ordination
exord = ordinate(enterotype, method="MDS", distance="jsd")

Compute Gap Statistic

pam1 = function(x, k){list(cluster = pam(x,k, cluster.only=TRUE))}
x = phyloseq:::scores.pcoa(exord, display="sites")
# gskmn = clusGap(x[, 1:2], FUN=kmeans, nstart=20, K.max = 6, B = 500)
gskmn = clusGap(x[, 1:2], FUN=pam1, K.max = 6, B = 50)
gskmn
## Clustering Gap statistic ["clusGap"].
## B=50 simulated reference sets, k = 1..6
##  --> Number of clusters (method 'firstSEmax', SE.factor=1): 4
##          logW   E.logW       gap     SE.sim
## [1,] 2.995599 3.114563 0.1189645 0.02092678
## [2,] 2.209852 2.772554 0.5627021 0.02326687
## [3,] 1.922188 2.590389 0.6682004 0.02603707
## [4,] 1.685798 2.417158 0.7313603 0.02616323
## [5,] 1.601025 2.279798 0.6787737 0.01999222
## [6,] 1.480640 2.173728 0.6930883 0.02188804

Pretty straightforward. In case it is useful to see, this is what a wrapper-function might look like to “add-on” code to phyloseq.

gap_statistic_ordination = function(ord, FUNcluster, type="sites", K.max=6, axes=c(1:2), B=500, verbose=interactive(), ...){
    require("cluster")
    #   If "pam1" was chosen, use this internally defined call to pam
    if(FUNcluster == "pam1"){
        FUNcluster = function(x,k) list(cluster = pam(x, k, cluster.only=TRUE))     
    }
    # Use the scores function to get the ordination coordinates
    x = phyloseq:::scores.pcoa(ord, display=type)
    #   If axes not explicitly defined (NULL), then use all of them
    if(is.null(axes)){axes = 1:ncol(x)}
    #   Finally, perform, and return, the gap statistic calculation using cluster::clusGap  
    clusGap(x[, axes], FUN=FUNcluster, K.max=K.max, B=B, verbose=verbose, ...)
}

Plot Results

Define a plot method for results

plot_clusgap = function(clusgap, title="Gap Statistic calculation results"){
    require("ggplot2")
    gstab = data.frame(clusgap$Tab, k=1:nrow(clusgap$Tab))
    p = ggplot(gstab, aes(k, gap)) + geom_line() + geom_point(size=5)
    p = p + geom_errorbar(aes(ymax=gap+SE.sim, ymin=gap-SE.sim))
    p = p + ggtitle(title)
    return(p)
}

Now try out this function. Should work on ordination classes recognized by scores function, and provide a ggplot graphic instead of a base graphic. (Special Note: the phyloseq-defined scores extensions are not exported as regular functions to avoid conflict, so phyloseq-defined scores extensions can only be accessed with the phyloseq::: namespace prefix in front.)

gs = gap_statistic_ordination(exord, "pam1", B=50, verbose=FALSE)
print(gs, method="Tibs2001SEmax")
## Clustering Gap statistic ["clusGap"].
## B=50 simulated reference sets, k = 1..6
##  --> Number of clusters (method 'Tibs2001SEmax', SE.factor=1): 4
##          logW   E.logW       gap     SE.sim
## [1,] 2.995599 3.110773 0.1151745 0.01957396
## [2,] 2.209852 2.767382 0.5575303 0.01873947
## [3,] 1.922188 2.581996 0.6598080 0.02314878
## [4,] 1.685798 2.408179 0.7223816 0.02549674
## [5,] 1.601025 2.276531 0.6755064 0.02266678
## [6,] 1.480640 2.180340 0.6996997 0.02696254
plot_clusgap(gs)

Base graphics plotting, for comparison.

plot(gs, main = "Gap statistic for the 'Enterotypes' data")
mtext("Looks like 4 clusters is best, with 3 and 5 close runners up.")