Gap Statistic

How many clusters are there?

From the clusGap documentation: The clusGap function from the cluster package calculates a goodness of clustering measure, called the “gap” statistic. For each number of clusters k, it compares (W(k)) with E^*[(W(k))] where the latter is defined via bootstrapping, i.e. simulating from a reference distribution.

The following is an example performing the gap statistic on ordination results calculated using phyloseq tools, followed by an example of how a ggplot-based wrapper for this example might be included in the phyloseq package.

First perform an ordination

In this case, MDS on the Bray-Curtis distance.

library("phyloseq"); packageVersion("phyloseq")

## [1] '1.22.3'

library("cluster"); packageVersion("cluster")

## [1] '2.0.6'

library("ggplot2"); packageVersion("ggplot2")

## [1] '2.2.1'

theme_set(theme_bw())
# Load data
data(enterotype)
# ordination
exord = ordinate(enterotype, method="MDS", distance="jsd")

Compute Gap Statistic

pam1 = function(x, k){list(cluster = pam(x,k, cluster.only=TRUE))}
x = phyloseq:::scores.pcoa(exord, display="sites")
# gskmn = clusGap(x[, 1:2], FUN=kmeans, nstart=20, K.max = 6, B = 500)
gskmn = clusGap(x[, 1:2], FUN=pam1, K.max = 6, B = 50)

## Clustering k = 1,2,..., K.max (= 6): .. done
## Bootstrapping, b = 1,2,..., B (= 50)  [one "." per sample]:
## .................................................. 50

gskmn

## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = x[, 1:2], FUNcluster = pam1, K.max = 6, B = 50)
## B=50 simulated reference sets, k = 1..6; spaceH0="scaledPCA"
##  --> Number of clusters (method 'firstSEmax', SE.factor=1): 4
##          logW   E.logW       gap     SE.sim
## [1,] 2.995599 3.113476 0.1178770 0.02031872
## [2,] 2.209852 2.771964 0.5621123 0.02358254
## [3,] 1.922188 2.583273 0.6610850 0.02854903
## [4,] 1.685798 2.408618 0.7228201 0.02858372
## [5,] 1.601025 2.276989 0.6759638 0.02019060
## [6,] 1.480640 2.177991 0.6973513 0.02167414

Pretty straightforward. In case it is useful to see, this is what a wrapper-function might look like to “add-on” code to phyloseq.

gap_statistic_ordination = function(ord, FUNcluster, type="sites", K.max=6, axes=c(1:2), B=500, verbose=interactive(), ...){
    require("cluster")
    #   If "pam1" was chosen, use this internally defined call to pam
    if(FUNcluster == "pam1"){
        FUNcluster = function(x,k) list(cluster = pam(x, k, cluster.only=TRUE))     
    }
    # Use the scores function to get the ordination coordinates
    x = phyloseq:::scores.pcoa(ord, display=type)
    #   If axes not explicitly defined (NULL), then use all of them
    if(is.null(axes)){axes = 1:ncol(x)}
    #   Finally, perform, and return, the gap statistic calculation using cluster::clusGap  
    clusGap(x[, axes], FUN=FUNcluster, K.max=K.max, B=B, verbose=verbose, ...)
}

Plot Results

Define a plot method for results

plot_clusgap = function(clusgap, title="Gap Statistic calculation results"){
    require("ggplot2")
    gstab = data.frame(clusgap$Tab, k=1:nrow(clusgap$Tab))
    p = ggplot(gstab, aes(k, gap)) + geom_line() + geom_point(size=5)
    p = p + geom_errorbar(aes(ymax=gap+SE.sim, ymin=gap-SE.sim))
    p = p + ggtitle(title)
    return(p)
}

Now try out this function. Should work on ordination classes recognized by scores function, and provide a ggplot graphic instead of a base graphic. (Special Note: the phyloseq-defined scores extensions are not exported as regular functions to avoid conflict, so phyloseq-defined scores extensions can only be accessed with the phyloseq::: namespace prefix in front.)

gs = gap_statistic_ordination(exord, "pam1", B=50, verbose=FALSE)
print(gs, method="Tibs2001SEmax")

## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = x[, axes], FUNcluster = FUNcluster, K.max = K.max,     B = B, verbose = verbose)
## B=50 simulated reference sets, k = 1..6; spaceH0="scaledPCA"
##  --> Number of clusters (method 'Tibs2001SEmax', SE.factor=1): 4
##          logW   E.logW       gap     SE.sim
## [1,] 2.995599 3.114646 0.1190475 0.02035141
## [2,] 2.209852 2.769811 0.5599590 0.02040883
## [3,] 1.922188 2.586105 0.6639169 0.02058187
## [4,] 1.685798 2.414474 0.7286765 0.02770392
## [5,] 1.601025 2.279208 0.6781836 0.02079129
## [6,] 1.480640 2.176011 0.6953714 0.02481494

plot_clusgap(gs)

Base graphics plotting, for comparison.

plot(gs, main = "Gap statistic for the 'Enterotypes' data")
mtext("Looks like 4 clusters is best, with 3 and 5 close runners up.")