Cluster analysis

Why?

  1. Too many dimensions

  2. Need to sort and group

  3. Test a hypothesis | Explore hypotheses

Ordinations

Ordinations - taking complex data and working to present it in 2-3 dimensions, we are placing a new coordinate system (fitting) in place. Principle Component Analysis (PCA) works to generate this coordinate system that is most appropriate for your data. Evaluated based on the eigen decomposition from sample covariance matrix. (Eigenvectors = principle axes)

Background

References / recommended reading:

  1. THIS TUTORIAL: Coenen AR, Hu SK, Luo E, Muratore D, Weitz JS. A Primer for Microbiome Time-Series Analysis. Front Genet. 2020 Apr 21;11:310.

  2. Gloor, G. B., Macklaim, J. M., Pawlowsky-Glahn, V. & Egozcue, J. J. Microbiome Datasets Are Compositional: And This Is Not Optional. Front. Microbiol. 8, 57–6 (2017).

  3. Weiss, S. et al. Normalization and microbial differential abundance strategies depend upon data characteristics. Microbiome 5, 1–18 (2017).

  4. McMurdie, P. J. & Holmes, S. Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible. PLoS Comput Biol 10, e1003531 (2014).

  5. Silverman JD, Washburne AD, Mukherjee S, David LA. A phylogenetic transform enhances analysis of compositional microbiota data. eLife. 2017 Feb 15;6:e21887.

Import data

Set up preliminary environment

library(tidyverse)
Warning: package 'ggplot2' was built under R version 4.2.3
Warning: package 'tidyr' was built under R version 4.2.3
Warning: package 'readr' was built under R version 4.2.3
Warning: package 'dplyr' was built under R version 4.2.3
Warning: package 'stringr' was built under R version 4.2.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(compositions)
Warning: package 'compositions' was built under R version 4.2.3
Welcome to compositions, a package for compositional data analysis.
Find an intro with "? compositions"


Attaching package: 'compositions'

The following objects are masked from 'package:stats':

    anova, cor, cov, dist, var

The following object is masked from 'package:graphics':

    segments

The following objects are masked from 'package:base':

    %*%, norm, scale, scale.default

Evidence of compositionality (diagnostic)

Do we need to transform our data? We can use the estimate covariance matrix

Prep data frame

# dim(asv_table)
# head(asv_table)

# Random subset 1000 rows
asv_df <- asv_table |> 
  sample_n(1000) |> 
  column_to_rownames(var = "FeatureID") |> 
  select(-Taxon)
  
# dim(asv_df)

%*% = matrix multiplication sign in R; used here to multiply OTU/ASV data matrix to itself to estimate covariance.

covariance <- as.matrix(asv_df) %*% t(asv_df)
det_output <- det(covariance)
det_output
[1] 0

The determinant of the covariance matrix (what we just calculated) is equivalent to the product of the proportion of variance explained by every PCA axis. If the determinant is 0, that means there is an axis which explains 0 variance that we can’t separate from the other axes. Therefore, data need to be transformed to be suitable for PCA.

Data transformations

Another approach to PCA ordination w/Compositional Data: Log-Ratio Transformations

  • Log-ratio - combat issues with variance and distribution, all variables will be transformed to be relative to the abundance of an arbitrary focal taxon. Divide all the count data by the abundance of a focal taxa and take the natural log. So why do this? This is one way we can transform our amplicon data to be more appropriate/more suitable for many statistical approaches.

  • ilr - isometric log ratio

  • clr - centered log ratio, values will be centered around the geometric average

Isometric

gorda_ilr <- data.frame(compositions::ilr(t(asv_df)))
cov_ilr <- as.matrix(gorda_ilr) %*% t(gorda_ilr)
det_ilr <- det(cov_ilr)
det_ilr
[1] 2.426553e+40

prcomp() - Principle Component Analysis. Input is a data matrix and results are an R class “prcomp”.

Explore the covariance after doing the transformation.

ilr_pca <- prcomp(gorda_ilr)
class(ilr_pca)
[1] "prcomp"
ilr_pca$sdev 
 [1] 5.178359e+00 4.451509e+00 4.126228e+00 3.737649e+00 3.415436e+00
 [6] 3.367891e+00 3.237619e+00 3.204526e+00 3.121084e+00 3.036606e+00
[11] 2.996977e+00 2.899159e+00 2.780506e+00 2.618325e+00 2.560523e+00
[16] 2.146984e+00 1.862622e+00 2.388152e-15

follow up reading on prcomp

# Visual representation - screeplot
lograt_variances <- as.data.frame(ilr_pca$sdev^2/sum(ilr_pca$sdev^2)) %>% #Extract axes
  # Format to plot
  select(PercVar = 'ilr_pca$sdev^2/sum(ilr_pca$sdev^2)') %>% 
  rownames_to_column(var = "PCaxis") %>% 
  data.frame
# Plot screeplot
ggplot(lograt_variances, aes(x = as.numeric(PCaxis), y = PercVar)) + 
  geom_bar(stat = "identity", fill = "grey", color = "black") +
  theme_minimal() +
  theme(axis.title = element_text(color = "black", face = "bold", size = 10),
        axis.text.y = element_text(color = "black", face = "bold"),
        axis.text.x = element_blank()) +
  labs(x = "PC axis", y = "% Variance", title = "Log-Ratio PCA Screeplot")

Center log ratio

gorda_clr <- data.frame(compositions::clr(t(asv_df)))
cov_clr <- as.matrix(gorda_clr) %*% t(gorda_clr)
det_clr <- det(cov_clr)
det_clr
[1] 2.426553e+40
clr_pca <- prcomp(gorda_clr)

clr_variances <- as.data.frame(clr_pca$sdev^2/sum(clr_pca$sdev^2)) %>% #Extract axes
  # Format to plot
  select(PercVar = 'clr_pca$sdev^2/sum(clr_pca$sdev^2)') %>% 
  rownames_to_column(var = "PCaxis") %>% 
  data.frame

# Plot screeplot
ggplot(clr_variances, aes(x = as.numeric(PCaxis), y = PercVar)) + 
  geom_bar(stat = "identity", fill = "grey", color = "black") +
  theme_minimal() +
  theme(axis.title = element_text(color = "black", face = "bold", size = 10),
        axis.text.y = element_text(color = "black", face = "bold"),
        axis.text.x = element_blank()) +
  labs(x = "PC axis", y = "% Variance", title = "Log-Ratio PCA Screeplot")

PCA

Lets pick one. My favorite is always the clr.

class(clr_pca)
[1] "prcomp"
# Visualize the PCA
# clr_pca$
clr_pca$x #View PC values
                                               PC1       PC2        PC3
GordaRidge_Vent039_SUPRS1_2019          -7.8136532  7.494693 -2.8582200
GordaRidge_Vent040_SUPRS2_2019          -9.1363140  4.538102  0.3794693
GordaRidge_BSW081_sterivex_2019          1.6009412  1.545660 -7.0441090
GordaRidge_BSW056_sterivex_2019_REPb     0.4809999 -5.890110 -0.8983877
GordaRidge_Vent088_SUPRS3_2019           5.2263501  1.191327  2.6590195
GordaRidge_Plume001_sterivex_2019_REPa   7.7946359  5.972814  6.6121580
GordaRidge_Plume001_sterivex_2019_REPb   5.0334438  4.528548  1.8486798
GordaRidge_Plume036_sterivex_2019_REPb   2.2261855  1.496157 -8.1632903
GordaRidge_Plume096_sterivex_2019        1.0336953 -6.241536 -1.4581866
GordaRidge_Vent105_SUPRS9_2019          -0.2470926 -3.422314 -5.4194807
GordaRidge_Vent041_SUPRS3_2019         -11.2811708  1.871824  4.4183622
GordaRidge_Vent086_SUPRS1_2019           4.5859168  1.494836 -1.1330836
GordaRidge_Vent107_SUPRS11_2019          2.1292156  0.118059  3.9059303
GordaRidge_Vent009_SUPRS1_2019          -0.3035928 -2.064704 -2.5717286
GordaRidge_Vent011_SUPRS3_2019          -4.1701019 -6.917408  5.0390290
GordaRidge_Vent106_SUPRS10_2019          2.6372677 -3.654706  2.6437269
GordaRidge_Vent010_SUPRS2_2019          -2.3235680 -5.455979  1.7994360
GordaRidge_Vent087_SUPRS2_2019           2.5268413  3.394736  0.2406755
                                              PC4        PC5        PC6
GordaRidge_Vent039_SUPRS1_2019          0.5853997 -1.8599013  0.8335125
GordaRidge_Vent040_SUPRS2_2019          0.9513623  0.9762615 -0.7453047
GordaRidge_BSW081_sterivex_2019        -0.5990361 -4.8957990  9.2979107
GordaRidge_BSW056_sterivex_2019_REPb   -3.8027815 -3.4603966 -3.9973122
GordaRidge_Vent088_SUPRS3_2019          3.1452520 -1.7983552 -0.9985789
GordaRidge_Plume001_sterivex_2019_REPa -5.9821424  0.5643154 -0.6252184
GordaRidge_Plume001_sterivex_2019_REPb -3.7657968 -0.2333092  0.6694650
GordaRidge_Plume036_sterivex_2019_REPb -1.1361132 10.5142376 -1.8330749
GordaRidge_Plume096_sterivex_2019       2.0373430 -0.2214074  0.5018464
GordaRidge_Vent105_SUPRS9_2019         -6.2391838 -3.7399956 -5.5263953
GordaRidge_Vent041_SUPRS3_2019         -0.7753195 -1.0841440 -2.3449245
GordaRidge_Vent086_SUPRS1_2019         10.0067007 -2.6197402 -4.9560969
GordaRidge_Vent107_SUPRS11_2019        -0.6898155  1.1138606  2.5978148
GordaRidge_Vent009_SUPRS1_2019         -0.1128133 -0.0803269  0.7626128
GordaRidge_Vent011_SUPRS3_2019          0.3078292  3.6034640  2.8101004
GordaRidge_Vent106_SUPRS10_2019         1.9887275 -0.5606825  1.6681576
GordaRidge_Vent010_SUPRS2_2019          0.6994025  2.0001555  2.2287919
GordaRidge_Vent087_SUPRS2_2019          3.3809852  1.7817633 -0.3433062
                                               PC7        PC8          PC9
GordaRidge_Vent039_SUPRS1_2019          0.10984998  0.3696854  0.715605912
GordaRidge_Vent040_SUPRS2_2019          0.59808106  1.1344267  2.859706232
GordaRidge_BSW081_sterivex_2019         0.08611798  1.1064133 -0.571670051
GordaRidge_BSW056_sterivex_2019_REPb    4.50619192  1.7292399  5.920244146
GordaRidge_Vent088_SUPRS3_2019          1.57717333 -6.7397281 -2.274740941
GordaRidge_Plume001_sterivex_2019_REPa -0.44204944  4.2474052 -2.591276630
GordaRidge_Plume001_sterivex_2019_REPb  0.17281861  2.5605504 -0.079034538
GordaRidge_Plume036_sterivex_2019_REPb  3.02235302 -0.4671150 -1.010926692
GordaRidge_Plume096_sterivex_2019       1.52238580  0.3511822 -6.053876020
GordaRidge_Vent105_SUPRS9_2019         -5.70963255 -2.7002895 -0.950053369
GordaRidge_Vent041_SUPRS3_2019          2.32677353 -1.7147593 -5.064775761
GordaRidge_Vent086_SUPRS1_2019          0.44824464  5.3025673  1.053929872
GordaRidge_Vent107_SUPRS11_2019         3.43525619 -5.3713132  5.899991073
GordaRidge_Vent009_SUPRS1_2019          0.22278117  0.7541454 -0.068182539
GordaRidge_Vent011_SUPRS3_2019         -4.98424817  3.3040684  1.407308877
GordaRidge_Vent106_SUPRS10_2019         1.62767517 -1.9783337 -1.492912037
GordaRidge_Vent010_SUPRS2_2019         -0.41941509  1.7196031  0.003326381
GordaRidge_Vent087_SUPRS2_2019         -8.10035714 -3.6077484  2.297336084
                                              PC10       PC11       PC12
GordaRidge_Vent039_SUPRS1_2019          1.39946675  3.2753120  1.8936698
GordaRidge_Vent040_SUPRS2_2019          0.14301966  3.0504403  4.4990535
GordaRidge_BSW081_sterivex_2019        -1.89264463 -2.7457225 -1.8060827
GordaRidge_BSW056_sterivex_2019_REPb   -3.55087935  3.4437896 -3.8670740
GordaRidge_Vent088_SUPRS3_2019         -7.12165231  0.5030758  3.9059055
GordaRidge_Plume001_sterivex_2019_REPa  0.15687793  0.3409309  0.5627682
GordaRidge_Plume001_sterivex_2019_REPb  0.16114175  0.5779229 -0.2249137
GordaRidge_Plume036_sterivex_2019_REPb -0.66714902 -1.3706984 -0.9992002
GordaRidge_Plume096_sterivex_2019       1.53890046  4.2112810  1.4302817
GordaRidge_Vent105_SUPRS9_2019          2.62814219 -2.9812951  3.0570484
GordaRidge_Vent041_SUPRS3_2019         -0.69592172 -4.2901449 -5.5715380
GordaRidge_Vent086_SUPRS1_2019          1.34187733 -4.2906166  0.2306734
GordaRidge_Vent107_SUPRS11_2019         4.10548120 -4.4581073  1.4173089
GordaRidge_Vent009_SUPRS1_2019          0.02649703  0.4100601  0.2067263
GordaRidge_Vent011_SUPRS3_2019         -2.10549955 -2.1273228  1.6182111
GordaRidge_Vent106_SUPRS10_2019         7.11184493  3.4292522 -1.7346496
GordaRidge_Vent010_SUPRS2_2019         -1.84599584 -0.2704446  1.1108578
GordaRidge_Vent087_SUPRS2_2019         -0.73350682  3.2922873 -5.7290464
                                              PC13        PC14       PC15
GordaRidge_Vent039_SUPRS1_2019          1.45574041 -1.36430736 -6.8306400
GordaRidge_Vent040_SUPRS2_2019         -1.81659395  2.01184113  6.3836722
GordaRidge_BSW081_sterivex_2019        -1.08036086  1.65489767  1.8269364
GordaRidge_BSW056_sterivex_2019_REPb    0.10883450  1.13767420 -0.6164449
GordaRidge_Vent088_SUPRS3_2019         -2.27047006  0.56597894 -1.1972686
GordaRidge_Plume001_sterivex_2019_REPa  0.53032296  0.22273083  0.7574762
GordaRidge_Plume001_sterivex_2019_REPb -0.01173338 -0.91437182 -0.5373933
GordaRidge_Plume036_sterivex_2019_REPb -1.34497343  0.93883523 -0.6440523
GordaRidge_Plume096_sterivex_2019       6.86147323  1.89215460  1.2990508
GordaRidge_Vent105_SUPRS9_2019         -0.90020743 -0.28145036  0.3948457
GordaRidge_Vent041_SUPRS3_2019          0.13314711  0.21061565  0.6181466
GordaRidge_Vent086_SUPRS1_2019          0.05806217  0.04563998 -0.0789418
GordaRidge_Vent107_SUPRS11_2019         4.34022224 -0.01293642  0.1991291
GordaRidge_Vent009_SUPRS1_2019          0.28173491 -2.24746909 -0.2760678
GordaRidge_Vent011_SUPRS3_2019         -0.78939081  4.53450744 -3.2682822
GordaRidge_Vent106_SUPRS10_2019        -6.77198632  0.68508248 -0.8179652
GordaRidge_Vent010_SUPRS2_2019         -0.72164897 -8.63300708  1.2616129
GordaRidge_Vent087_SUPRS2_2019          1.93782767 -0.44641602  1.5261862
                                             PC16        PC17          PC18
GordaRidge_Vent039_SUPRS1_2019          1.2832806  0.75016565  4.712845e-15
GordaRidge_Vent040_SUPRS2_2019         -0.2994072 -0.07895867  2.761361e-15
GordaRidge_BSW081_sterivex_2019         1.0173041  0.76670779 -1.355986e-14
GordaRidge_BSW056_sterivex_2019_REPb    0.7825043  0.60023506  2.493538e-15
GordaRidge_Vent088_SUPRS3_2019         -0.1435484 -0.12737550  2.780838e-15
GordaRidge_Plume001_sterivex_2019_REPa  4.3131427 -0.62686459 -2.221836e-15
GordaRidge_Plume001_sterivex_2019_REPb -7.3322085  1.01808193  2.729658e-15
GordaRidge_Plume036_sterivex_2019_REPb  0.5006400  0.65655982  2.626417e-15
GordaRidge_Plume096_sterivex_2019      -0.4083416  0.88511402  7.339894e-16
GordaRidge_Vent105_SUPRS9_2019          0.1197658  0.72729471 -6.217338e-15
GordaRidge_Vent041_SUPRS3_2019         -0.4930825 -0.15274411  4.177932e-15
GordaRidge_Vent086_SUPRS1_2019          0.1550844  0.33355042 -3.119402e-15
GordaRidge_Vent107_SUPRS11_2019         0.3250068  0.03251975  3.126441e-15
GordaRidge_Vent009_SUPRS1_2019         -0.8332094 -7.06865122  1.195868e-15
GordaRidge_Vent011_SUPRS3_2019         -0.4407791 -0.05969055  3.612404e-15
GordaRidge_Vent106_SUPRS10_2019         0.2927570  0.26516134 -1.216574e-14
GordaRidge_Vent010_SUPRS2_2019          0.8456788  2.02347829  4.365593e-15
GordaRidge_Vent087_SUPRS2_2019          0.3154124  0.05541588  5.188293e-15
pca_clr_df <- data.frame(clr_pca$x) %>% 
  rownames_to_column(var = "SAMPLE") %>%
  left_join(metadata)
Joining with `by = join_by(SAMPLE)`
# head(pca_clr_df)

Make PCA plot

# Plot PCA
ggplot(pca_clr_df) +
  geom_point(aes(x = PC1, y = PC2, fill = VENT), size = 4, shape = 21, color = "black") +
  ylab(paste0('PC2 ', round(lograt_variances[2,2]*100,2),'%')) + #Extract y axis value from variance
  xlab(paste0('PC1 ', round(lograt_variances[1,2]*100,2),'%')) + #Extract x axis value from variance
  scale_fill_brewer(palette = 'Dark2') +
  ggtitle('Centered Log-Ratio PCA Ordination') +
  coord_fixed(ratio = 1) +
  theme_bw()
Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

PCoA

PCoA is doing a PCA on a distance matrix constructed from the data. We then need a distance matrix. Different distance metrics emphasize separate attributes/factors in microbial community comparison.

Distance matrix

If we want to prioritize differences in presence/absence between samples - use Jaccard. This can be estimated from untransformed count data, and does a pretty good job considering rare taxa. Another is unweighted Unifrac that includes phylogenetic relatedness (see last week’s lesson).

library(vegan)
Loading required package: permute
Loading required package: lattice
Warning: package 'lattice' was built under R version 4.2.3
This is vegan 2.6-4
library(ape)
Warning: package 'ape' was built under R version 4.2.3

Attaching package: 'ape'
The following object is masked from 'package:compositions':

    balance
The following object is masked from 'package:dplyr':

    where
jac_dmat <- vegdist(t(asv_df),method="jaccard") # Jaccard dist metric
pcoa_jac <- ape::pcoa(jac_dmat) #perform PCoA

We can use the screeplot to determine if we are going in the right direction.

jac_variances <- data.frame(pcoa_jac$values$Relative_eig) %>% 
  select(PercVar = 'pcoa_jac.values.Relative_eig') %>% 
  rownames_to_column(var = "PCaxis") %>% 
  data.frame
ggplot(jac_variances, aes(x = as.numeric(PCaxis), y = PercVar)) + 
  geom_bar(stat = "identity", fill = "grey", color = "black") +
  theme_minimal() +
  theme(axis.title = element_text(color = "black", face = "bold", size = 10),
        axis.text.y = element_text(color = "black", face = "bold"),
        axis.text.x = element_blank()) +
  labs(x = "PC axis", y = "% Variance", title = "Log-Ratio PCoA Screeplot")

# Extract variances from pcoa, from jaccard calculated dist. metric
## where samples fall among axes
pcoa_jac_df <- data.frame(pcoa_jac$vectors) %>% 
  rownames_to_column(var = "TimeofDay") %>% 
  separate(TimeofDay, into = c("excess", "TimeofDay"), sep = "_") %>% 
  select(-excess) %>% 
  data.frame
Warning: Expected 2 pieces. Additional pieces discarded in 18 rows [1, 2, 3, 4, 5, 6, 7,
8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18].
head(pcoa_jac_df)
  TimeofDay       Axis.1      Axis.2       Axis.3       Axis.4       Axis.5
1   Vent039 -0.511333960  0.24510098  0.002724799 -0.014780337  0.001248485
2   Vent040 -0.471925533  0.12993493  0.102824667  0.003851581 -0.018398321
3    BSW081  0.009154644  0.09012354 -0.526259280 -0.307777122  0.315467056
4    BSW056  0.063135741 -0.28182252 -0.035892201 -0.032751972 -0.056781429
5   Vent088  0.249164360  0.11271888  0.148602132  0.061580537  0.127690807
6  Plume001  0.273580285  0.31896957  0.052082558 -0.173870964 -0.235917650
       Axis.6       Axis.7        Axis.8      Axis.9      Axis.10     Axis.11
1  0.00575785 -0.002950044 -0.1035403386 -0.02064122 -0.043766331 -0.03282459
2 -0.02912210  0.031274317 -0.0408526685 -0.01139825  0.007774317  0.01496053
3 -0.07343554 -0.161919327  0.0008682262  0.02164156  0.037738741  0.04190189
4 -0.02412939  0.075845547 -0.1979221811  0.09264871  0.261368972 -0.34640665
5  0.04972538 -0.007616491 -0.1670897130  0.05159573 -0.074337976  0.11709158
6 -0.08461598  0.012419863  0.0722941976 -0.11208131 -0.061826255 -0.08292078
       Axis.12     Axis.13     Axis.14     Axis.15      Axis.16      Axis.17
1 -0.028074522 -0.03010100  0.17510883  0.21945956  0.109465415  0.075283311
2 -0.004368510  0.05237571 -0.05666681 -0.03448782 -0.124018240 -0.258128825
3  0.006443506  0.01841906 -0.03519646 -0.02368154 -0.010810246 -0.005225330
4  0.068221228  0.06867835 -0.02625598  0.04011823 -0.005187598  0.003907832
5  0.406026310 -0.01269803  0.04131232  0.01937920 -0.020588905 -0.007539995
6  0.005454713 -0.16190160 -0.22238807  0.12523584  0.015402771 -0.009846016
# Select eigen values from dataframe, round to 4 places. These will be the axes for the 3-D plot
# Extract variances from previously generated dataframe, round and multiply by 100 for plotting
eigenvalues<-round(jac_variances[,2], digits = 4)*100

# Plotly - 3-D
# plotly::plot_ly(pcoa_jac_df, type='scatter3d', mode='markers',
#         x=~Axis.2,y=~Axis.3,z=~Axis.1,colors=~brewer.pal(6,'Set1'),color=~TimeofDay)%>%
#   layout(font=list(size=18),
#          title='PCoA Jaccard Distance',
#          scene=list(xaxis=list(title=paste0('Co 2 ',eigenvalues[2],'%'),
#                                showticklabels=FALSE,zerolinecolor='black'),
#                     yaxis=list(title=paste0('Co 3 ',eigenvalues[3],'%'),
#                                showticklabels=FALSE,zerolinecolor='black'),
#                     zaxis=list(title=paste0('Co 1 ',eigenvalues[1],'%'),
#                                showticklabels=FALSE,zerolinecolor='black')))

Euclidean

Performing the log-ratio transformation makes the data all occupy a similar dynamic range, so we can use magnitude-sensitive distances like Euclidean distance.

Euclidean, recommended for analysis of the difference in compositions. e.g., when we are working to understand changes in relative abundance. Because it depends on the composition, we must input transformed data.

# ?dist()
euc_dmat <- dist(asv_df, method = "euclidean") 

Conduct ordination w/distance matrix = PCoA

pcoa_euc<-ape::pcoa(euc_dmat)

Repeat screeplot check. Extract variances from pcoa, from jaccard calculated dist. metric

euc_variances <- data.frame(pcoa_euc$values$Relative_eig) %>% 
  select(PercVar = 'pcoa_euc.values.Relative_eig') %>% 
  rownames_to_column(var = "PCaxis") %>% 
  data.frame
# head(euc_variances)

# Screeplot check
ggplot(euc_variances, aes(x = as.numeric(PCaxis), y = PercVar)) + 
  geom_bar(stat = "identity", fill = "grey", color = "black") +
  theme_minimal() +
  theme(axis.title = element_text(color = "black", face = "bold", size = 10),
        axis.text.y = element_text(color = "black", face = "bold"),
        axis.text.x = element_blank()) +
  labs(x = "PC axis", y = "% Variance", title = "Euclidean\nLog-Ratio PCoA Screeplot")

Compare what the previous scree plot looked like.

Is there a difference in the ordinate output when we use a different metric?

To compare with the euclidean distance ordination representing differences in relative composition. Further note: If your ordination has data which align in a ‘T’ or ‘+’ shape perpendicular to the axes this is often diagnostic of covariance attributed to the higher dimensions which are not plotted.

What happens if I have a screeplot that requires 3D, but I can’t do it?

Although our statistical ordinations appear to require at least 3 dimensions to communicate the data. However, we don’t always have the budget for a 3D plot. So we may want to impose the condition on an ordination technique that the answer MUST go in 2D. We turn to NMDS here.

NMDS

set.seed(1984)

So we can compare the relative composition based distance metric to the presence/absence based distance metric

library(vegan)
# ?metaMDS()

NMDS: force data into a desired number of dimensions, works to preserve all pairwise distances between points.

Start with Euclidean distance again. But then transform with NMDS.

euc_dmat <- compositions::dist(asv_df, method = "euclidean") # From above
euc_nmds <- vegan:: metaMDS(euc_dmat, k = 2, autotransform = FALSE)
Run 0 stress 0.03176447 
Run 1 stress 0.0317942 
... Procrustes: rmse 0.01591051  max resid 0.2519757 
Run 2 stress 0.03271511 
Run 3 stress 0.03191601 
... Procrustes: rmse 0.01869626  max resid 0.3884156 
Run 4 stress 0.03261754 
Run 5 stress 0.03334667 
Run 6 stress 0.0355446 
Run 7 stress 0.0332108 
Run 8 stress 0.03165351 
... New best solution
... Procrustes: rmse 0.0197889  max resid 0.4094653 
Run 9 stress 0.0307389 
... New best solution
... Procrustes: rmse 0.01601655  max resid 0.3265687 
Run 10 stress 0.03206538 
Run 11 stress 0.03114016 
... Procrustes: rmse 0.01832106  max resid 0.3892465 
Run 12 stress 0.03297826 
Run 13 stress 0.02989012 
... New best solution
... Procrustes: rmse 0.02733152  max resid 0.548664 
Run 14 stress 0.03282565 
Run 15 stress 0.03122517 
Run 16 stress 0.03259375 
Run 17 stress 0.03179146 
Run 18 stress 0.03220929 
Run 19 stress 0.03184396 
Run 20 stress 0.03285802 
*** Best solution was not repeated -- monoMDS stopping criteria:
    20: no. of iterations >= maxit

Repeat with the Jaccard transformation

jac_dmat <- vegan:: vegdist(t(asv_df), method = "jaccard") # From above
jac_nmds<- vegan::metaMDS(jac_dmat, k = 2, autotransform = FALSE)
Run 0 stress 0.1423445 
Run 1 stress 0.1919306 
Run 2 stress 0.1373766 
... New best solution
... Procrustes: rmse 0.04095258  max resid 0.1313601 
Run 3 stress 0.1811173 
Run 4 stress 0.1375691 
... Procrustes: rmse 0.01209244  max resid 0.03782542 
Run 5 stress 0.1828614 
Run 6 stress 0.1939918 
Run 7 stress 0.1933375 
Run 8 stress 0.2179407 
Run 9 stress 0.1734027 
Run 10 stress 0.1811173 
Run 11 stress 0.1607189 
Run 12 stress 0.1373766 
... New best solution
... Procrustes: rmse 1.788908e-05  max resid 4.848831e-05 
... Similar to previous best
Run 13 stress 0.1423445 
Run 14 stress 0.1610092 
Run 15 stress 0.1375691 
... Procrustes: rmse 0.01208686  max resid 0.03781589 
Run 16 stress 0.1827459 
Run 17 stress 0.180109 
Run 18 stress 0.1676982 
Run 19 stress 0.1424365 
Run 20 stress 0.2052544 
*** Best solution repeated 1 times

Take a look at stress - overall this value is not extremely informative, but know that the closer stress is to 1 the less representative of your actual data the NMDS is.

euc_nmds$stress 
[1] 0.02989012
jac_nmds$stress
[1] 0.1373766

Additionally, the axes for NMDS are totally arbitrary, so axis scaling does not matter and data can be rotated/reflected about axes and the NMDS is still the same euc_nmds$points #Extract points from NMDS

euc_frame <- data.frame(euc_nmds$points) %>% 
  rownames_to_column(var = "TimeofDay") %>% 
  separate(TimeofDay, into = c("excess", "TimeofDay"), sep = "_") %>% 
  select(-excess) %>% 
  data.frame
Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1000 rows [1, 2, 3, 4, 5,
6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
## Plotting euclidean distance NMDS
ggplot(euc_frame,aes(x = MDS1, y = MDS2, fill = TimeofDay)) +
  geom_point(size = 4, shape = 21, color = "black", aes(fill = TimeofDay)) +
  scale_fill_brewer(palette = "Dark2") +
  theme_bw() +
  labs(x = "NMDS 1", y = "NMDS 2", title = "Euclidean Distance NMDS")