Correlation & Linear regression

Why?

  1. Explore association between two variables, assign a strength to this relationship.

  2. Are variables changing or shifting in the same direction as one another (positive) or in opposition (negative)?

  3. Be careful of assuming causal relationships. Causation != correlation.

  4. Acknowledge what is not measured or observed

Data import & R environment

library(tidyverse)
Warning: package 'ggplot2' was built under R version 4.2.3
Warning: package 'tidyr' was built under R version 4.2.3
Warning: package 'readr' was built under R version 4.2.3
Warning: package 'dplyr' was built under R version 4.2.3
Warning: package 'stringr' was built under R version 4.2.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Linear regression

When we have two continuous variables we can consider a linear regression. The amount of change in variable A that is associated with change in variable B is considered the “covariance of variables” (i.e., how do 2 variables shift or change together?). A simple linear regression should include an independent variable and a dependent variable. Where a Pearson correlation is the linear relationship between two variables. Data needs to be normally distributed.

r = 0 would indicate no relationship exists, specifically no linear relationship exists. As the value of r increases to |1|, the relationship is stronger.

With example data

Let’s explore how to run base code for linear regressions in R.

# install.packages("palmerpenguins")
library(palmerpenguins)
data(package = "palmerpenguins") # Get information on it

Explore some relationships that might exist in this data

penguins %>% 
  group_by(species) %>% 
  summarize(across(where(is.numeric), mean, na.rm = TRUE)) %>% 
  head
Warning: There was 1 warning in `summarize()`.
ℹ In argument: `across(where(is.numeric), mean, na.rm = TRUE)`.
ℹ In group 1: `species = Adelie`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.

  # Previously
  across(a:b, mean, na.rm = TRUE)

  # Now
  across(a:b, \(x) mean(x, na.rm = TRUE))
# A tibble: 3 × 6
  species   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
  <fct>              <dbl>         <dbl>             <dbl>       <dbl> <dbl>
1 Adelie              38.8          18.3              190.       3701. 2008.
2 Chinstrap           48.8          18.4              196.       3733. 2008.
3 Gentoo              47.5          15.0              217.       5076. 2008.
penguins %>% 
  ggplot(aes(x = body_mass_g, y = flipper_length_mm)) +
    geom_point()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Re-plot, but show species.

Are our variable normally distributed?

# hist(penguins$flipper_length_mm)
# hist(penguins$body_mass_g)

Are things correlated?

cor.test(penguins$body_mass_g, penguins$flipper_length_mm, method = "pearson")

    Pearson's product-moment correlation

data:  penguins$body_mass_g and penguins$flipper_length_mm
t = 32.722, df = 340, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.843041 0.894599
sample estimates:
      cor 
0.8712018 

p-value: probability the true correlation is not equal to 0. There is a correlation of some kind.

With 95% confidence, the correlation value is between 0.84 and 0.89. We estimate it to be 0.87

cor.test(penguins$body_mass_g, penguins$flipper_length_mm, method = "spearman")
Warning in cor.test.default(penguins$body_mass_g, penguins$flipper_length_mm, :
Cannot compute exact p-value with ties

    Spearman's rank correlation rho

data:  penguins$body_mass_g and penguins$flipper_length_mm
S = 1066875, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.8399741 

Again, a strong correlation is shown. p-value is significant (low) and rho is closer to 1, so there appears to be a strong correlation to explore.

Regressions in R

Another method to compare between two sets of variables in our data. When we see a strong correlation (from above), we can use the linear model function to explore this relationship further.

?lm()

lm(response ~ terms, data = df)

How well does terms explain response in our data(df)?

Could also be viewed as lm(y ~ x, data = df). How well does x explain y?

For different sizes of penguins, how well does their body mass (in grams) explain the flipper length observed (in mm)?

regression_output <- lm(flipper_length_mm ~ body_mass_g, data = penguins)

# View results:
summary(regression_output)

Call:
lm(formula = flipper_length_mm ~ body_mass_g, data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max 
-23.7626  -4.9138   0.9891   5.1166  16.6392 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.367e+02  1.997e+00   68.47   <2e-16 ***
body_mass_g 1.528e-02  4.668e-04   32.72   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.913 on 340 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.759, Adjusted R-squared:  0.7583 
F-statistic:  1071 on 1 and 340 DF,  p-value: < 2.2e-16

Explanation: Residuals: distance of each point from model output or prediction

Coeffienct: definitions of our relationship - where Estimate reports the slop and intercept of the linear regression (y = mx + b). Therefore:

flipper length = (0.0152 x body mass) + 136.7

Diagnose linear model output

We can explore the results visually. The output from plot() provides a lot of plots to diagnose our lm output.

plot(regression_output)

Fitted vs. residuals - does the linearity assumption hold? When the red line is close to the dashed line, the mean residual at each point is close to 0. When we look across the x-axis, if the trends are similarly spread out, we can check of homoskedasticity as true. We will see extreme outliers at the ends of the plots here as well.

In our penguin dataset, there are violations of the linearity assumption.

Residulas vs. Leverage - this will help us determine the influential data points for the model.

TO DO

  • Add interaction terms

  • Show use of broom

Spearman Rank

Spearman correlations are better for evaluating associations that are nonlinear, but are continuously decreasing or increasing (monotonic). THEN by introducing the “rank”, the analysis shifts into being linear. By changing the scale to produce a ranking of the data or coefficients, Spearman is better suited to deal with outliers, we can input non-normal data, and is used for ordinations.

Explore the data

hist((asv_table |> 
  select_if(is.numeric) |> 
  gather(cols, value))$value)

library(compositions)
Warning: package 'compositions' was built under R version 4.2.3
Welcome to compositions, a package for compositional data analysis.
Find an intro with "? compositions"

Attaching package: 'compositions'
The following objects are masked from 'package:stats':

    anova, cor, cov, dist, var
The following object is masked from 'package:graphics':

    segments
The following objects are masked from 'package:base':

    %*%, norm, scale, scale.default
colnames(asv_table)
 [1] "FeatureID"                             
 [2] "Taxon"                                 
 [3] "GordaRidge_Vent039_SUPRS1_2019"        
 [4] "GordaRidge_Vent040_SUPRS2_2019"        
 [5] "GordaRidge_BSW081_sterivex_2019"       
 [6] "GordaRidge_BSW056_sterivex_2019_REPb"  
 [7] "GordaRidge_Vent088_SUPRS3_2019"        
 [8] "GordaRidge_Plume001_sterivex_2019_REPa"
 [9] "GordaRidge_Plume001_sterivex_2019_REPb"
[10] "GordaRidge_Plume036_sterivex_2019_REPb"
[11] "GordaRidge_Plume096_sterivex_2019"     
[12] "GordaRidge_Vent105_SUPRS9_2019"        
[13] "GordaRidge_Vent041_SUPRS3_2019"        
[14] "GordaRidge_Vent086_SUPRS1_2019"        
[15] "GordaRidge_Vent107_SUPRS11_2019"       
[16] "GordaRidge_Vent009_SUPRS1_2019"        
[17] "GordaRidge_Vent011_SUPRS3_2019"        
[18] "GordaRidge_Vent106_SUPRS10_2019"       
[19] "GordaRidge_Vent010_SUPRS2_2019"        
[20] "GordaRidge_Vent087_SUPRS2_2019"        
# Select only ASV IDs and sample names from Gorda Ridge. Transform the data.
pre_df <- asv_table |> 
  select(FeatureID, starts_with("GordaRidge")) |> 
  column_to_rownames(var = "FeatureID")
# class(pre_df)
head(pre_df)
                                 GordaRidge_Vent039_SUPRS1_2019
000562094e0ee78d5f488dff4c48c294                              8
0009645516609bda2246e1955ff9ec1d                              0
00165708dab37d750eb573b1720ff8bd                              0
0030ad8ce44f257c42daf3673bf92197                              0
0038478be7fb4f097ce93a5e9341af2a                              6
003b5938e31a8c1b1809e0358da894e0                              0
                                 GordaRidge_Vent040_SUPRS2_2019
000562094e0ee78d5f488dff4c48c294                             13
0009645516609bda2246e1955ff9ec1d                              0
00165708dab37d750eb573b1720ff8bd                              0
0030ad8ce44f257c42daf3673bf92197                             12
0038478be7fb4f097ce93a5e9341af2a                              0
003b5938e31a8c1b1809e0358da894e0                              0
                                 GordaRidge_BSW081_sterivex_2019
000562094e0ee78d5f488dff4c48c294                               0
0009645516609bda2246e1955ff9ec1d                              91
00165708dab37d750eb573b1720ff8bd                               0
0030ad8ce44f257c42daf3673bf92197                              36
0038478be7fb4f097ce93a5e9341af2a                             101
003b5938e31a8c1b1809e0358da894e0                              13
                                 GordaRidge_BSW056_sterivex_2019_REPb
000562094e0ee78d5f488dff4c48c294                                    0
0009645516609bda2246e1955ff9ec1d                                    0
00165708dab37d750eb573b1720ff8bd                                    1
0030ad8ce44f257c42daf3673bf92197                                    0
0038478be7fb4f097ce93a5e9341af2a                                   23
003b5938e31a8c1b1809e0358da894e0                                    0
                                 GordaRidge_Vent088_SUPRS3_2019
000562094e0ee78d5f488dff4c48c294                              0
0009645516609bda2246e1955ff9ec1d                              0
00165708dab37d750eb573b1720ff8bd                              0
0030ad8ce44f257c42daf3673bf92197                             35
0038478be7fb4f097ce93a5e9341af2a                              0
003b5938e31a8c1b1809e0358da894e0                              0
                                 GordaRidge_Plume001_sterivex_2019_REPa
000562094e0ee78d5f488dff4c48c294                                      0
0009645516609bda2246e1955ff9ec1d                                      0
00165708dab37d750eb573b1720ff8bd                                      0
0030ad8ce44f257c42daf3673bf92197                                      0
0038478be7fb4f097ce93a5e9341af2a                                     87
003b5938e31a8c1b1809e0358da894e0                                      0
                                 GordaRidge_Plume001_sterivex_2019_REPb
000562094e0ee78d5f488dff4c48c294                                      0
0009645516609bda2246e1955ff9ec1d                                      0
00165708dab37d750eb573b1720ff8bd                                      0
0030ad8ce44f257c42daf3673bf92197                                      0
0038478be7fb4f097ce93a5e9341af2a                                     20
003b5938e31a8c1b1809e0358da894e0                                      0
                                 GordaRidge_Plume036_sterivex_2019_REPb
000562094e0ee78d5f488dff4c48c294                                      0
0009645516609bda2246e1955ff9ec1d                                      0
00165708dab37d750eb573b1720ff8bd                                      0
0030ad8ce44f257c42daf3673bf92197                                      0
0038478be7fb4f097ce93a5e9341af2a                                     21
003b5938e31a8c1b1809e0358da894e0                                      0
                                 GordaRidge_Plume096_sterivex_2019
000562094e0ee78d5f488dff4c48c294                                 0
0009645516609bda2246e1955ff9ec1d                                 0
00165708dab37d750eb573b1720ff8bd                                 0
0030ad8ce44f257c42daf3673bf92197                                 0
0038478be7fb4f097ce93a5e9341af2a                                15
003b5938e31a8c1b1809e0358da894e0                                 0
                                 GordaRidge_Vent105_SUPRS9_2019
000562094e0ee78d5f488dff4c48c294                              0
0009645516609bda2246e1955ff9ec1d                              0
00165708dab37d750eb573b1720ff8bd                              0
0030ad8ce44f257c42daf3673bf92197                              0
0038478be7fb4f097ce93a5e9341af2a                             19
003b5938e31a8c1b1809e0358da894e0                              0
                                 GordaRidge_Vent041_SUPRS3_2019
000562094e0ee78d5f488dff4c48c294                              0
0009645516609bda2246e1955ff9ec1d                              0
00165708dab37d750eb573b1720ff8bd                              0
0030ad8ce44f257c42daf3673bf92197                              0
0038478be7fb4f097ce93a5e9341af2a                              0
003b5938e31a8c1b1809e0358da894e0                              0
                                 GordaRidge_Vent086_SUPRS1_2019
000562094e0ee78d5f488dff4c48c294                              0
0009645516609bda2246e1955ff9ec1d                              0
00165708dab37d750eb573b1720ff8bd                              0
0030ad8ce44f257c42daf3673bf92197                              0
0038478be7fb4f097ce93a5e9341af2a                              0
003b5938e31a8c1b1809e0358da894e0                              0
                                 GordaRidge_Vent107_SUPRS11_2019
000562094e0ee78d5f488dff4c48c294                               0
0009645516609bda2246e1955ff9ec1d                               0
00165708dab37d750eb573b1720ff8bd                               0
0030ad8ce44f257c42daf3673bf92197                               0
0038478be7fb4f097ce93a5e9341af2a                               0
003b5938e31a8c1b1809e0358da894e0                               0
                                 GordaRidge_Vent009_SUPRS1_2019
000562094e0ee78d5f488dff4c48c294                              0
0009645516609bda2246e1955ff9ec1d                              0
00165708dab37d750eb573b1720ff8bd                              0
0030ad8ce44f257c42daf3673bf92197                              0
0038478be7fb4f097ce93a5e9341af2a                              0
003b5938e31a8c1b1809e0358da894e0                              0
                                 GordaRidge_Vent011_SUPRS3_2019
000562094e0ee78d5f488dff4c48c294                              0
0009645516609bda2246e1955ff9ec1d                              0
00165708dab37d750eb573b1720ff8bd                              0
0030ad8ce44f257c42daf3673bf92197                              0
0038478be7fb4f097ce93a5e9341af2a                              0
003b5938e31a8c1b1809e0358da894e0                              0
                                 GordaRidge_Vent106_SUPRS10_2019
000562094e0ee78d5f488dff4c48c294                               0
0009645516609bda2246e1955ff9ec1d                               0
00165708dab37d750eb573b1720ff8bd                               0
0030ad8ce44f257c42daf3673bf92197                               0
0038478be7fb4f097ce93a5e9341af2a                               0
003b5938e31a8c1b1809e0358da894e0                               0
                                 GordaRidge_Vent010_SUPRS2_2019
000562094e0ee78d5f488dff4c48c294                              0
0009645516609bda2246e1955ff9ec1d                              0
00165708dab37d750eb573b1720ff8bd                              0
0030ad8ce44f257c42daf3673bf92197                              0
0038478be7fb4f097ce93a5e9341af2a                              0
003b5938e31a8c1b1809e0358da894e0                              0
                                 GordaRidge_Vent087_SUPRS2_2019
000562094e0ee78d5f488dff4c48c294                              0
0009645516609bda2246e1955ff9ec1d                              0
00165708dab37d750eb573b1720ff8bd                              0
0030ad8ce44f257c42daf3673bf92197                              0
0038478be7fb4f097ce93a5e9341af2a                              0
003b5938e31a8c1b1809e0358da894e0                              0
# glimpse(pre_df)
# rownames(pre_df)

Perform center log ratio analysis

# ?clr()
gorda_clr <- as.data.frame(compositions::clr((pre_df)))

# glimpse(gorda_clr)

# median(gorda_clr$GordaRidge_BSW081_sterivex_2019)

Compute spearman correlation

spear_gorda <- cor(t(gorda_clr), method = "spearman")
Warning in stats::cor(x, y, use, method): the standard deviation is zero

Process Spearman correlation results

spear_gorda_df <- as.data.frame(spear_gorda) |> 
  rownames_to_column(var = "PAIR_A") |> 
    pivot_longer(cols = c(-PAIR_A), names_to = "PAIR_B", values_to = "VALUE")

head(spear_gorda_df)
# A tibble: 6 × 3
  PAIR_A                           PAIR_B                            VALUE
  <chr>                            <chr>                             <dbl>
1 000562094e0ee78d5f488dff4c48c294 000562094e0ee78d5f488dff4c48c294  1    
2 000562094e0ee78d5f488dff4c48c294 0009645516609bda2246e1955ff9ec1d NA    
3 000562094e0ee78d5f488dff4c48c294 00165708dab37d750eb573b1720ff8bd NA    
4 000562094e0ee78d5f488dff4c48c294 0030ad8ce44f257c42daf3673bf92197 -0.396
5 000562094e0ee78d5f488dff4c48c294 0038478be7fb4f097ce93a5e9341af2a  0.370
6 000562094e0ee78d5f488dff4c48c294 003b5938e31a8c1b1809e0358da894e0 NA    
hist(spear_gorda_df$VALUE)

filtered_spear_cor <- spear_gorda_df %>% 
  filter((VALUE > 0.75 | VALUE < -0.75)) %>% 
  filter(!(PAIR_A == PAIR_B))
glimpse(filtered_spear_cor)
Rows: 41,562
Columns: 3
$ PAIR_A <chr> "000562094e0ee78d5f488dff4c48c294", "000562094e0ee78d5f488dff4c…
$ PAIR_B <chr> "05f7b80eeaa398df59a1557b071f4614", "066cc40f99a25ef1ca8b144b8e…
$ VALUE  <dbl> -0.8405959, -0.7911491, -1.0000000, -1.0000000, -0.7911491, -0.…

Add in taxonomic information

# head(asv_table)