R ventures

Sarah Hu

10/2/2021

Introduction

The sections below are a compilation of recommended resources in R and common data wrangling practices I use in R that I’ve ‘tutorial-ized’. I welcome additions and recommendations! Note that if it is written below, I probably use it often in my own data analysis.

First, make sure you’re familiar with your R set up. I typically code in R Studio and there are already a ton of great resources for installing and navigating R and R Studio. Everyone’s set up may be a little bit different - I’ve compiled some resources and notes on each resource that might help you decide on how to proceed.

About me

The below information is meant to be a quick guide to some common analyses and data visualization techniques in R. I recommend other tutorials along the way for newer R users.

Installing R & RStudio

Tutorials on R installation & start-up

If you’re brand new to R and the command line, explore the resources through Software Carpentry. This is a great resource for learning how to install RStudio on your computer and getting oriented in the R interface - Introduction to R and RStudio.

Through the BVCN we offered an introduction to R using the browser-supported RStudio cloud. If you already have RStudio installed and working, you can follow Lesson 1. We even have a video.

Another fantastic resource is Riffomonas. See Pat Schloss’s instructions for installing R. I prefer video content when learning new skills, so I appreciate the work Pat’s put into his website and teaching mini R tutorials.

R via Anaconda

If your life is maintained through Anaconda environments, you may want to check out my post on using conda and R. This is how I run R on my computer and it works for me, but it is so I can maintain several R versions and separate environments for different project.

Importing data

If you are unfamiliar with the command line and want to learn more about paths and how to navigate in your terminal I’d recommend following this crash course in Unix. The below instructions will assume you have an understanding of how R is installed on your computer and you can locate the files you need!

Note that gaining a little bit of experience in Unix and command line efficiency is recommended because this is the same mode of thinking required to build up your R environment and navigate to access various files. However, the same commands that work on the command line are not in the R programming language. While many of the words and terms are similar, when you launch R - you’re entering an environment where the R language is required instead.

Data types that work best

One of the more frustrating components of learning any computer language or program is figuring out what I need to start. The most common issues come from a lot of hidden characters and weird inconsistencies between computers and software. Two main recommendations: * Download and get familiar with a good text editor. I use Text Wrangler / BBEdit get it here. * Get familiar with .csv file types. The way data is separated (is it a tab or a comma or a space?) in a data frame is what you need to be aware of when importing data into R. .csv files are a standard file type, so if you’re sharing your data with a collaborator or friend, this is the most straight-forward way to do it to avoid any incompatibilities.

Example: importing data

Grab your own .csv file or download one here. To download, right click here and select Download Linked File As... Save this as a .csv file in a place you can find again.

I have this file saved in a directory called data-input/.

asv_18s_raw <- read.csv(file = "data-input/asv-gordaridge-18s-wide.csv")

View this file in Excel and your favorite text editor to see what it looks like in various formats. You can also check it out in RStudio by running these

## Uncomment the below line and run in RStudio View(asv_18s_raw)

## Print out the first few lines of every column
head(asv_18s_raw)

Importing a .txt file. Download from here.

asv_16s_raw <- read.delim(file = "data-input/asv-gordaridge-16s-wide.txt", sep = "\t")

Try the above commands and modify the sep= and header= options. Pull up the help menu to learn more by executing: ?read.delim() and ?read.csv(). Being familiar with the terminology in these help menus is helpful for learning about the different file types that exist.

Data management & organization

I’d recommend you work with RStudio projects. By containing a project within an RStudio Project, I simply open the .Rproj and it opens up RStudio to the location I need to access all the relevant code and data files to work on my project. Software Carpentry also maintains a good how to use RStudio projects resource that I recommend.

For each project I work on, I maintain a /data-input folder (or directory) that has all of my files I need to import for analysis. Similarly, I typically make a separate directory for output files and figures.

I’d recommend you make up from rules for yourself. My main coding advice is always: keep current you organized, so future you thanks past you. Add lots of notes for yourself on how you organize your work and how you leave it! Then when you open it back up and realize you’ve forgotten everything - there is a nice note from past you.

To leave notes for yourself in R, use the # symbol. In RStudio the code will turn a separate color and this means it is non-executable (it won’t think that line is code!). Use this to “comment” lines and add notes for yourself.

As of now (mid-2021), I have one directory on my computer per project. Within that directory I have one RStudio project and directories for raw data, input and processed data, and output folders for products. My own rule is that I can re-create the entire directory file structure and outputs by running the R code with all the raw input data.

Get basic stats from data

First, let’s work with three different data sets. These are available here. Two ways to download them: * Right click to copy the link and in a terminal run wget <paste link here> * Right click and select Download Linked File As…

Follow along with this lesson from Software Carpentry that details work in dataframes. First, generate this ‘cats’ dataset:

cats <- data.frame(coat = c("calico", "black", "tabby"), weight = c(2.1, 5, 3.2),
    likes_string = c(1, 0, 1))
# write.csv(x = cats, file = 'feline-data.csv', row.names = FALSE)

Import example data

This is tag-sequencing data from an expedition to the Gorda Ridge hydrothermal vent ecosystem.

The first few things I do is to look at the summary statistics

# Dimensions of my data frame
dim(asv_18s_raw)

# Summary information
summary(asv_18s_raw)

Since the example data I’ve provided are from Amplicon Sequence Variants (or ASVs), each row is an ASV (Feature.ID column) and each column is a sample.

# Count number of unique ASVs
length(unique(asv_18s_raw$Feature.ID))

# list sample names in my data
colnames(asv_18s_raw)

Data wrangling

I do all of my data wrangling and analysis using Tidyverse and associated R packages. Tutorials in Riffomonas and from BVCN review code in base R and in Tidyverse.

Get started by installing tidyverse.

# install.packages('tidyverse')
library(tidyverse)

When I Google and troubleshoot my code, I typically type “R tidyverse <how do I do X?>”. This way I get an answer (I favor Stackoverflow for this) that is within the Tidyverse world. I am more familiar with the grammar of these packages, so I learn more this way.

Modify columns in my data frame

Upon viewing my data from above, there’s a few random columns I don’t really want.

Remove the column header X.

select(asv_18s_raw, -X)

Remove the column that starts with Axial_.

select(asv_18s_raw, -starts_with("Axial_"))

Combine these functions, also remove references sequences, and save to a new data frame.

Let’s introduce the pipe from magrittr (which is a part of tidyverse). We also want to change the name of my column from Feature.ID to ASV. So we can combine everything with a pipe-based structure.

# Save to the same data frame output
asv_18s_mod <- asv_18s_raw %>%
    select(-starts_with("Axial_"), -X, -ReferenceSequence) %>%
    select(ASV = Feature.ID, everything())

# Add head at the end to troubleshoot as you go
asv_18s_raw %>%
    select(-starts_with("Axial_"), -X, -ReferenceSequence) %>%
    select(ASV = Feature.ID, everything()) %>%
    head

To continue exploring navigating with data frames, look at Riffomonas or Software Carpentry.

Long versus short format

Available within the tidyverse is the starwars example data. We can use this to quickly show a few examples. Load the starwars example data.

# library(tidyverse)
data("starwars")

The data frame is already in wide format, so when we subset columns of interest we output the wide format. Select character name and the character’s height and mass.

starwars %>%
    select(name, height, mass) %>%
    head

# A tibble: 6 × 3
  name           height  mass
  <chr>           <int> <dbl>
1 Luke Skywalker    172    77
2 C-3PO             167    75
3 R2-D2              96    32
4 Darth Vader       202   136
5 Leia Organa       150    49
6 Owen Lars         178   120

Repeat that, but let’s chose to convert the height and mass columns to long format. The new column we create will be called size, which defines height or mass.

starwars %>%
    select(name, height, mass) %>%
    pivot_longer(cols = height:mass, names_to = "size") %>%
    head

# A tibble: 6 × 3
  name           size   value
  <chr>          <chr>  <dbl>
1 Luke Skywalker height   172
2 Luke Skywalker mass      77
3 C-3PO          height   167
4 C-3PO          mass      75
5 R2-D2          height    96
6 R2-D2          mass      32

# ?pivot_longer() ## Check out the help menu too

Convert data to long format

In this example, all the sample names that I want put into long format start with GordaRidge_. I can use starts_with() to select all of them.

asv_18s_raw %>%
    select(-starts_with("Axial_"), -X, -ReferenceSequence) %>%
    select(ASV = Feature.ID, everything()) %>%
    # Pivot longer parameter:
pivot_longer(starts_with("GordaRidge_"), names_to = "samples") %>%
    head

Separate sample names

My sample names are a pretty long collection of underscores. Let’s separate them.

asv_18s_raw %>%
    select(-starts_with("Axial_"), -X, -ReferenceSequence) %>%
    select(ASV = Feature.ID, everything()) %>%
    pivot_longer(starts_with("GordaRidge_"), names_to = "samples") %>%
    # Added separate here:
separate(samples, c("Site", "VentID", "SampleType", "Year", "Rep"), sep = "_") %>%
    head

The default parameter in separate is to remove the original column that you separate. I often change this, because I want to keep as much of the original data frame intact.

Also, there may be a warning or error message that stats there was missing information. This is because not all of my samples had replicates. So an “NA” was placed in these spots. We can ignore this.

asv_18s_raw %>%
    select(-starts_with("Axial_"), -X, -ReferenceSequence) %>%
    select(ASV = Feature.ID, everything()) %>%
    pivot_longer(starts_with("GordaRidge_"), names_to = "samples") %>%
    # Set the remove argument to 'FALSE'
separate(samples, c("Site", "VentID", "SampleType", "Year", "Rep"), sep = "_", remove = FALSE) %>%
    head

I also want to separate the taxonomic names. Let’s add this step and set this equal to a new data frame.

# Set equal to a new data frame
asv_18s_mod <- asv_18s_raw %>%
    select(-starts_with("Axial_"), -X, -ReferenceSequence) %>%
    select(ASV = Feature.ID, everything()) %>%
    pivot_longer(starts_with("GordaRidge_"), names_to = "samples") %>%
    separate(samples, c("Site", "VentID", "SampleType", "Year", "Rep"), sep = "_",
        remove = FALSE) %>%
    # Separate taxa
separate(Taxon_updated, c("Kingdom", "Supergroup", "Division", "Class", "Order",
    "Family", "Genus", "Species"), sep = ";", remove = FALSE)

Subset samples

Now that my sample names are separated, there are a few sample types I do not want. I can conditionally remove them.

First, list unique values in my Sample types:

unique(asv_18s_mod$SampleType)

 [1] "T24"      "sterivex" "T0"       "T36"      "CTRL"     "SUPRS1"  
 [7] "SUPRS2"   "SUPRS3"   "SUPRS9"   "SUPRS10"  "SUPRS11"

I want to remove the Tx sample types (those are time points I don’t want to look at for now). We will use the tidyverse function filter() for this.

# Conditionally, define all of them as a character list
remove_tx <- c("T24", "T0", "T36")

We can use filter by telling it that we do not want to include the Sample Types equal to my list. filter(asv_18s_mod, !(SampleType %in% remove_tx)). This is noted with the ! outside of the (). If we remove that, we would select only Sample Types that appear in my list. filter(asv_18s_mod, SampleType %in% remove_tx)

Let’s remove those time points and the control sample using two filter commands.

asv_18s_filtered <- asv_18s_mod %>%
    filter(!(SampleType %in% remove_tx)) %>%
    filter(SampleType != "CTRL")

unique(asv_18s_filtered$SampleType)  #Controls and time points are not included in my sample types anymore.
# head(asv_18s_filtered)

Another way we can use filter is to clean up data by value. In this case, I want to remove ASVs that have only 1 sequence.

asv_18s_filtered <- asv_18s_mod %>%
    filter(!(SampleType %in% remove_tx)) %>%
    filter(SampleType != "CTRL") %>%
    # Value must be greater than 1
filter(value > 1)

Summarize & average counts

The next step is to really start learning some stuff from these data. I want to summarize by certain taxonomic levels, but I first want to average across replicates (Reps!).

For this, we will use the very useful group_by() function. This is very handy for data analysis in tidyverse, it is the primary reason the long format data are preferred (in my opinion).

group_by() and other functions have their own informative help pages that you can find on the tidyverse website or via the help menu ?group_by() in R.

Average across samples

Listed columns in group_by() will be the columns you want in your final data frame. In this case, I am selecting everything expect for the columns that have information on the replicate IDs. This way, the mean() of the value will be taken among replicate samples.

We use summarise() to create a new table based on the information provided to group_by(). I am creating a new data frame with a column AVERAGE that will be the mean of the sequence counts from the value column.

# names(asv_18s_filtered)

asv_18s_avg <- asv_18s_filtered %>%
    group_by(ASV, Taxon_updated, Site, VentID, SampleType) %>%
    summarise(AVERAGE = mean(value))

## Above function calculates the mean across replicates (because replicate IDs
## were not included in the group_by argument)

After averaging across replicates, I now want sum sequence counts within a specific taxonomic group.

I didn’t want to type out the different taxonomic levels into my group_by function, so let’s separate those again. THEN summarize to the Supergroup level. Additionally, for unassigned ASVs to the Supergroup level, the final mutate statement replaces NA with Eukaryote Unassigned.

asv_supergroup <- asv_18s_avg %>%
    separate(Taxon_updated, c("Kingdom", "Supergroup", "Division", "Class", "Order",
        "Family", "Genus", "Species"), sep = ";", remove = FALSE) %>%
    # Select supergroup and sample ID columns
group_by(Supergroup, VentID, SampleType) %>%
    summarise(SUM = sum(AVERAGE)) %>%
    # Replace NAs in the Supergroup column
mutate(Supergroup = replace_na(Supergroup, "Eukaryote Unassigned"))

The difference between summarize() and mutate(), is that summarize() generates a new data frame that is smaller than your original, while mutate adds a column to your existing data frame. If the above command prints an error, note that it is likely because some of the separating taxonmy to the species level has left some columns blank.

Plotting my data

Simple bar plot

Plot the Supergroup level relative abundance from data frame created above.

Bar plot customization

Let’s explore changing the aesthetics of this figure.

Custom colors (factoring)

For bar plots, the outline of the bar is the color and I want the fill color to be associated with Supergroup. The fill parameter is inside the aes() argument in the first line of the ggplot code.

This means we can change the fill color manually or with built in palettes by adding a line scale_fill_manual() or scale_fill_brewer().

Try a few of the built-in brewer options:

ggplot(asv_supergroup, aes(x = VentID, y = SUM, fill = Supergroup)) +
  geom_bar(stat = "identity", color = "black", position = "fill") +
  # Set the fill color palette
scale_fill_brewer(palette = "Set3")

Execute ?scale_fill_brewer() to see the color options one can select with palette.

Often, we are working with too many categories or we want our own custom colors. For this example we will use the 10 categories within Supergroup. You can see how many classifications there are by typing length(unique(asv_supergroup$Supergroup)).

ggplot(asv_supergroup, aes(x = VentID, y = SUM, fill = Supergroup)) +
  geom_bar(stat = "identity", color = "black", position = "fill") +
  # Set 10 fill colors
scale_fill_manual(values = c("#f768a1","#ef3b2c", "#fd8d3c", "#fff7bc", "#d95f0e", "#7fcdbb", "#238443","#2b8cbe", "#6a51a3","#bcbddc"))

But, what if we want to change the order of the items in the legend and assign colors to specific supergroups?

First, factor the Supergroup column by adding a new column that lists the Supergroup factors in the order you want.

Repeat ggplot with the new column as the fill argument and set the manual fill values to the list of colors.

# I've set the fill parameter to SUPERGROUP_ORDER, which is a factor that I set above
ggplot(asv_supergroup, aes(x = VentID, y = SUM, fill = SUPERGROUP_ORDER)) +
  geom_bar(stat = "identity", color = "black", position = "fill") +
  # Set fill color to character list from above
scale_fill_manual(values = colors_ten)

The reason we’d want to assign colors to specific Supergroups, is because there are cases where we’d want to remove a group.

Let’s remove Opisthokonta, but keep the same color schema. We can use the pipe other tidy statements like filter() ahead of the ggplot statement to modify the data before it is plotted.

# Start with the data frame
asv_supergroup %>% 
  filter(Supergroup != "Opisthokonta") %>% #Filter
  # Add ggplot statement at end of pipe
  ggplot(aes(x = VentID, y = SUM, fill = SUPERGROUP_ORDER)) +
    geom_bar(stat = "identity", color = "black", position = "fill") +
  # Set fill color to character list from above
    scale_fill_manual(values = colors_ten)

`theme()`

Now that we have our custom color schematic, we can also modify other components of the ggplot using theme(). There are built in ggplot themes or there is a long list of cosmetic changes you can make with theme().

Text modification

I want to rotate the labels along the x-axis, so I can use the theme argument axis.text.x to modify the text element of the plot. I’ve also added theme_bw().

asv_supergroup %>% 
  filter(Supergroup != "Opisthokonta") %>%
  ggplot(aes(x = VentID, y = SUM, fill = SUPERGROUP_ORDER)) +
    geom_bar(stat = "identity", color = "black", position = "fill") +
    scale_fill_manual(values = colors_ten) +
  ## Add theme components here:
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90))

We can change the color, angle, and justification of the axis labels more. Try the commented out and theme option instead of the last two lines to see how this builds upon our base axis labels.

asv_supergroup %>% 
  filter(Supergroup != "Opisthokonta") %>%
  ggplot(aes(x = VentID, y = SUM, fill = SUPERGROUP_ORDER)) +
    geom_bar(stat = "identity", color = "black", position = "fill") +
    scale_fill_manual(values = colors_ten) +
  ## Additional axis label modifications
  theme_bw() +
  # theme(axis.text.x = element_text(angle = 45, color = "black"))
  theme(axis.text.x = element_text(angle = 45, color = "black", face = "bold", size = 12),
        axis.text.y = element_text(color = "black", face = "bold", size = 12))

In ggplot2 we also need to include a horizontal or vertical justification, or hjust and vjust.

asv_supergroup %>% 
  filter(Supergroup != "Opisthokonta") %>%
  ggplot(aes(x = VentID, y = SUM, fill = SUPERGROUP_ORDER)) +
    geom_bar(stat = "identity", color = "black", position = "fill") +
    scale_fill_manual(values = colors_ten) +
  ## Additional axis label modifications
  ## hjust and vjust added for x axis.
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, color = "black", face = "bold", 
                                   size = 12, hjust = 1, vjust = 1),
        axis.text.y = element_text(color = "black", face = "bold", size = 12))

This hjust and vjust argument can also be explained using this figure:

Source from Stackoverflow

Label axes

To label the x and y axis, we can use the function labs(), so we can assign x = “x-axis label”, y = “y-axis label”, and title = “optional figure title”. In the below example, I’ve only added a y-axis label and left the x-axis label blank. This is because I already have my samples named along the x-axis.

asv_supergroup %>% 
  filter(Supergroup != "Opisthokonta") %>%
  ggplot(aes(x = VentID, y = SUM, fill = SUPERGROUP_ORDER)) +
    geom_bar(stat = "identity", color = "black", position = "fill") +
    scale_fill_manual(values = colors_ten) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, color = "black", 
                                   size = 12, hjust = 1, vjust = 1),
        axis.text.y = element_text(color = "black", size = 12)) +
  ## Add axis labels
  labs(x = "", y = "Relative abundance")

A parting note on bar plots - consider when a box plot (or violin plot) is more appropriate than a bar plot.

Box & violin plots

As an example, will use the Star Wars dataset through tidyverse again.

library(tidyverse)

# First what is the data I'm looking at on Tatooine
# View(starwars)

tatooine <- starwars %>% 
  filter(homeworld == "Tatooine") %>% # select only those from Tatooine
  data.frame

How many species are on Tatooine?

unique(starwars$species); length(unique(starwars$species))

 [1] "Human"          "Droid"          "Wookiee"        "Rodian"        
 [5] "Hutt"           "Yoda's species" "Trandoshan"     "Mon Calamari"  
 [9] "Ewok"           "Sullustan"      "Neimodian"      "Gungan"        
[13] NA               "Toydarian"      "Dug"            "Zabrak"        
[17] "Twi'lek"        "Vulptereen"     "Xexto"          "Toong"         
[21] "Cerean"         "Nautolan"       "Tholothian"     "Iktotchi"      
[25] "Quermian"       "Kel Dor"        "Chagrian"       "Geonosian"     
[29] "Mirialan"       "Clawdite"       "Besalisk"       "Kaminoan"      
[33] "Aleena"         "Skakoan"        "Muun"           "Togruta"       
[37] "Kaleesh"        "Pau'an"

[1] 38

unique(tatooine$species); length(unique(tatooine$species))

[1] "Human" "Droid"

[1] 2

Let us examine the number of species on Tatooine.

starwars %>% 
  filter(homeworld == "Tatooine") %>%
  group_by(species, homeworld) %>%
  summarise(MEAN_height = mean(height), MEDIAN_height = median(height),
            MAX_height = max(height), MIN_height = min(height),
            MEAN_mass = mean(mass), MEDIAN_mass = median(mass))

# A tibble: 2 × 8
# Groups:   species [2]
  species homeworld MEAN_height MEDIAN_height MAX_height MIN_height MEAN_mass
  <chr>   <chr>           <dbl>         <dbl>      <int>      <int>     <dbl>
1 Droid   Tatooine         132           132         167         97      53.5
2 Human   Tatooine         179.          180.        202        163      NA  
# … with 1 more variable: MEDIAN_mass <dbl>

Try it as a bar plot first.

ggplot(tatooine, aes(x = species, y = height)) +
  geom_bar(stat = "identity")

Examine data as a histogram.

hist((filter(tatooine, species == "Human"))$height)

hist((filter(tatooine, species == "Droid"))$mass)

A better way to represent this data is with a box plot.

ggplot(tatooine, aes(x = species, y = height)) +
  geom_boxplot()

In these box plots, the middle of the box is the median. The upper and lower hinges are the 1st and 3rd quartiles (25th and 75th percentiles). Whiskers on either side represent the largest and lowest value - outliers from this will be shown as data points.

We can add data points on top of the box plot.

ggplot(tatooine, aes(x = species, y = height)) +
  geom_boxplot() +
  geom_point()

Or try it with geom_jitter()

ggplot(tatooine, aes(x = species, y = height)) +
  geom_boxplot() +
  geom_jitter()

Repeat this box plot & data points with mass.

ggplot(tatooine, aes(x = species, y = mass)) +
  geom_boxplot() +
  geom_jitter()

We can also add aesthetics similar to what we did above with the bar plot and set the plot equal to an R object.

# Let's add aesthetics and set it equal to a variable
bar_num_species <- ggplot(tatooine, aes(x = species)) +
  geom_bar(stat = "count", width = 0.4, color = "black", fill = "orange") +
  # Note different in color vs. fill with ggplot
  theme_bw() + # always google themes!
  labs(x = "Species", y = "Total number found on Tatooine") +
  theme(axis.text.x = element_text(face = "bold", color = "black"))


bar_num_species

### Facets Example of pivoting to long format and using facet_grid().

Create long format star wars data set

And an example where we will assign colors to the species in our plot first:

# Factor colors:
species_order <- c("Human", "Droid")
species_color <- c("#c51b8a", "#31a354")

starwars_long$SPECIES_ORDER <- factor(starwars_long$species, levels = species_order)
names(species_color) <- species_order

#
height_mass <- ggplot(starwars_long, aes(x = species, y = VALUE)) +
  geom_jitter() +
  geom_boxplot(aes(fill = species), color = "black") +
  scale_fill_manual(values = species_color) +
  facet_grid(. ~ MEASUREMENT) + # Play with this
  labs(x = "Species", y = "Value") +
  theme_bw() +
  theme(axis.text = element_text(color = "black"),
        strip.background = element_blank(),
        legend.position = "none")

height_mass

height_mass %+% subset(starwars_long, species %in% "Human")

height_mass %+% subset(starwars_long, species %in% "Droid")

Combining plots

We can use the R package “patchwork”

Lets create a section plot called count.

count <- ggplot(starwars_long, aes(x = species)) +
  geom_bar(stat = "count", aes(fill = species), color = "black") +
  scale_fill_manual(values = species_color) +
  labs(x = "Species", y = "Count") +
  theme_bw() +
  theme(axis.text = element_text(color = "black"),
        strip.background = element_blank(),
        legend.position = "none")

Use pathwork to stitch them together. Either side by side:

Or on top of one another:

Tile plots & heatmaps

Lets use the Star Wars dataset through tidyverse again.

library(tidyverse)
# First what is the data I'm looking at on Tatooine
# View(starwars)

For a simple tile plot or heat map, lets plot character name to height and mass. Of course this needs to be in long format first.

**To add some additional skills, below I show how to use the ‘pipe’ or %>% in R. And how we can modify dataframes without creating new R objects.

# A tibble: 174 × 3
   name           TRAIT  VALUE
   <chr>          <chr>  <dbl>
 1 Luke Skywalker height   172
 2 Luke Skywalker mass      77
 3 C-3PO          height   167
 4 C-3PO          mass      75
 5 R2-D2          height    96
 6 R2-D2          mass      32
 7 Darth Vader    height   202
 8 Darth Vader    mass     136
 9 Leia Organa    height   150
10 Leia Organa    mass      49
# … with 164 more rows

Here I show how we can use geom_tile(). First, lets isolate droids only.

Now modify some of the aesthetics and ggplot theme.

This isn’t that informative because there are only two traits shown here, and the values are not comparable (height is length and mass is weight!). But this demonstrates the general workflow for how to create a heatmap in R. There are several options to modify the color schematic, let’s explore those first and then move on to a more complicated example.

To include custom color schematics and sequential breaks see this page for more information.

Heat map instead of bar plot

Return to the bar plot from above and remake it as a heat map.

To start again: Grab your own .csv file or download one here. To download, right click here and select Download Linked File As... Save this as a .csv file in a place you can find again.

I have this file saved in a directory called data-input/.

asv_18s_raw <- read.csv(file = "data-input/asv-gordaridge-18s-wide.csv")

Repeat data modification in fewer steps - form new R object called ‘asv_18s_forheat’

A strength of the heat map for making taxonomy plots is that you can typically include more information. In the above example, we averaged across replicates and summed to the supergroup level. We can include division and class this time.

Lets try to create thee bar plot again, but focusing only on the supergroups: Rhizaria and Alveolata

asv_18s_class %>% 
  filter(Supergroup == "Rhizaria" | Supergroup == "Alveolata") %>%
  ggplot(aes(x = VentID, y = SUM, fill = Class)) +
    geom_bar(stat = "identity", color = "black", position = "fill") +
  ## Additional axis label modifications
  ## hjust and vjust added for x axis.
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, color = "black", face = "bold", 
                                   size = 12, hjust = 1, vjust = 1),
        axis.text.y = element_text(color = "black", face = "bold", size = 12))

**This is way too many colors!

Convert to a heat map, but first need to transform the value data to relative abundance.

asv_18s_class %>% 
  filter(Supergroup == "Rhizaria" | Supergroup == "Alveolata") %>%
  group_by(VentID) %>% 
  mutate(SAMPLE_TOTAL = sum(SUM),
         RelAbun = SUM/SAMPLE_TOTAL) %>% 
  ggplot(aes(x = VentID, y = RelAbun, fill = Class)) +
  # Modified position = "stack" here to demonstrate that it is the same outcome as position = "fill"
    geom_bar(stat = "identity", color = "black", position = "stack") +
  ## Additional axis label modifications
  ## hjust and vjust added for x axis.
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, color = "black", face = "bold", 
                                   size = 12, hjust = 1, vjust = 1),
        axis.text.y = element_text(color = "black", face = "bold", size = 12))

NOW, let’s convert to a heatmap.

Repeat with theme changes

All of the classes are listed together. We can use facet wrap to separate by Supergroup.

To explore these plots in more detail, see these examples: * Hu et al. 2021 code

Hu et al. in review

Other resources for plotting data

Advanced ggplot in R | Software Carpentries course taught at WHOI