## Installations
# if (!require("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
# BiocManager::install("DESeq2")
# BiocManager::install("tximport")eukrhythmic compilation
Working with eukrhythmic output files
List of output directories:
final-files/
00-nucleotide_assembly/MAD.filtered.nospace.fasta: Full length sequences of merged assembly groups (MAD) that have been de-duplicated
01-predicted_proteins/MAD.fasta.transdecoder.cds: Identified coding regions of transcripts from the Merged assembly groups (MAD)MAD.fasta.transdecoder.pep: Translated peptide sequences from the Merged assembly groups (MAD)
02-annotation_table/TaxonomicAndFunctionalAnnotations.csv: Taxonomic levels, GOs, PFAMs, and KEGG KOs - assigned by short seq ID and the full sequence ID.
03-abundance_tables/SeqID_Dict_NoSpaceNames.csv: key to link long sequence read IDs to ShortSeq IDs that will link to taxonomic and functional annotations
intermediate-files/
04-compare14-MAD-mapping-filtered: location of individual salmon outputs for all samples (quant.sf)
Compilation scripts
collate-results.py: description of code
~Arianna - how do you want to handle this?~
Notes for Sarah & Arianna: hopefully we can include a description of taking outputs from eukrhythmic and putting them in specific directories, etc. that can go stright into the run_tximport.R below.
Obtain library normalized count data
R script to run this: scripts/run_tximport.R.
Our goal is to take eukrhthymic estimated counts (from salmon) to get transcript-level abundance estimates that will account for gene length. The outcome from this is that we get average transcript length across samples. For this we want to import all data into DESeq using tximport as an interface.
- tximport
- Why tximport? tximport allows us to correct for the changes in gene length across our samples. This way we can use transcript-level estimates for our metatranscriptomic analysis. Also, tximport knows the output from salmon, so we can easily put the salmon output into tximport. Then it can link directly to edgeR or DESeq2.
- DESeq
- DESeq: Love, M.I., Huber, W., Anders, S. (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15:550. 10.1186/s13059-014-0550-8
- Explaination using DESeq: https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html
- https://bioconductor.org/packages/release/bioc/manuals/DESeq2/man/DESeq2.pdf
Set up R environment
Library required to execute code in R
Load necessary libraries & get number of available threads for data.table.
# | message: false
# | warning: false
library(data.table)
library(tidyr)
library(dplyr)
library(tximport)
library(readr)
num_threads <- getDTthreads()
# Conflicts between tidyverse and data.table can be ignoredOutput files fromrun_tximport.R provide an RData file that can be used for the DESeq input file (tximport_metaT-ALOHA-CA.Rdata).
DESeq2 analysis
Import as DESeq object from tximport.
library(DESeq2)
library(tidyverse)
load("tximport_metaT-ALOHA-CA.Rdata", verbose = TRUE)
# colData : Rows of colData correspond to columns of countData
# design ~ x == serves the purpose of considering biological replicatesWorking with R objects, txi and sample_merged.
By sample name
For taxonomic bar plot and overall cluster analysis. (March and July are separate at ALOHA)
See how to set up subsetting? https://github.com/WHOIGit/super-waffle/blob/main/01_seqpain_deseq.R
ds_tpm_samplename <- DESeqDataSetFromTximport(txi,
colData = sample_merged,
design = ~0 + SAMPLENAME)optional
keep <- rowSums(counts(ds_tpm_samplename)) >= 10
dds <- ds_tpm_samplename[keep,]dds includes > 10. Now dim is 6972094, it was originally, 14737135. Keeping about 47% of data.
Extract count table
library(tximport)
counts_scaled <- makeCountsFromAbundance(
as.matrix(txi$counts),
as.matrix(txi$abundance),
as.matrix(txi$length),
countsFromAbundance = "scaledTPM"
)
counts_df <- as.data.frame(counts_scaled)Rename so replicates have the same name for counts
names_orig <- colnames(counts_df)
names_new <- sub("_[^_]+$", "", names_orig)
colnames(counts_df) <- names_newMean across columns that have the same name - which are replicates.
mean_counts_df <- counts_df %>%
cbind(as.list(.) %>%
Filter(is.numeric, .) %>%
split(names(.)) %>%
lapply(as.data.frame) %>%
lapply(rowMeans) %>%
setNames(paste0("mean.", names(.)))) %>%
select(starts_with("mean"))save(mean_counts_df, sample_merged, file = "Avg_scaled_tpm_08222023.RData")Extract centered data
See “Count data transformations” in https://introtogenomics.readthedocs.io/en/latest/2021.11.11.DeseqTutorial.html
Use variance Stabilizing Transformation (vst) command to estimate dispersion.
# output is a DESeqtransform object
vsd_all <- vst(ds_tpm_samplename)
# make a transformed count matrix
# vsd_blind <- vst(ds_tpm_samplename, blind = FALSE)Example output: > using ‘avgTxLength’ from assays(dds), correcting for library size
Look at output with assay.
head(assay(vsd_all), 4)Extract as a data frame, where values are center normalized - based on avgTxLength.
df_ctr_norm <- as.data.frame(assay(vsd_all))save(df_ctr_norm, file = "normed_center_df_08222023.RData")DEseq2 reanalysis for DE genes
Repeat DESeq2 design to extracted differentially expressed genes for specific patterns below. Subset is required ahead of time - we can do this with txi directly.
There are a few different ways we want to perform the untargeted analysis. First we will do a ‘whole community’ series of analyses to look at overall amount of differentially expressed genes. Below we will set up parameters associated with the queries we’ve planned.
See script scripts/run_get_params.R This is stored on GRACE.
Create subset parameters
Subset samples as parameters
Use the metadata data frame to create sample lists for the comparisons we want to do.
metadata <- read.csv("input-data/sample-list-revised.txt")For the whole community work, we want to compare CA vs NPSG and euphotic vs. subeuphotic.
all_samples <- metadata %>%
select(sample = SAMPLE)
npsg_only <- metadata %>%
filter(PACIFIC_REGION == "NPSG") %>%
select(sample = SAMPLE)
# Compare euphotic and subeuphotic
# Use this to compare across months too
ca_only <- metadata %>%
filter(PACIFIC_REGION != "NPSG") %>%
select(sample = SAMPLE)
# Compare euphotic and subeuphotic
euphotic <- metadata %>%
filter(LIGHT == "Euphotic")%>%
select(sample = SAMPLE)
# Compare NPSG to CA euphotic zone
subeuphotic <- metadata %>%
filter(LIGHT != "Euphotic")%>%
select(sample = SAMPLE)
# Compare NPSG to CA subeuphotic zoneSubset genes & taxa as parameters
Need to make a list of the SequenceID from the taxonomy and functional annotations.
taxfxn <- read.csv("TaxonomicAndFunctionalAnnotations.csv")Genes
Determine character lists from gene list as subsets for our later queries.
# Get list of annotations where a GO, PFAM, and KEGG ID were assigned.
genes_fxn_all <- as.character(
filter(taxfxn, GOs != "-" & PFAMs != "-" & KEGG_ko != "-") %>%
select(SequenceID) %>%
.[["SequenceID"]]) #this line outputs the selected vector from the pipe
# Removing unclassified and unannotated sequences
genes_tax_fxn_all <- as.character(
filter(taxfxn, GOs != "-" & PFAMs != "-" & KEGG_ko != "-"
& Domain == "Eukaryota" & Supergroup != "Unclassified") %>%
select(SequenceID) %>%
.[["SequenceID"]])From KEGG database reformatting make key_geneid. Isolate genes that are curated.
kegg <- read.csv("../KEGG_DB/combined_kegg.csv")
head(kegg$KO_number)
curated_kegg <- read.csv("../KEGG_DB/reformat-kegg-pfam-skh.csv")
key_geneid <- curated_kegg %>%
select(-X) %>%
right_join(kegg %>% select(KEGG = KO_number, everything(), -X)) %>%
distinct() %>%
select(starts_with("KeggOrthology_"), Category01, Category02, FullName, GeneID, Gene_identification, KEGG, PFAM, Descriptions, REF = REFs)
# write.csv(key_geneid, file = "keygene_id.csv")kegg_keep <- as.character(key_geneid %>%
filter(Category01 != "" & !(is.na(Category01))) %>%
select(KEGG) %>%
.[["KEGG"]])
# length(kegg_keep)
genes_kegg_curated <- as.character(taxfxn %>%
mutate(KEGG_mod = str_remove_all(KEGG_ko, "ko:")) %>%
filter(KEGG_mod %in% kegg_keep) %>%
select(SequenceID) %>%
.[["SequenceID"]])Taxa
Isolate transcript IDs (SequenceID) of individual taxa. For each taxa subset, we are also only selecting transcript IDs with PFAM or KEGG IDs.
For queries, we want to isolate:
euks_only <- as.character(
filter(taxfxn, Domain == "Eukaryota" & (PFAMs != "-" | KEGG_ko != "-")) %>%
select(SequenceID) %>%
.[["SequenceID"]])
diatom <- as.character(
filter(taxfxn, Class == "Bacillariophyta" & (PFAMs != "-" | KEGG_ko != "-")) %>%
select(SequenceID) %>%
.[["SequenceID"]])
#Dinoflagellata pPhylum
dinos <- as.character(
filter(taxfxn, Phylum == "Dinoflagellata" & (PFAMs != "-" | KEGG_ko != "-")) %>%
select(SequenceID) %>%
.[["SequenceID"]])
#Haptophyta in phylum
hapto <- as.character(
filter(taxfxn, Phylum == "Haptophyta" & (PFAMs != "-" | KEGG_ko != "-")) %>%
select(SequenceID) %>%
.[["SequenceID"]])
# Chlorophyta
chloro <- as.character(
filter(taxfxn, Phylum == "Chlorophyta" & (PFAMs != "-" | KEGG_ko != "-")) %>%
select(SequenceID) %>%
.[["SequenceID"]])
#Ciliophora
ciliate <- as.character(
filter(taxfxn, Phylum == "Ciliophora" & (PFAMs != "-" | KEGG_ko != "-")) %>%
select(SequenceID) %>%
.[["SequenceID"]])
# Rhizaria at supergroup
rhizaria <- as.character(
filter(taxfxn, Supergroup == "Rhizaria" & (PFAMs != "-" | KEGG_ko != "-")) %>%
select(SequenceID) %>%
.[["SequenceID"]])Save objects for subseting.
save(
# These are data frames
all_samples, npsg_only, ca_only, euphotic, subeuphotic,
# these are lists (character lists)
genes_fxn_all, genes_tax_fxn_all, key_geneid, genes_kegg_curated,
euks_only, diatom, dinos, hapto, chloro, ciliate, rhizaria,
file = "sample-gene-lists-forTXI.RData")Txi function
Code for this function is from: https://rdrr.io/github/hcnh174/hlsgr/man/subsetTxi.html Code: https://rdrr.io/github/hcnh174/hlsgr/src/R/rnaseq.R
# Subset txi directly
subsetTxi <- function(txi, samples, include_genes=rownames(txi$counts))
{
genes <- rownames(txi$counts)[rownames(txi$counts) %in% include_genes]
txi$abundance <- txi$abundance[genes, samples$sample]
txi$counts <- txi$counts[genes, samples$sample]
txi$length <- txi$length[genes, samples$sample]
return(txi)
}
# Example usage
# tmp <- sample(taxfxn$SequenceID,10,replace = FALSE)
# pola <- data.frame(sample = c("PortofLA_1", "PortofLA_2"))
# txi_pola <- subsetTxi(txi, pola, tmp)Whole community DESeq script
Compare across light and pacific region, first isolate only eukaryotes and annotated transcripts.
All samples
See script scripts/run_txi_wholecomm.R.
# Subset eukaryotes only, and keep all samples.
txi_euk_annot <- subsetTxi(txi, all_samples, euks_only)
ds_tpm_light <- DESeqDataSetFromTximport(txi_euk_annot,
colData = sample_merged,
design = ~0 + LIGHT)
ds_tpm_pac <- DESeqDataSetFromTximport(txi_euk_annot,
colData = sample_merged,
design = ~0 + PACIFIC_REGION)
ds_tpm_lightpac <- DESeqDataSetFromTximport(txi_euk_annot,
colData = sample_merged,
design = ~0 + PACIFIC_REGION + LIGHT)
save(ds_tpm_light, ds_tpm_pac, ds_tpm_lightpac, file = "light-pacregion-deseq.RData")By region
See script scripts/run_txi_region.r.
# Subset txi
txi_npsg <- subsetTxi(txi, npsg_only, euks_only)
# Reset sample_merged
## Subset
tmp_sample_merged <- sample_merged %>%
filter(Sample_rep %in% as.character(npsg_only$sample))
## Set names
rownames(tmp_sample_merged) <- tmp_sample_merged$Sample_rep
rownames(tmp_sample_merged) <- colnames(txi_npsg$counts)
# Compare Euphotic vs. sub-euphotic samples in the NPSG
ds_tpm_npsg_light <- DESeqDataSetFromTximport(txi_npsg,
colData = tmp_sample_merged,
design = ~0 + LIGHT)
# Compare July vs. March in the NPSG
ds_tpm_npsg_month <- DESeqDataSetFromTximport(txi_npsg,
colData = tmp_sample_merged,
design = ~0 + MONTH)
save(ds_tpm_npsg_light, ds_tpm_npsg_month, file = "/vortexfs1/scratch/sarahhu/txi-objs-metaT/npsg-deseq.RData")txi_ca <- subsetTxi(txi, ca_only, euks_only)
tmp_sample_merged <- sample_merged %>%
filter(Sample_rep %in% as.character(ca_only$sample))
rownames(tmp_sample_merged) <- tmp_sample_merged$Sample_rep
rownames(tmp_sample_merged) <- colnames(txi_ca$counts)
# Compare euphotic vs. subeuphotic in coastal California
## Includes Port of LA and Catalina
ds_tpm_ca_light <- DESeqDataSetFromTximport(txi_ca,
colData = tmp_sample_merged,
design = ~0 + LIGHT)
save(ds_tpm_ca_light, file = "/vortexfs1/scratch/sarahhu/txi-objs-metaT/ca-deseq.RData")
##
rm(txi_npsg);rm(txi_ca) #Save roomBy light availability
See script scripts/run_txi_light.r.
# Subset txi
txi_euph <- subsetTxi(txi, euphotic, euks_only)
tmp_sample_merged <- sample_merged %>%
filter(Sample_rep %in% as.character(euphotic$sample))
rownames(tmp_sample_merged) <- tmp_sample_merged$Sample_rep
rownames(tmp_sample_merged) <- colnames(txi_euph$counts)
# Compare CA versus NPSG within euphotic samples
ds_tpm_euphotic <- DESeqDataSetFromTximport(txi_euph,
colData = tmp_sample_merged,
design = ~0 + PACIFIC_REGION)
txi_subeuph <- subsetTxi(txi, subeuphotic, euks_only)
tmp_sample_merged <- sample_merged %>%
filter(Sample_rep %in% as.character(subeuphotic$sample))
rownames(tmp_sample_merged) <- tmp_sample_merged$Sample_rep
rownames(tmp_sample_merged) <- colnames(txi_subeuph$counts)
# Compare CA versus NPSG at subeuphotic depths
ds_tpm_subeuphotic <- DESeqDataSetFromTximport(txi_subeuph,
colData = tmp_sample_merged,
design = ~0 + PACIFIC_REGION)
save(ds_tpm_subeuphotic, ds_tpm_euphotic, file = "/vortexfs1/scratch/sarahhu/txi-objs-metaT/euphotic-subeuphotic-deseq.RData")
# save(txi_euph, txi_subeuph, file = "/vortexfs1/scratch/sarahhu/txi-objs-metaT/light_txi.RData")By taxonomic groups - DESeq
Compare and contrast within regions. For part of our analysis we want to subset only the euphotic zone samples. Specifically targeting the difference between the SPOT station surface compared to: NPSG surface, NPSG DCM, NPSG 150m, Port of LA, and Catalina.
Question 1
For the first of our two core questions, we want to know what genes are significantly differentially expressed among the euphotic zone samples for the dominant protistan phytoplankton. We are specifically focusing on diatoms, chlorophytes, archaeplastidia (chlorophya), dinoflagellates, and haptophytes for this.
How does nutrient utilization among euphotic zone phytoplankton vary between coastal California and the NPSG?
/scripts/run_euphotic_deseq_taxa.R
# Function to subset txi by individual taxa and perform DESeq
deseq_region_bytax <- function(sample_set, gene_set){
# First incorporate the txi subset fxn
subsetTxi <- function(txi, samples, include_genes=rownames(txi$counts))
{
genes <- rownames(txi$counts)[rownames(txi$counts) %in% include_genes]
txi$abundance <- txi$abundance[genes, samples$sample]
txi$counts <- txi$counts[genes, samples$sample]
txi$length <- txi$length[genes, samples$sample]
return(txi)
}
# Run subset and sample merge re-set
txi_output <- subsetTxi(txi, sample_set, gene_set)
# This script will keep replacing tmp_sample_merged
tmp_sample_merged <- sample_merged %>%
filter(Sample_rep %in% as.character(sample_set$sample))
rownames(tmp_sample_merged) <- tmp_sample_merged$Sample_rep
rownames(tmp_sample_merged) <- colnames(txi_output$counts)
#
# Import as DESeq object - use PACIFIC_REGION in the design
# DESeq
ds_tpm_output <- DESeqDataSetFromTximport(txi_output,
colData = tmp_sample_merged,
design = ~0 + PACIFIC_REGION)
# return(ds_tpm_output)
# Further process DESeq
groupsize <- 2 # Transcript to consider, must be in at least 3 samples
keep <- rowSums(counts(ds_tpm_output) >= 10) >= groupsize # And have >= to 10 counts
ds_tpm_output_filtered <- ds_tpm_output[keep,]
###
# Filtering stats:
cat("\nStarted with ", dim(ds_tpm_output)[1], "observations. Filtering by 2 samples and 10 counts resulted in,", dim(ds_tpm_output_filtered)[1], ", which is", (100*(dim(ds_tpm_output_filtered)[1]/dim(ds_tpm_output)[1])), "% of the data.\n\n")
###
#
## Positive log fold change == up regulared in CA, compared to NPSG
ds_tpm_output_filtered$PACIFIC_REGION <- factor(ds_tpm_output_filtered$PACIFIC_REGION, levels = c("NPSG", "CA"))
de_tax_output <- DESeq2::DESeq(ds_tpm_output_filtered)
resultsNames(de_tax_output)
summary(de_tax_output)
return(de_tax_output)
cat("\n\nCompleted.\n\n")
}## Apply to each taxa
# From the euphotic samples, what are DE genes among haptophytes between CA and NPSG?
cat("\n\nDiatoms start.\n\n")
de_diatom <- deseq_region_bytax(euphotic, diatom)
cat("\n\Haptophytes start.\n\n")
# From the euphotic samples, what are DE genes among haptophytes between CA and NPSG?
de_hapto <- deseq_region_bytax(euphotic, hapto)
cat("\n\Dinoflagellates start.\n\n")
# From the euphotic samples, what are DE genes among Dinoflagellates between CA and NPSG?
de_dinos <- deseq_region_bytax(euphotic, dinos)
cat("\n\Chlorophytes start.\n\n")
# From the euphotic samples, what are DE genes among chlorophytes between CA and NPSG?
de_chloro <- deseq_region_bytax(euphotic, chloro)
save(de_diatom, de_hapto, de_dinos, de_chloro, file = "/scratch/group/hu-lab/pacocean-metaT/Robjs/euphotic_by_taxa.RData")Question 2
We can repeat the above approach, but select different taxa.
Do we see a similar euphotic vs. sub-euphotic shift in the metabolic potential of taxa observed at coastal California and the NPSG?
Use scripts/run_bydepth_deseq_taxa.R.
Curated genes only
Here, we isolated annotated transcripts that we curated ourselves.
library(tximport)
library(tidyverse)
all_samples <- metadata %>%
select(sample = SAMPLE)
txi_curated <- subsetTxi(txi, all_samples, genes_kegg_curated)
counts_scaled_curated <- makeCountsFromAbundance(
as.matrix(txi_curated$counts),
as.matrix(txi_curated$abundance),
as.matrix(txi_curated$length),
countsFromAbundance = "scaledTPM"
)
counts_df_curated <- as.data.frame(counts_scaled_curated)
names_orig <- colnames(counts_df_curated)
names_new <- sub("_[^_]+$", "", names_orig)
colnames(counts_df_curated) <- names_new
mean_counts_df_curated <- counts_df_curated %>%
cbind(as.list(.) %>%
Filter(is.numeric, .) %>%
split(names(.)) %>%
lapply(as.data.frame) %>%
lapply(rowMeans) %>%
setNames(paste0("mean.", names(.)))) %>%
select(starts_with("mean"))
counts_curated <- mean_counts_df_curated %>%
rownames_to_column(var = "SequenceID") %>%
left_join(taxfxn)
save(counts_curated, sample_merged, file = "Avg_scaled_tpm_curated_08252023.RData")tmp_sample_merged <- sample_merged %>%
filter(Sample_rep %in% as.character(all_samples$sample))
rownames(tmp_sample_merged) <- tmp_sample_merged$Sample_rep
rownames(tmp_sample_merged) <- colnames(txi_curated$counts)
ds_tpm_curated <- DESeqDataSetFromTximport(txi_curated,
colData = tmp_sample_merged,
design = ~0 + SAMPLENAME)
vsd_all <- vst(ds_tpm_curated)
df_ctr_norm_curated <- as.data.frame(assay(vsd_all))
ctr_norm_curated <- df_ctr_norm_curated %>%
rownames_to_column(var = "SequenceID") %>%
left_join(taxfxn)save(ctr_norm_curated, ds_tpm_curated, file = "normed_center_df_curated_08252023.RData")July versus March for NPSG
library(tidyverse)
library(DESeq2)
load("/scratch/group/hu-lab/pacocean-metaT/Robjs/npsg-deseq.RData", verbose = TRUE)
# Compare July vs. March in the NPSG
# ds_tpm_npsg_month <- DESeqDataSetFromTximport(txi_npsg,
# colData = tmp_sample_merged,
# design = ~0 + MONTH)How many genes are differentially expressed between July and March?
## Positive log fold change == up regulated in July, compared to March
ds_tpm_npsg_month$MONTH <- factor(ds_tpm_npsg_month$MONTH, levels = c("March", "July"))
de_month <- DESeq(ds_tpm_npsg_month)
results_month <- results(de_month, alpha=0.05)
# plotMA(results_month)
out <- summary(results_month)
write.csv(out, file = "summary-months.csv")
# Plot log fold change
svg("rplot-month.svg", width = 9, height = 8)
data.frame(results_month) %>%
mutate(REGULATION = case_when(
log2FoldChange > 0 ~ "upregulated in July",
log2FoldChange < 0 ~ "upregulated in March"
),
SIGNIFICANT = case_when(
pvalue <= 0.05 ~ "Significantly",
TRUE ~ "Not significantly"
)) %>%
ggplot(aes(x = baseMean, y = log2FoldChange, color = SIGNIFICANT)) +
geom_point(stat = "identity") +
scale_x_log10() +
theme_classic() +
scale_color_manual(values = c("#878787", "#d73027")) +
labs(title = mcols(results_month)$description[2])
dev.off()
# Get table
month_transcripts <- data.frame(results_month) %>%
mutate(REGULATION = case_when(
log2FoldChange > 0 ~ "upregulated in July",
log2FoldChange < 0 ~ "upregulated in March"
),
SIGNIFICANT = case_when(
pvalue <= 0.05 ~ "Significantly",
TRUE ~ "Not significantly"
)) %>%
filter(SIGNIFICANT == "Significantly") %>%
rownames_to_column(var = "SequenceID")
write.csv(month_transcripts, file = "/scratch/group/hu-lab/pacocean-metaT/month-transcripts.csv")