Location for raw files: /scratch/group/hu-lab/meta-eukomics/raw-data. The individual R1 and R2 files are 2.9 and 2.7 GB, respectively.
1.1 Background reading
Cohen NR, Alexander H, Krinos AI, Hu SK, Lampe RH. Marine Microeukaryote Metatranscriptomics: Sample Processing and Bioinformatic Workflow Recommendations for Ecological Applications. Frontiers in Marine Science 2022; 9.
Krinos AI, Cohen NR, Follows MJ, Alexander H. Reverse engineering environmental metatranscriptomes clarifies best practices for eukaryotic assembly. BMC Bioinformatics 2023; 24: 74.
1.2 Working environment
A high-performance computing (HPC) cluster hosted by my university: the TAMU HPRC.
2 Step-by-step
2.1 1. Determine assembly groups
When we have more than one sample, it is best to “group” samples for the assembly steps. This often takes a few attempts and is a balance between the ideal grouping plan and how much computational power you have available.
2.2 2. Trim & QC
Initial Fastqc
Code for slurm script run on the HPC to run fastqc (fastqc.slurm). For this set of samples, it took 10 minutes and used 290 MB. CPU used: 00:12:27.
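```
module load FastQC/0.11.9-Java-11
fastqc /scratch/group/hu-lab/meta-eukomics/raw-data/HS039_S90_L004_R1_001.fastq.gz
fastqc /scratch/group/hu-lab/meta-eukomics/raw-data/HS039_S90_L004_R2_001.fastq.gz
```
In order to look at the output .html files, they need to be opened locally, e.g.:
```
scp $HPRC-ADDRESS:/scratch/group/hu-lab/meta-eukomics/raw-data/HS039_S90_L004_R1_001_fastqc.html meta-eukomic/output-files/
```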
Then we need to trim the individual reads, removing any sequencing-based primers, adapters, etc. We can use the program Trimmomatic for this. R1 reads are forward and R2 are reverse. Trimmomatic takes the input reads plus names for the output trimmed and unpaired reads (the latter of which are discarded).
Another required input file is a list of the possible primers and adapters from sequencing, adapters-primers.fa. Navigate to the GitHub repo for this project and save a copy of slurm-scripts/adapters-primers.fa to the same location where you're running your Trimmomatic code.
Slurm script: trim_fastqc.slurm. The trimming parameters below include:
Remove adapters, found in adapters-primers.fa (make sure you save a copy of this file)
Remove leading low quality or N bases (below quality 3) (LEADING:3)
Remove trailing low quality or N bases (below quality 3) (TRAILING:3)
Scan the read with a 4-base wide sliding window, cutting when the average quality per base drops below 10 (SLIDINGWINDOW:4:10)
Drop reads shorter than 50 bases long (MINLEN:50)
Code
```
module load Trimmomatic/0.39-Java-11

java -jar $EBROOTTRIMMOMATIC/trimmomatic-0.39.jar PE \
  /scratch/group/hu-lab/meta-eukomics/raw-data/HS039_S90_L004_R1_001.fastq.gz \
  /scratch/group/hu-lab/meta-eukomics/raw-data/HS039_S90_L004_R2_001.fastq.gz \
  /scratch/group/hu-lab/meta-eukomics/raw-data/HS039_S90_L004_R1_trimmed.fastq.gz \
  /scratch/group/hu-lab/meta-eukomics/raw-data/HS039_S90_L004_R1_unpaired.fastq.gz \
  /scratch/group/hu-lab/meta-eukomics/raw-data/HS039_S90_L004_R2_trimmed.fastq.gz \
  /scratch/group/hu-lab/meta-eukomics/raw-data/HS039_S90_L004_R2_unpaired.fastq.gz \
  ILLUMINACLIP:adapters-primers.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:10 MINLEN:50
# Note: "$EBROOTTRIMMOMATIC/trimmomatic-0.39.jar" is specific to the HPRC system we are using

echo "Trimmomatic complete. Repeating fastqc"

module load FastQC/0.11.9-Java-11
fastqc /scratch/group/hu-lab/meta-eukomics/raw-data/HS039_S90_L004_R1_trimmed.fastq.gz
fastqc /scratch/group/hu-lab/meta-eukomics/raw-data/HS039_S90_L004_R2_trimmed.fastq.gz
```
The above trimming steps took 615 MB of memory and 2 hours.
2.3 3. Remove ribosomal RNA
SortMeRNA installation guidelines and usage can be found at https://sortmerna.readthedocs.io/en/latest/installation.html.
On the TAMU HPRC, this is where we needed to make a conda environment to run this.
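```
# Load anaconda
module load Anaconda3/2022.05

# Modify bioconda config for sortmeRNA
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict
```
Search for sortmerna using conda:
```
conda search sortmerna
```
Create a conda environment specific for sortmerna, activate it, and install the program:
```
conda create --name sortmerna_env
conda activate sortmerna_env
conda install sortmerna
```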
Answer “yes” (Y) to installation questions. Now when you run SortMeRNA, you need to activate this conda environment and execute your code. The next step is to download the databases to use for this program.
SortMeRNA databases
Separately, you need to download the databases to “align” and check the sequences for ribosomal RNAs. See sortmerna-db-build.slurm.
```
# Move to the location you want your rRNA databases stored
cd /scratch/group/hu-lab/meta-eukomics/

# Download from github release
wget https://github.com/biocore/sortmerna/releases/download/v4.3.4/database.tar.gz

# Make a new directory and un-zip the downloaded file
mkdir rRNA_databases_v4
tar -xvf database.tar.gz -C rRNA_databases_v4
```
Run SortMeRNA (optional)
In your slurm script, activate conda, and then the sortmerna environment (sortmeRNA.slurm):
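```
# Load anaconda
module load Anaconda3/2022.05
conda activate sortmerna_env

mkdir /scratch/group/hu-lab/meta-eukomics/rRNA-sort

sortmerna -ref /scratch/group/hu-lab/meta-eukomics/rRNA_databases_v4/smr_v4.3_sensitive_db.fasta \
  -reads /scratch/group/hu-lab/meta-eukomics/raw-data/HS039_S90_L004_R1_trimmed.fastq.gz \
  -reads /scratch/group/hu-lab/meta-eukomics/raw-data/HS039_S90_L004_R2_trimmed.fastq.gz \
  -workdir /scratch/group/hu-lab/meta-eukomics/rRNA-sort/ \
  -aligned rRNA-aligned -other non-rRNA \
  -fastx --paired_in --out2 --threads $SLURM_CPUS_PER_TASK
```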
-paired_in: The option ‘paired_in’ is optimal for users who want all reads in the other.fasta file to be non-rRNA. However, there are small chances that reads which are non-rRNA will also be put into the aligned.fasta file.
Since SortMeRNA is designed to take reads that align to the rRNA databases and put them into a file called “aligned”, we need to use the other.fasta output file (the non-rRNA reads); the --out2 argument additionally writes that output as separate R1 and R2 files.
Necessary flags for retaining the non-rRNA reads and outputting them as separate R1 and R2 files: --other, --fastx, --paired_in, and --out2. To learn more about this, check out this thread: How do I properly run sortmerna? https://github.com/sortmerna/sortmerna/issues/333
This was run with 32 threads and 1 node. It took 2 hours 45 minutes and used just over 16 GB of memory.
The total number of reads that aligned to the rRNA databases was quite small. I am planning not to use the sorted reads, as there are some issues with the program; I will look for another way to remove rRNA reads.
2.4 4. Assembly
The next step is to take all the trimmed reads and bring them together to make longer, more continuous sequences, called contigs. Here, we will use two assemblers and combine the results. Each one is built slightly differently.
MEGAHIT (assembly 1)
We will first use megahit. To save the assemblies separately, make a new assembly output directory in your working scratch space: mkdir /scratch/group/hu-lab/meta-eukomics/assembly
The megahit command below outputs contigs into the new assembly directory, only keeps contigs longer than 1,000 bp (--min-contig-len 1000), and uses the megahit preset for samples with lots of diversity (meta-large).
Megahit uses a multiple k-mer strategy, and we can set the minimum and maximum k. To reduce the complexity of the de Bruijn graph, a k-min of 25 or higher (such as 27) is better.
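See the slurm script megahit-assembly.slurm:
```
module load GCCcore/11.2.0
module load MEGAHIT/1.2.9

megahit -1 /scratch/group/hu-lab/meta-eukomics/raw-data/HS039_S90_L004_R1_trimmed.fastq.gz \
  -2 /scratch/group/hu-lab/meta-eukomics/raw-data/HS039_S90_L004_R2_trimmed.fastq.gz \
  -o /scratch/group/hu-lab/meta-eukomics/assembly/ \
  --min-contig-len 1000 --presets meta-large --num-cpu-threads 32
```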
With the above settings, megahit recovered 31,331 contigs, total 44610184 bp, min 1000 bp, max 10674 bp, avg 1423 bp, N50 1391 bp. And this took about 10 hours.
On the HPC, this was run on 1 node, 4 cores per node, and it used 18.5 GBs of memory.
Megahit assembly location: /scratch/group/hu-lab/meta-eukomics/assembly/megahit-output/final.contigs.fa
Optional: when the megahit assembly is complete, you can visualize its contig graph: https://github.com/voutcn/megahit/wiki/Visualizing-MEGAHIT's-contig-graph
IDBA-Tran (assembly 2)
IDBA-tran is another de novo assembler. It uses local assembly to reconstruct missing k-mers in low-expressed transcripts.
See the script idba-assembly.slurm, which runs the assembly with a minimum k-mer of 20, a maximum k-mer of 50, and a k-mer step of 5.
Code
echo "unzip files"gunzip /scratch/group/hu-lab/meta-eukomics/raw-data/HS039_S90_L004_R*_trimmed.fastq.gzmodule load GCC/8.2.0-2.31.1module load IDBA-UD/1.1.3# Convert fastq R1 and R2 files into a single 'interleaved` fastq filesfq2fa --merge --filter /scratch/group/hu-lab/meta-eukomics/raw-data/HS039_S90_L004_R1_trimmed.fastq /scratch/group/hu-lab/meta-eukomics/raw-data/HS039_S90_L004_R2_trimmed.fastq /scratch/group/hu-lab/meta-eukomics/assembly/merged-PE-for-idba.fa# Run assemblyidba_tran -r /scratch/group/hu-lab/meta-eukomics/assembly/merged-PE-for-idba.fa -o /scratch/group/hu-lab/meta-eukomics/assembly/idba-out --mink 20--maxk 50--step 5--num_threads 16# Make sure fastq files are re-zipped:gzip /scratch/group/hu-lab/meta-eukomics/raw-data/HS039_S90_L004_R1_trimmed.fastqgzip /scratch/group/hu-lab/meta-eukomics/raw-data/HS039_S90_L004_R2_trimmed.fastq
Using 8 cores, the IDBA-tran assembly took about 23 hours and used 130 GB of memory.
The IDBA-tran output is separated by k-mer length, so there are multiple “contig-X.fa” files, where contig-25.fa contains the contigs assembled at k-mer = 25. The IDBA program determines which is best and puts it into the final contig.fa file. Output: /scratch/group/hu-lab/meta-eukomics/assembly/idba-out/contig.fa
2.5 5. Evaluate assemblies
QUAST
QUAST (QUality ASsessment Tool) is a tool for evaluating assemblies. Not all assembly-evaluation tools will work here, as most are built for metagenomics or require a reference genome, and in this case we do not have a reference genome.
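Run QUAST separately on each assembly, first for the IDBA output (quast-idba.slurm) and then for the megahit output (quast-megahit.slurm); supplying the trimmed reads lets QUAST also report mapping statistics:
```
module load GCC/9.3.0
module load OpenMPI/4.0.3
module load QUAST/5.0.2-Python-3.8.2

# IDBA assembly
quast.py /scratch/group/hu-lab/meta-eukomics/assembly/idba-out/contig.fa \
  -1 /scratch/group/hu-lab/meta-eukomics/raw-data/HS039_S90_L004_R1_trimmed.fastq.gz \
  -2 /scratch/group/hu-lab/meta-eukomics/raw-data/HS039_S90_L004_R2_trimmed.fastq.gz \
  -o /scratch/group/hu-lab/meta-eukomics/assembly/idba-out/ \
  --threads $SLURM_CPUS_PER_TASK

# megahit assembly
quast.py /scratch/group/hu-lab/meta-eukomics/assembly/megahit-output/final.contigs.fa \
  -1 /scratch/group/hu-lab/meta-eukomics/raw-data/HS039_S90_L004_R1_trimmed.fastq.gz \
  -2 /scratch/group/hu-lab/meta-eukomics/raw-data/HS039_S90_L004_R2_trimmed.fastq.gz \
  -o /scratch/group/hu-lab/meta-eukomics/assembly/megahit-output/ \
  --threads $SLURM_CPUS_PER_TASK
```
Each run used 1 node with 16 cores and took about 1 hour, with roughly 3-3.5 GB of memory.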
IDBA outputs many more contigs, but they are shorter; megahit produces far fewer total contigs (31,331 vs. 105,801). However, the number of IDBA contigs over 1,000 bp drops to only ~13K, showing that most of them are between 500 and 1,000 bp (the program only reports contigs greater than 500 bp). Megahit, on the other hand, has over 31K contigs greater than 1,000 bp. For the longest contigs (>5K bp), we get 27 from megahit and only 16 from IDBA.
As a result, a slightly higher fraction of the original reads maps to the IDBA output (28% vs. 24%), but its average coverage depth is only 27. Megahit maps a few points fewer of the reads, but its average coverage depth is 47.
These assemblies can be combined below.
2.6 6. Cluster assembly output (mmseqs2)
Because we have outputs from two separate assemblies, we can use a program to “cluster” the output contigs together. The clustering / merging of contigs is based on sequence similarity, so we need to choose a similarity threshold, which is somewhat arbitrary.
We will use mmseqs2 and try this at 99% similarity. MMseqs2 stands for “Many-against-Many sequence searching”.
First, the output contig files from IDBA and Megahit are combined: cluster-mmseqs2.slurm
Code
```
# Precursor module loads to use MMseqs2 on HPRC
module load GCC/10.2.0
module load OpenMPI/4.0.5
module load MMseqs2/13-45111

mkdir /scratch/group/hu-lab/meta-eukomics/assembly/combined-assembly

cat /scratch/group/hu-lab/meta-eukomics/assembly/megahit-output/final.contigs.fa /scratch/group/hu-lab/meta-eukomics/assembly/idba-out/contig.fa > /scratch/group/hu-lab/meta-eukomics/assembly/combined-assembly/combined-assemblies.fasta

# Use the 'easy-linclust' option
mmseqs easy-linclust /scratch/group/hu-lab/meta-eukomics/assembly/combined-assembly/combined-assemblies.fasta /scratch/group/hu-lab/meta-eukomics/assembly/combined-assembly/clusterRes /scratch/group/hu-lab/meta-eukomics/assembly/combined-assembly/tmp --threads=$SLURM_CPUS_PER_TASK
```
If you’re working with a massive amount of data, this is a good place to try reducing the percent identity with mmseqs down to 99% or 98%.
With 32 cores (1 node), this took just under 1 hour and used 1.3 GB of memory.
6.1 QUAST new combined assembly
Repeat assembly evaluation with quast (quast-combined.slurm)
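Load the same QUAST modules as above, then:
```
quast.py /scratch/group/hu-lab/meta-eukomics/assembly/combined-assembly/combined-assemblies.fasta \
  -1 /scratch/group/hu-lab/meta-eukomics/raw-data/HS039_S90_L004_R1_trimmed.fastq.gz \
  -2 /scratch/group/hu-lab/meta-eukomics/raw-data/HS039_S90_L004_R2_trimmed.fastq.gz \
  -o /scratch/group/hu-lab/meta-eukomics/assembly/combined-assembly/ \
  --threads $SLURM_CPUS_PER_TASK
```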
With the combined assembly we get a better set of contigs: 137,132 in total, more than either IDBA or megahit alone, with more than 45K contigs greater than 1,000 bp and 43 greater than 5K bp. The percent of original reads mapped is now higher, at 31%, while the average coverage depth drops to 21. This trade-off is fine because more reads now map to our assembly.
2.7 7. Transdecoder
From the combined assemblies, we need to extract the long open reading frames (ORFs); ORFs with at least 100 amino acids will be identified. We then follow this with predicting the likely coding regions so we can continue with protein annotation.
transdecoder.slurm:
Code
```
module load GCC/11.3.0
module load TransDecoder/5.5.0

# Extract the long open reading frames
TransDecoder.LongOrfs -t /scratch/group/hu-lab/meta-eukomics/assembly/combined-assembly/combined-assemblies.fasta --output_dir /scratch/group/hu-lab/meta-eukomics/predictions/

# Predict the likely coding regions
TransDecoder.Predict -t /scratch/group/hu-lab/meta-eukomics/assembly/combined-assembly/combined-assemblies.fasta --output_dir /scratch/group/hu-lab/meta-eukomics/predictions/
```
At 16 cores (1 node), the job took 1 hour and used 5.5 GB of memory.
7.1 (optional) MMseqs2 part II
At this point, if you have a LOT of samples and large amounts of data, a repeat clustering of the proteins will help reduce the amount of data you’re working with.
2.8 8. Key output files
At this point, you have assembled, quality checked, and predicted proteins for all your metatranscriptomes. Key output files:
Assembled contigs: /scratch/group/hu-lab/meta-eukomics/assembly/combined-assembly/clusterRes_rep_seq.fasta
Predicted proteins: /scratch/group/hu-lab/meta-eukomics/predictions (see files ending in .gff3 and .pep)
TransDecoder outputs: /scratch/group/hu-lab/meta-eukomics/assembly/combined-assembly/combined-assemblies.fasta.transdecoder.*
Next steps: annotate the above files based on the predicted proteins or the contigs, and determine transcript abundance by mapping the original trimmed fastq reads back onto these contigs.
2.9 9. Annotation
The goal of the below steps is to match assembled reads with taxonomy and protein databases.
9.1 Marferret & Diamond
For taxonomic annotation, we use MarFERReT (https://github.com/armbrustlab/marferret).
For this we will use Diamond. This is a really fast tool for querying nucleotide or protein sequences against a database. You need to first make a diamond database (.dmnd), and then query it with diamond's blastp (protein) or blastx (nucleotide) mode.
To use the MarFERReT resource, we query the predicted proteins from our combined assembly against the MarFERReT database with diamond.
In order to run Diamond (or any similar blast-like program), we need to consider a few variables & terms.
e-value (expectation value): This is the number of hits of at least this quality that you would expect to find by chance. So an e-value of 50 means up to 50 of the matches (or hits) in your results could be there by chance. Lower e-values therefore mean better-quality matches. These programs often default to an e-value of 10, which is helpful for looking at all possible results, but an e-value of 0.01 or lower is better for keeping only good matches. There is a formula for how e-values are calculated, and it depends on the size of the query sequences and the database.
Bit score: This is a log2-scaled, normalized version of the raw alignment score. Because it is normalized for the scoring system, bit scores are comparable across searches; combined with the query and database sizes, the bit score determines the e-value.
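Under the standard BLAST-style (Karlin-Altschul) statistics that Diamond also reports, the relationship is approximately:

E = m × n × 2^(-S')

where m is the query length, n is the total database length, and S' is the bit score. This is why the same alignment receives a larger e-value when searched against a larger database.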
Diamond's default e-value is 0.001; I am changing mine to 0.0001. I am also using the --sensitive setting, which is designed to find significant matches (>50 bits) for fragments of 30-40 aa.
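The diamond search against the MarFERReT database:
```
module load GCC/11.2.0
module load DIAMOND/2.0.15

diamond blastp --threads $SLURM_CPUS_PER_TASK \
  --query /scratch/group/hu-lab/meta-eukomics/assembly/combined-assembly/combined-assemblies.fasta.transdecoder.pep \
  --db /scratch/group/hu-lab/marferret/data/MarFERReT.v1.1.1.dmnd \
  -o /scratch/group/hu-lab/meta-eukomics/annotation/meta-eukomic_marferret_07082024.txt \
  --sensitive --evalue 0.0001 --max-target-seqs 10 \
  --outfmt 6 qseqid sseqid sallseqid evalue bitscore qlen slen
```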
With 32 cores (1 node), this took 20 minutes and 10 GB of memory.
2.10 10. Transcript counts
To determine “gene expression”, we estimate it from the number of transcripts (sequenced reads) that map onto the assembled contigs. We will use salmon.
First we need to index your transcripts - salmon uses a quasi-map approach to quantify the reads.
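```
module load GCC/11.2.0
module load OpenMPI/4.1.1
module load Salmon/1.7.0

mkdir /scratch/group/hu-lab/meta-eukomics/salmon-quant

salmon index -t /scratch/group/hu-lab/meta-eukomics/assembly/combined-assembly/clusterRes_rep_seq.fasta \
  -i /scratch/group/hu-lab/meta-eukomics/salmon-quant/ \
  -p $SLURM_CPUS_PER_TASK
```
Indexing this data took 45 minutes (32 cores, 1 node) and 1.2 GB of memory.
Then quantify the trimmed reads against the index:
```
salmon quant -i /scratch/group/hu-lab/meta-eukomics/salmon-quant/ -l A \
  -1 /scratch/group/hu-lab/meta-eukomics/raw-data/HS039_S90_L004_R1_trimmed.fastq.gz \
  -2 /scratch/group/hu-lab/meta-eukomics/raw-data/HS039_S90_L004_R2_trimmed.fastq.gz \
  -p $SLURM_CPUS_PER_TASK --validateMappings \
  -o /scratch/group/hu-lab/meta-eukomics/salmon-quant/quant
```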
At 32 cores (1 node), transcript counts took 3.5 hours and 4.2 GB of memory.
Read more here, especially on how to loop through samples: https://combine-lab.github.io/salmon/getting_started/#indexing-txome
2.11 11. Compile outputs
Annotation from marferret: /scratch/group/hu-lab/meta-eukomics/annotation/meta-eukomic_marferret_07082024.txt. Column 1 is the contig ID name (qseqid) and column 2 is the match ID from the database (sseqid). These two columns need to be put into a new dataframe with the columns named “TRANSCRIPTID” and “GENEID”.
For the GENEIDs, the “mftXXXXXXXXX” names correspond to pfam IDs in: /scratch/group/hu-lab/marferret/data/MarFERReT.v1.1.1.best_pfam_annotations.csv.gz
Use MarFERReT.v1.1.1.taxonomies.tab.gz to get the tax id and then line this up with MarFERReT.v1.1.1.metadata.csv to get the actual taxonomy.
See scripts/compile-metat-results.R. This script will output an R object called tx_gene_KEY.
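A minimal sketch of that first step (hypothetical code, not the actual script; the column positions assume the diamond --outfmt 6 field order used above):
```
# Sketch: build the TRANSCRIPTID / GENEID key from the diamond annotation table
annot <- read.table("/scratch/group/hu-lab/meta-eukomics/annotation/meta-eukomic_marferret_07082024.txt",
                    sep = "\t", header = FALSE, stringsAsFactors = FALSE)
tx_gene_KEY <- data.frame(TRANSCRIPTID = annot[[1]], GENEID = annot[[2]])
```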
11.1 Transcript counts
For obtaining transcript-level estimates from the salmon count files, we need to use the R library tximport. Review the manual here.
Salmon output should be /meta-eukomics/salmon-quant/quant/quant.sf
From the compile-metat-results.R code, you can import this tx_gene_KEY object to use in the tximport estimation of TPMs. First, you need to run the command tximport.
This should output a txi object called txi_metat and then you can create a paired sample table.
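See the Rscript scripts/tximport_run.R; the core call looks like this:
```
## Example:
# library(tximport)
# 'files' is a named vector of paths to the salmon quant.sf files
# txi <- tximport::tximport(files, type = "salmon", tx2gene = tx2gene_in)
```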
11.2 Get TPMs & Annotation information
See scripts/format-output-tables.R. This takes the txi objects and uses the command makeCountsFromAbundance() to generate a TPM file that represents the transcript-length corrected estimates.
Then the script imports the annotation file, merges it with the count information to create a large R object with TPMs and annotation information.
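3 Metatranscriptome run information

| Software | Version |
|--------------|----------|
| Fastqc | 0.11.9 |
| Trimmomatic | 0.39 |
| SortMeRNA | 4.3.7 |
| megahit | 1.2.9 |
| IDBA-Tran | 1.1.3 |
| quast | 5.0.2 |
| mmseqs2 | 13-45111 |
| TransDecoder | 5.5.0 |
| Salmon | 1.7.0 |
| Diamond | 2.0.15 |

Citations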
SortMeRNA: Kopylova E., Noe L. and Touzet H., “SortMeRNA: Fast and accurate filtering of ribosomal RNAs in metatranscriptomic data”, Bioinformatics (2012), doi: 10.1093/bioinformatics/bts611.
Salmon: Patro, R., Duggal, G., Love, M. I., Irizarry, R. A., & Kingsford, C. (2017). Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods.
QUAST: Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013;29:1072-1075. doi:10.1093/bioinformatics/btt086. https://www.ncbi.nlm.nih.gov/pubmed/23422339
4 eukrhythmic
Guidelines and notes on running the same set of samples through the eukrhythmic pipeline. Running this on the TAMU HPRC with a conda environment.
First create a conda environment specific for eukrhythmic:
```
conda create -n eukrhythmic

# Activate environment
conda activate eukrhythmic

# Use HPRC system to ensure newest version of available Python is loaded
> (eukrhythmic) module load GCCcore/12.3.0
> (eukrhythmic) module load Python/3.11.3
```
To check what version of python you are automatically running, use python --version.
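Remaining dependencies were installed inside the environment (note: I could not get mamba configured correctly at first):
```
> (eukrhythmic) conda install conda-forge::pyyaml
> (eukrhythmic) conda install -c conda-forge python pandas

# Install snakemake, but use mamba
> (eukrhythmic) conda install -c conda-forge mamba
mamba install -c conda-forge -c bioconda snakemake
```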
To run eukrhythmic, activate the environment: conda activate eukrhythmic.