Data science

Despite their ecological importance, efforts to catalog protistan physiological abilities and trophic strategies lag behind those of their prokaryotic counterparts. The interpretation of large sequence datasets relies on the knowledge gleaned from culture-based transcriptome studies, new developments in genetic probing of microorganisms, and single-cell sequencing. It is imperative that we build reproducible pipelines and infrastructure to keep up with growing sequence information.

1 Computational biology

1.1 Tag-sequencing

To address questions related to microeukaryotic biodiversity, we use 18S rRNA gene tag-sequencing (also called metabarcoding). Below are resources for analyzing this type of data.

Our most recent tag-sequencing work flow was updated in 2024.

Check my github for in process work, where we are always working to update our code.

Also see the BVCN resources for amplicon sequencing

1.2 Time-series data

When we work with any metabarcoding data, it is important to acknowledge and analyze it as compositional data. As part of our R code clubs we always emphasize the inclusion of diagnostic steps to make sure our data is being analyzed appropriately.

A product from a series of discussions with colleagues is resource for how to analyze time-series data.. When paired with the publication it is a good reminder and starting point when analyzing this type of dataset.

Additional resources include the code and publication, called A Hard Day’s Night, which explores temporal niche partitioning among Station ALOHA microeukaryotic communities.

1.3 Metatranscriptomics

To interrogate in situ microeukaryotic metabolic activity, we can target the messengar RNA (mRNA) for sequencing through metatranscriptomics.

Our most recent introductory tutorial on metatranscriptomics, especially for eukaryotes, can be found here.

1.3.1 Computational tools

An active area of our research includes developing solutions to the computational challenges presented by eukaryotic genetics, a part of this involves creating tools to automate the assembly, annotation, and processing of meta’omic (metagenome/metatranscriptome) data (EUKulele, EUKHeist) and augment how sequence reference databases are queried (see Krinos et al. 2024).

EUKHeist code and publication

EUKulele documentation and publication.

1.3.2 Building reproducible pipeline for amplicon sequence analysis

Instructions for implementing a snakemake workflow to make tag-sequencing pipeline more robust HERE

Building/prepping a database for use in QIIME2

1.3.3 Metaeukomic

Members of the Hu lab are contributing to community-wide efforts to examine the range of variability across metatranscriptome bench and computational pipelines. This an intercalibration effort funded by OCB, Metaeukomic.

2 Database Darkly

A challenge we face with deep-sea biodiversity work is the lack of available genetic reference databases. Using results from a large tag-sequencing survey, a team of undergraduate researchers took a deep dive into mining biological and ecological information about deep-sea protistan species.

Visit the Database Darkly website to learn more about the microeukaryotes that live in the dark ocean.

{.lightbox,fig-align=“center”, width=“450”}

3 R tutorials & workshops

R code examples and tutorials
Generating figures from HTS tag-sequencing data - OTU or ASV table as input
Learning R with BVCN
Extracting microbial eukaryotic information from metagenomes

4 R packages & builds

4.1 Axial Seamount Wordle

Play our custom Axial Seamount-themed wordle

Part of our #protatax expedition.

4.2 Woods Hole color palette

Download color palettes inspired by Cape Cod and Woods Hole area. Includes docks, sunsets, WHOI logo, and RV Atlantis. github

4.3 ExeRcise

Generate a randomized custom workout. Inspired by High Intensity Interval training workouts, user defines warm up and workout interval length and total number of exercises. Additional functions to run workout with a timer in R and a tabata-style workout. github

4.4 Lab work

Built for personal use only. List of functions to streamline molecular bench work and common math required in preparing genetic material for sequencing. e.g., PCR master mix calculation, CV=CV, and RNA quantification preparation steps. github repo

the Knob in Woods Hole, MA