How to use

Database Darkly provides insight into a deep-sea survey of microbial eukaryote sequences. Because these sequences belong to protistan species for which we do not necessarily have microscopic images or cultured representatives, we have compiled what we know about each reference here.

As a community, we hope that this information can be used to link across other studies so we can expand what we know about these environmental strains.

1 Compare with your data

1.1 Download Database Darkly

The current version of the database is available here.

Files:

Reference sequences for all ASVs & taxonomy assignments
Taxonomy assignments with integrated curation information

Last update: the initial upload of Database Darkly was completed in June 2024.

1.2 ASV table

All data are available on Zenodo, and the code to reproduce the microeuk deep-sea survey is available on GitHub.

1.2.1 Download qiime2 data from microeuk survey

First, download qiime2-output-files_Hu-et-al.tar from the Zenodo link, then extract the files.

mkdir qiime2-output
mv qiime2-output-files_Hu-et-al.tar qiime2-output
cd qiime2-output 
tar -xf qiime2-output-files_Hu-et-al.tar

Because DADA2 is best run separately on each sequencing run when determining ASVs, there are three separate ASV datasets. These were merged to create the microeuk-merged data.

1.2.2 Extract fasta files for reference database

You do not need QIIME2 installed to obtain the reference sequences: the .qza file can be extracted like a normal zip file. After unzipping, move the reference sequences into the qiime2 output directory.

unzip microeuk-merged-ref-seqs.qza
mv a7d9b643-92c2-4be8-ac4b-c62b142474e4/data/dna-sequences.fasta microeuks-ref-seqs.fasta

The next steps merge the sequence files above with the count files from the original DADA2 output.
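
To merge the reference sequences with the count data in R, the FASTA file first needs to be read into a table. The sketch below is one way to do that in base R. The helper `read_fasta_df()` and the toy records (`asv_a`, `asv_b`) are hypothetical and not part of the original workflow; the helper assumes a well-formed FASTA file whose first line is a header.

```r
# Hypothetical helper (not from the original workflow): parse a FASTA file
# into a data frame with one row per record.
read_fasta_df <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(lines)]            # drop empty lines
  is_header <- startsWith(lines, ">")
  stopifnot(length(lines) > 0, is_header[1])
  # Every header starts a new record; cumsum gives each line its record index
  record <- cumsum(is_header)
  # Keep the first whitespace-delimited token after ">" as the ID
  ids <- sub("^>(\\S+).*$", "\\1", lines[is_header], perl = TRUE)
  # Paste together wrapped sequence lines that belong to the same record
  seqs <- tapply(lines[!is_header], record[!is_header], paste, collapse = "")
  data.frame(
    feature_id = ids[as.integer(names(seqs))],
    sequence = unname(as.character(seqs)),
    stringsAsFactors = FALSE
  )
}

# Toy FASTA with made-up records, written to a temporary file
tmp <- tempfile(fileext = ".fasta")
writeLines(c(">asv_a", "ACGT", "TTGA", ">asv_b", "GGCC"), tmp)

ref_seqs <- read_fasta_df(tmp)
print(ref_seqs)
```

With the real data, the call would presumably be `read_fasta_df("microeuks-ref-seqs.fasta")`. The resulting `feature_id` column should then correspond to the `#OTU ID` column of the ASV table, but that correspondence is worth checking before joining.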

1.2.3 Use R to merge with taxonomic IDs

Load libraries

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Import ASV table

asv_table <- read_delim("input-data/microeuk-merged-asv-table.tsv", skip = 1)

Rows: 17934 Columns: 98
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr  (1): #OTU ID
dbl (97): 101_GR_substrate_MC3_Riftia_6_0_Jun2021, 102_GR_substrate_MC3_Shel...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# head(asv_table)

Import taxonomy information

tax_table <- read_delim("input-data/taxonomy.tsv")

Rows: 17934 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (2): Feature ID, Taxon
dbl (1): Consensus

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(tax_table)

Rows: 17,934
Columns: 3
$ `Feature ID` <chr> "000ee37747b75c12d7108ddf5c5cf3ea", "00165708dab37d750eb5…
$ Taxon        <chr> "Eukaryota;Alveolata;Ciliophora;Nassophorea;Nassophorea_X…
$ Consensus    <dbl> 0.8, 1.0, 0.8, 1.0, 1.0, 1.0, 0.8, 1.0, 0.7, 0.7, 1.0, 0.…

Import metadata

metadata <- read_delim("input-data/samplelist-metadata.csv")

Rows: 100 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (17): SAMPLE, VENT, COORDINATES, SITE, Sample_or_Control, SAMPLEID, SAMP...
dbl  (3): ref_num, YEAR, Perc

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(metadata)

# A tibble: 6 × 20
  ref_num SAMPLE   VENT  COORDINATES SITE  Sample_or_Control SAMPLEID SAMPLETYPE
    <dbl> <chr>    <chr> <chr>       <chr> <chr>             <chr>    <chr>     
1       1 Axial_B… Deep… 46.27389 N… Axial Sample            BSW1500m Background
2       2 Axial_A… Anem… 45.9335667… Axial Sample            Anemone… Plume     
3       3 Axial_A… Anem… 45.9332 N … Axial Sample            FS891    Vent      
4       4 Axial_B… Boca  45.927692 … Axial Sample            FS905    Vent      
5       5 Axial_D… Depe… 45.87992 N… Axial Sample            FS900    Vent      
6       6 Axial_E… El G… 45.926575 … Axial Sample            FS896    Vent      
# ℹ 12 more variables: DEPTH <chr>, YEAR <dbl>, TEMP <chr>, pH <chr>,
#   Perc <dbl>, Mg <chr>, H2. <chr>, H2S <chr>, CH4 <chr>, ProkConc <chr>,
#   Sample_actual <chr>, Type <chr>

unique(metadata$SITE) # Remove "substrate", "control", and "Laboratory" samples

[1] "Axial"      "control"    "GordaRidge" "Laboratory" "Piccard"   
[6] "substrate"  "VonDamm"

unique(metadata$Sample_or_Control) # Remove "Control" samples

[1] "Sample"  "Control"

unique(metadata$SAMPLETYPE) # Keep only "Background", "Plume", or "Vent" samples

[1] "Background"     "Plume"          "Vent"           "Control"       
[5] "Incubation"     "Microcolonizer"
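
The comments on the three `unique()` calls above describe which samples to drop, but not the filtering step itself. A minimal sketch of that filter follows, assuming the intent is to keep only in situ Background, Plume, and Vent samples. The toy data frame `metadata_toy` and the output name `metadata_insitu` are made up for illustration; only the column names and category values come from the output above.

```r
library(dplyr)

# Toy stand-in for `metadata`: sample names are made up, column names and
# category values follow the output shown above
metadata_toy <- data.frame(
  SAMPLE = c("s1", "s2", "s3", "s4", "s5"),
  SITE = c("Axial", "control", "substrate", "Piccard", "Laboratory"),
  Sample_or_Control = c("Sample", "Control", "Sample", "Sample", "Sample"),
  SAMPLETYPE = c("Vent", "Control", "Microcolonizer", "Plume", "Incubation")
)

metadata_insitu <- metadata_toy %>%
  filter(
    !SITE %in% c("substrate", "control", "Laboratory"),  # drop non-field sites
    Sample_or_Control != "Control",                      # drop controls
    SAMPLETYPE %in% c("Background", "Plume", "Vent")     # keep in situ sample types
  )

print(metadata_insitu)
```

The three conditions overlap on purpose: applying all of them guards against a sample that is labeled inconsistently across columns.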

Combine the three tables above (ASV counts, taxonomy, and metadata). The combined table is our base reference database.
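
One possible way to combine the three tables is sketched below with toy data. It reshapes the ASV table to long format, then joins taxonomy by feature ID and metadata by sample name. Note the key assumption: the sample column names in the ASV table are assumed to match the `SAMPLE` column of the metadata, which the source does not confirm. All toy values and the `SEQUENCE_COUNT` column name are made up.

```r
library(dplyr)
library(tidyr)

# Toy stand-ins for asv_table, tax_table, and metadata (values are made up)
asv_toy <- data.frame(
  `#OTU ID` = c("asv_a", "asv_b"),
  sample_1 = c(10, 0),
  sample_2 = c(3, 7),
  check.names = FALSE
)
tax_toy <- data.frame(
  `Feature ID` = c("asv_a", "asv_b"),
  Taxon = c("Eukaryota;Alveolata", "Eukaryota;Rhizaria"),
  Consensus = c(1.0, 0.8),
  check.names = FALSE
)
meta_toy <- data.frame(
  SAMPLE = c("sample_1", "sample_2"),
  SITE = c("Axial", "Piccard"),
  SAMPLETYPE = c("Vent", "Plume")
)

combined <- asv_toy %>%
  # One row per ASV and sample
  pivot_longer(cols = -`#OTU ID`, names_to = "SAMPLE", values_to = "SEQUENCE_COUNT") %>%
  # Drop ASV/sample pairs with no reads to keep the table small
  filter(SEQUENCE_COUNT > 0) %>%
  # Add taxonomy; ASSUMPTION: `#OTU ID` and `Feature ID` hold the same hashes
  left_join(tax_toy, by = c("#OTU ID" = "Feature ID")) %>%
  # Add metadata; ASSUMPTION: ASV table column names match metadata$SAMPLE
  left_join(meta_toy, by = "SAMPLE")

print(combined)
```

With `left_join()`, samples or ASVs without a match are kept with `NA` in the joined columns, so checking for `NA` afterwards is a quick way to spot naming mismatches.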

2 Review the taxonomic assignments

taxonomic_lineages <- tax_table %>% 
  select(-Consensus, -`Feature ID`) %>% 
  separate(Taxon, into = c("Domain", "Division", "Phylum", "Class", "Order", "Family", "Genus", "Species"), sep = ";", remove = FALSE) %>% 
  filter(Domain == "Eukaryota") %>%
  distinct()

Warning: Expected 8 pieces. Additional pieces discarded in 8916 rows [1, 2, 5, 8, 9, 10,
11, 12, 16, 19, 21, 22, 23, 24, 26, 27, 28, 29, 30, 33, ...].

Warning: Expected 8 pieces. Missing pieces filled with `NA` in 9018 rows [3, 4, 6, 7,
13, 14, 15, 17, 18, 20, 25, 31, 32, 38, 39, 42, 43, 44, 46, 47, ...].

# save(taxonomic_lineages, file = "input-data/taxonomic-lineages.RData")
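
The two `separate()` warnings above mean that some `Taxon` strings split into more than eight pieces and others into fewer. The sketch below shows how to count the pieces per row, and how the `extra` and `fill` arguments of `separate()` make the same behavior explicit, so no warning is raised. The toy lineages are made up, and the cause of the extra pieces in the real data is not determined here.

```r
library(dplyr)
library(tidyr)

# Toy lineages (made up) with 9, 3, and 8 semicolon-separated pieces
tax_toy <- data.frame(
  Taxon = c(
    "Eukaryota;A;B;C;D;E;F;G;H",
    "Eukaryota;A;B",
    "Eukaryota;A;B;C;D;E;F;G"
  )
)

ranks <- c("Domain", "Division", "Phylum", "Class", "Order", "Family", "Genus", "Species")

# Count pieces per row as (number of ";" separators) + 1
pieces <- nchar(tax_toy$Taxon) - nchar(gsub(";", "", tax_toy$Taxon, fixed = TRUE)) + 1
print(table(pieces))

lineages_toy <- tax_toy %>%
  separate(
    Taxon, into = ranks, sep = ";", remove = FALSE,
    extra = "drop",   # discard pieces beyond the 8th (same as the default, no warning)
    fill = "right"    # pad missing lower ranks with NA (same as the default, no warning)
  )

print(lineages_toy)
```

Tabulating `pieces` on the real `tax_table` first would show whether the discarded ninth piece carries information before it is dropped. In tidyr 1.3.0 and later, `separate()` is superseded by `separate_wider_delim()`, which offers similar controls for rows with too few or too many pieces.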