Primers:
- 939F – 5′ TTG ACG GGG GCC CGC ACA AG-3′
- 1492R – 5′-GTT TAC CTT GTT ACG ACT T-3′

GGTTACCTTGTTAGACTTGTGCGGGCCCCCGTCAA over rep seq: CGGTGGAGCATGTGGTTTAATTCGACGCAACGCGAAGAACCTTACCTGGG CGGTCTTGCGACCGTACTCCCCAGGCGGAGTGTTTCACGCGTTAGCTGGG GATGAACGCTGGCGGCATGCCTAACACATGCAAGTCGGACGGGAAGTGGT

fwd TTGACGGGGGCCCGCACAAG - 3889280 of 4115088

rev GTTTACCTTGTTACGACTT 1302 of 4115088 GGTTACCTTGTTACGACTT 3785444 GGCTACCTTGTTACGACTT 6109

Expected Insert Length:
- 554 bp

1 Basic Sequence Data Prep


1.1 Fastq file name formatting


Raw Fastqs are in folder 0_RawFastqs/RawFastqs_nameformatted

  • Names are formatted so that the identifying sample label appears before the first underscore. This makes the downstream auto fasta header relabeling seemless and ensures that when the sequences across samples are combined into 1 file, each sequence is properly labeled in the headers.

  • For this project, sample ids were numeric values [1,316], so again for simplifying downstream processing, I renamed each raw fastqs file to have the prefix ‘hfd12wk’ and set all numberic values to have 3 digits.

    • Having only numberic values as sample identifiers is likely to create issues when reading into R/phyloseq
    • Naming examples:
      • “1” → “hfd12wk001”; files = 1_S97_L001_R1_001.fastq → hfd12wk001_R1.fastq
      • “316” → “hfd12wk316”; files = 316_S337_L001_R1_001.fastq → hfd12wk316_R1.fastq
  • Names are formatted so that the identifying sample label appears before the first underscore. This makes the downstream auto fasta header relabeling seemless and ensures that when the sequences across samples are combined into 1 file, each sequence is properly labeled in the headers.

  • For this project, sample ids were numeric values [1,316], so again for simplifying downstream processing, I renamed each raw fastqs file to have the prefix ‘hfd12wk’ and set all numberic values to have 3 digits.

    • Having only numberic values as sample identifiers is likely to create issues when reading into R/phyloseq
    • Naming examples:
      • “1” → “hfd12wk001”; files = 1_S97_L001_R1_001.fastq → hfd12wk001_R1.fastq
      • “316” → “hfd12wk316”; files = 316_S337_L001_R1_001.fastq → hfd12wk316_R1.fastq


2 basic data overview ———————————————————–

phyloseq-class experiment-level object otu_table() OTU Table: [ 1090 taxa and 245 samples ] sample_data() Sample Data: [ 245 samples by 11 sample variables ] tax_table() Taxonomy Table: [ 1090 taxa by 7 taxonomic ranks ]

  • N samples: 245
  • N OTUs: 1,090

USEARCH version
usearch11.0.667_i86linux64



Ref db: rdp_16s_v19

  • 24,642 sequences in taxonomic reference database


2.1 Full Datasets, unrarefied, unprunned

2.1.1 Read Counts

Table 2.1:
Seqs per sampleFull 16S OTU table
min226
max32,036
mean20,585
sd3,467
total5,043,233


2.1.2 Unique Domains detected


16S

## [1] "Bacteria"    "uncl_Domain" "Eukaryota"


2.1.3 Reads per Domain


12-week

Table 2.2:
Domainreads_assignedpercentage_all_reads
Bacteria5,042,64799.99%
uncl_Domain4370.01%
Eukaryota1490.00%


For now, only continuing with reads assigned at >= 80% confidence to domain = Bacteria or Archaea



3 Clean OTU table

3.1 Prune taxa

Domain-level:

  1. Remove anything unclassified at Domain-level (437 reads; 69 OTUs)

  2. Remove Domain = Eukaryota (149 reads; 1 OTUs)

Phylum-level:

  1. Remove Phylum = “uncl_Bacteria” (12,991 reads; 357 OTUs)

  2. Remove Phylum = “uncl_Archaea” (0 reads; 0 OTUs)

  3. Look at Phylum “Cyanobacteriota”

    • It contains these Classes: Cyanophyceae, Chloroplast

Class-level:

  1. Remove Class = “Chloroplast” (122 reads; 4 OTUs)

  2. Remove Class = “uncl_Cyanobacteria/Chloroplast” (0 reads; 0 OTUs)

Family-level:

  1. Remove Family = “Mitochondria” (0 reads; 0 OTUs)

Remaining data:
- 5,029,534 total reads
- 20,528.71 average reads/sample
- 659 OTUs
- 245 samples


3.2 Removing singleton OTUs

At this point, the unrarefied 12wk 16S dataset with unwanted taxa prunned has 659 OTUs and a total of 5,029,534 reads.

3.3 12wk

32 singleton OTUs were present and removed (corresponding to 32 reads)

  • 627 OTUs remain

  • 5,029,502 total reads remain

    • average: 20,528.58

    • sd: 3,460.712


3.4 BacArch OTU Rarefaction Curves


3.4.1 12wk samples with highest read count


3.4.2 All samples, full depth range

par(mar = c(5.1, 4.1, 5.1, 2.1)) # c(bottom, left, top, right)

dev.copy(png,'/Users/L347123/Desktop/hfd-microbiome/docs/figures/rarefaction/Rarecurve_12wk_fullrange.png',width = 650)

plot_rarecurves((phylo2vegan_OTU(hfd_12wk_unrare_min2)), step = 1000,xlim=c(0,max(sample_sums(hfd_12wk_unrare_min2)*1.1)),ylim=c(0,350),
  title(main="Rarefaction Curves for all 12wk samples\nDomain=Bacteria or Archaea, taxa prunned", adj=0),
   label=T,cex.main=1.2)

dev.off()

3.4.3 All samples, max 15k seqs/samples

3.4.4 All samples, max 20k seqs/samples

3.5 Rarefying 12wk OTU table to 13,006 seqs/sample


Rarefaction threshold of 13,006 sequences per sample -> we’d lose:

  • samples overall (% of the total 245 samples)
    – 3 OTUs (0.5% of the total 627 OTUs)
    – 624 OTUs remain

Of note, I am rarefying via subsampling WITHOUT replacement and removing OTUs no longer in dataset


Indeed:
- 2 samples removed because they contained fewer reads than sample.size.
- 3 OTUs were removed because they are no longer present in any sample after random subsampling

4 Final OTU table stats

  • 624 OTUs remain

  • 3,160,458 total reads remain

    • average: 13,006

    • sd: 0