16S Primers:
- 341F – 5′-CCT ACG GGN GGC WGC AG-3′
- 805R – 5′-GAC TAC HVG GGT ATC TAA TCC-3′

Expected Insert Length:
- 465 bp

1 Basic Sequence Data Prep

1.1 Fastq file name formatting

Raw Fastqs are in folder 0_RawFastqs/RawFastqs_nameformatted

Names are formatted so that the identifying sample label appears before the first underscore. This makes the downstream auto fasta header relabeling seemless and ensures that when the sequences across samples are combined into 1 file, each sequence is properly labeled in the headers.
For this project, sample ids were numeric values [1,316], so again for simplifying downstream processing, I renamed each raw fastqs file to have the prefix ‘hfd12wk’ and set all numberic values to have 3 digits.
- Having only numberic values as sample identifiers is likely to create issues when reading into R/phyloseq
- Naming examples:
  - “1” → “hfd12wk001”; files = 1_S97_L001_R1_001.fastq → hfd12wk001_R1.fastq
  - “316” → “hfd12wk316”; files = 316_S337_L001_R1_001.fastq → hfd12wk316_R1.fastq
Names are formatted so that the identifying sample label appears before the first underscore. This makes the downstream auto fasta header relabeling seemless and ensures that when the sequences across samples are combined into 1 file, each sequence is properly labeled in the headers.
For this project, sample ids were numeric values [1,316], so again for simplifying downstream processing, I renamed each raw fastqs file to have the prefix ‘hfd12wk’ and set all numberic values to have 3 digits.
- Having only numberic values as sample identifiers is likely to create issues when reading into R/phyloseq
- Naming examples:
  - “1” → “hfd12wk001”; files = 1_S97_L001_R1_001.fastq → hfd12wk001_R1.fastq
  - “316” → “hfd12wk316”; files = 316_S337_L001_R1_001.fastq → hfd12wk316_R1.fastq

2 basic data overview ———————————————————–

phyloseq-class experiment-level object otu_table() OTU Table: [ 1090 taxa and 245 samples ] sample_data() Sample Data: [ 245 samples by 11 sample variables ] tax_table() Taxonomy Table: [ 1090 taxa by 7 taxonomic ranks ]

N samples: 245
N OTUs: 1,090

USEARCH version
usearch11.0.667_i86linux64

Ref db: rdp_16s_v19

24,642 sequences in taxonomic reference database

2.1 Full Datasets, unrarefied, unprunned

2.1.1 Read Counts

Table 2.1:
Seqs per sample	Full 16S OTU table
min	226
max	32,036
mean	20,585
sd	3,467
total	5,043,233

2.1.2 Unique Domains detected

16S

## [1] "Bacteria"    "uncl_Domain" "Eukaryota"

2.1.3 Reads per Domain

12-week

Table 2.2:
Domain	reads_assigned	percentage_all_reads
Bacteria	5,042,647	99.99%
uncl_Domain	437	0.01%
Eukaryota	149	0.00%

For now, only continuing with reads assigned at >= 80% confidence to domain = Bacteria or Archaea

3 Clean OTU table

3.1 Prune taxa

Domain-level:

Remove anything unclassified at Domain-level (437 reads; 69 OTUs)
Remove Domain = Eukaryota (149 reads; 1 OTUs)

Phylum-level:

Remove Phylum = “uncl_Bacteria” (12,991 reads; 357 OTUs)
Remove Phylum = “uncl_Archaea” (0 reads; 0 OTUs)
Look at Phylum “Cyanobacteriota”
- It contains these Classes: Cyanophyceae, Chloroplast

Class-level:

Remove Class = “Chloroplast” (122 reads; 4 OTUs)
Remove Class = “uncl_Cyanobacteria/Chloroplast” (0 reads; 0 OTUs)

Family-level:

Remove Family = “Mitochondria” (0 reads; 0 OTUs)

Remaining data:
- 5,029,534 total reads
- 20,528.71 average reads/sample
- 659 OTUs
- 245 samples

3.2 Removing singleton OTUs

At this point, the unrarefied 12wk 16S dataset with unwanted taxa prunned has 659 OTUs and a total of 5,029,534 reads.

3.3 12wk

32 singleton OTUs were present and removed (corresponding to 32 reads)

627 OTUs remain
5,029,502 total reads remain
- average: 20,528.58
- sd: 3,460.712

3.4 BacArch OTU Rarefaction Curves

3.4.1 12wk samples with highest read count

3.4.2 All samples, full depth range

par(mar = c(5.1, 4.1, 5.1, 2.1)) # c(bottom, left, top, right)

dev.copy(png,'/Users/L347123/Desktop/hfd-microbiome/docs/figures/rarefaction/Rarecurve_12wk_fullrange.png',width = 650)

plot_rarecurves((phylo2vegan_OTU(hfd_12wk_unrare_min2)), step = 1000,xlim=c(0,max(sample_sums(hfd_12wk_unrare_min2)*1.1)),ylim=c(0,350),
  title(main="Rarefaction Curves for all 12wk samples\nDomain=Bacteria or Archaea, taxa prunned", adj=0),
   label=T,cex.main=1.2)

dev.off()

3.4.3 All samples, max 15k seqs/samples

3.4.4 All samples, max 20k seqs/samples

3.5 Rarefying 12wk OTU table to 13,006 seqs/sample

Rarefaction threshold of 13,006 sequences per sample -> we’d lose:

samples overall (% of the total 245 samples)
– 3 OTUs (0.5% of the total 627 OTUs)
– 624 OTUs remain

Of note, I am rarefying via subsampling WITHOUT replacement and removing OTUs no longer in dataset

Indeed:
- 2 samples removed because they contained fewer reads than sample.size.
- 3 OTUs were removed because they are no longer present in any sample after random subsampling

4 Final OTU table stats

624 OTUs remain
3,160,458 total reads remain
- average: 13,006
- sd: 0

HFD Microbiome: 16S Amplicon Pre-processing of 12-week Experiment

Raw sequence processing and OTU table generation using usearch v11, OTU table cleaning using phyloseq (R), and basic data overview

Kelli Feeser

2025-04-02