The annoSnake workflow: step by step

Note

Input data

annoSnake takes either paired-end or interleaved reads (in gzipped format) as input. Reads must sit in the {INPUTDIR}, there is no need to specify SAMPLE names as annoSnake reads them automatically from the {INPUTDIR}.

Attention

There is no need to trim or filter the reads in advance.

For paired-end data:

{INPUTDIR}
├── $SAMPLE1_R1.fastq.gz
├── $SAMPLE1_R2.fastq.gz
├── $SAMPLE2_R1.fastq.gz
├── $SAMPLE2_R2.fastq.gz
├── ..._R1.fastq.gz
└── ..._R2.fastq.gz

For interleaved data:

{INPUTDIR}
├── $SAMPLE1.fastq.gz
├── $SAMPLE2.fastq.gz
└── ...fastq.gz

./profile/params.yaml file

The ./profile/params.yaml is the main configuration file sitting in the ./profile/ directory. You can specify the name of the inputdir, outdir, library_type, and more.

Tip

See the Snakemake webpage for more information on the ./profile/ directory.

The ./profile/params.yaml file

# Workflow configuration

email: "your_email_address"

# specify input data
inputdir: "input_paired_end"

# input files are 'paired-end' or 'interleaved'?
library_type: "paired-end"

# specify output directory
outdir: "results_paired_end"

# specify minimum length of contigs to output in MEGAHIT
min_length: 1500

# select whether metagenome-assembled genomes (MAGs) shall be assembled or not ('True' or 'False')
mag_assembly: True

#if 'mag_assembly: True' specify completeness and contamination of resulting bins [community standards for medium or high-quality MAGs are defined as follows: ≥50% completeness and ≤10% contamination (Bowers et al. (2017)]
completeness: 30
contamination: 10

# select databases to use ('True' or 'False')
PFAM: True
COG: True
KEGG: True
CAZYMES: True

# specify cut-off E-values
blastp_evalue: "1e-24"
blastx_evalue: "1e-24"
cog_evalue: "1e-30"
cazy_evalue: "1e-30"
pfam_evalue: "1e-30"

./profile/config.yaml file

The ./profile/config.yaml sets up the parameters for SLURM job submission on the HPC; you can change the file as you like.

The ./profile/config.yaml file

### Kudos to @jdblischak! https://github.com/jdblischak/smk-simple-slurm

cluster:
  mkdir -p {OUTDIR}/logs/{rule} &&
  sbatch
    --partition={resources.partition}
    --time={resources.time}
    --cpus-per-task={threads}
    --mem={resources.mem_mb}
    --job-name={rule}.{jobid}
    --output={OUTDIR}/logs/{rule}/{rule}_{wildcards}_%J.out
    --error={OUTDIR}/logs/{rule}/{rule}_{wildcards}_%J.err

default-resources:
  - partition=medium #eg. 'medium' or 'fat' (if in doubt, contact your local HPC support)
  - time="1-00:00:00" # maximum runtime of jobs, here 1 day / 24h
  - mem_mb=150000 # required memory per node in MB

max-jobs-per-second: 1
max-status-checks-per-second: 10
local-cores: 1
latency-wait: 60
jobs: 100
keep-going: True
rerun-incomplete: True
printshellcmds: True
scheduler: greedy
use-conda: True
touch: False
reason: True
show-failed-logs: True

Metagenome assembly

annoSnake uses MEGAHIT v1.2.9 to assemble reads into contigs. You must specify the minimum length of contigs (default: 1500 bp) in the ./profile/params.yaml file.

If you want to change how the assembly is handled by MEGAHIT, you must change either annoSnake/workflow/rules/megahit_paired_end.smk or annoSnake/workflow/rules/megahit_interleaved.smk.

For example, if you don’t want to run MEGAHIT with --presets meta-sensitive, you can change the line into…

megahit -1 {INPUTDIR}/{wildcards.sample}_R1.fastq.gz -2 {INPUTDIR}/{wildcards.sample}_R2.fastq.gz --out-prefix {wildcards.sample} --min-contig-len {params.min_length} -o {OUTDIR}/assemblies/megahit/{wildcards.sample} -t {threads}

annoSnake modifies the Fasta headers using the SAMPLE name and quality checks all contigs with metaQuast.

{OUTDIR}/assemblies/
├── megahit/
│       ├── $SAMPLE1/
│       ├── $SAMPLE2/
│       └── ...
├── metaquast/
└── preprocessed_contigs/
        ├── $SAMPLE1/
        ├── $SAMPLE2/
        └── ...

Taxonomic annotation

annoSnake uses Prokka 1.14.6 (in --metagenome mode) to identify protein-coding sequences (CDS), rRNAs, and tRNAs. To assign contigs taxonomically, 40 single copy marker genes (called COGs; in protein format) are extracted with the help of fetchMG v.1.2 extracts from the predicted CDS, which are then taxonomically assigned with DIAMOND in blastp mode. Other CDS (in nucleotide format) are taxonomically assigned with DIAMOND but in blastx mode. Both annotations make use of GTDB database ver 202.

{OUTDIR}/taxonomy/
├── prokka/
|       ├── $SAMPLE1/
|       |  ├── $SAMPLE1.faa
|       |  ├── $SAMPLE1.fna
|       |  └── ...
│       ├── $SAMPLE2/
|       |  └── ...
│       └── ...
├── blastx/
|       ├── $SAMPLE1/
│       ├── $SAMPLE2/
│       └── ...
└── blastp/
        ├── $SAMPLE1/
        ├── $SAMPLE2/
        └── ...

Functional annotation

You can choose between different protein databases for functional annotation of metagenomic contigs. However, only metagenomic contigs assigned either as bacteria or archaea in the previous blastx search are annotated):

For identifying CDS with carbohydrate metabolising properties, Hidden Markov models (HMM) of CAZy domains deposited in the dbCAN database release 11 are used as default.
To search for hydrogenases, HMM searches against the Pfam database version 35 are performed.
KofamScan v1.3.0 is used to reconstruct prokaryotic metabolic pathways against the KEGG database.

Attention

Results are filtered by cut-off E-values (minimum significant hit) that must be specified by you (see ./profile/params.yaml file).

# specify cut-off E-values
blastp_evalue: "1e-24"
blastx_evalue: "1e-24"
cog_evalue: "1e-30"
cazy_evalue: "1e-30"
pfam_evalue: "1e-30"

{OUTDIR}/annotation/
├── kegg/
|       ├── $SAMPLE1/
│       ├── $SAMPLE2/
│       └── ...
├── cazy/
|       ├── $SAMPLE1/
│       ├── $SAMPLE2/
│       └── ...
└── pfam/
        ├── $SAMPLE1/
        ├── $SAMPLE2/
        └── ...

Attention

For prokaryotic metabolic pathways (ie., KEGG), KO profile thresholds and an E-value ≤1e-30 are used, if KEGG entries of interest (eg. K12212) are present. Otherwise, KEGG entries with the lowest E-value are taken.

Hint

Databases are downloaded automatically. You can choose to download them by yourself, however, you have to make sure that they apply to the correct format (see Databases for more details).

Abundance calculation of gene families

annoSnake makes use of Salmon v1.10.2 to calculate abundances. Salmon aligns raw reads to the contigs that were assigned to bacteria or archaea in the step before as well as to the COGs (see Taxonomic annotation). Salmon adjusts for biases such as GC-content and differences in gene length, and produces Transcripts per Million (TPM) values to represent CDS abundance. For visualisation purposes, TPM values >1 are kept and subsequently log-transformed. Normalisation of TPM counts is performed via centered log-ratio (clr) transformation; executed in the R package propr with a pseudo count of 0.65 to handle zero values appropriately.

{OUTDIR}/quantification/
├── cogs/
│       ├── cogs.index
│       └── cogs.quant
└── contigs/
        ├── $SAMPLE1/
        ├── $SAMPLE2/
        └── ...

Metagenome-assembled genomes (MAGs)

Metagenome contigs are binned into MAGs with three different binning algorithms (in default mode):

To increase contiguity and completeness of the resulting bins, we implemented metaWRAP‘s bin_refinement module, which combines the obtained bins from the three different binning algorithms to produce a consolidated, improved bin set.

Note

Here, the user needs to specify the minimum completeness and maximum contamination of retained MAGs used for downstream analyses in the ./profile/params.yaml file.

 # if 'mag_assembly: True' specify completeness and contamination of resulting bins
completeness: 30
contamination: 10

Quality control of MAGs is performed by CheckM 1.2.2. They are taxonomically classified with GTDB-Tk v2.3.2 using the GTDB database ver 202 as a reference.

Gene prediction of MAGs is performed by Prokka 1.14.6, using the --metagenome option.

Predicted protein sequences are annotated with MicrobeAnnotator with -diamond search against the KEGG database.

Note

For MAGs, pathway completeness is assessed based on presence/absence not on TPM values (see Abundance calculation of gene families).

{OUTDIR}/MAGs/
├── above_threshold_bins/ # bins with minimum completeness and maximum contamination as specified (see above)
|       ├── $SAMPLE1/
│       ├── $SAMPLE2/
│       └── ...
├── bin_refinement/
|       ├── $SAMPLE1/
│       ├── $SAMPLE2/
│       └── ...
├── checkm/
|       ├── $SAMPLE1/
|       ├── $SAMPLE2/
|       └── ...
├── gtdbtk/
|       ├── $SAMPLE1/
│       ├── $SAMPLE2/
│       └── ...
├── maxbin2/
|       ├── $SAMPLE1/
|       ├── $SAMPLE2/
|       └── ...
├── metabat2/
│       ├── $SAMPLE1/
│       ├── $SAMPLE2/
│       └── ...
├── metacoag/
|       ├── $SAMPLE1/
|       ├── $SAMPLE2/
|       └── ...
└── prokka/
        ├── $SAMPLE1/
        ├── $SAMPLE2/
        └── ...

Fresh Install

A fresh install should look like this:

annoSnake
├── docs/
├── workflow/
│       ├── input_paired_end # includes example data
│       ├── profile
|       |     ├── config.yaml
|       |     └── params.yaml
│       ├── rules
|       |     ├── envs/ # conda environment files
|       |     ├── scripts/ # Rscripts etc.
|       |     ├── blastx.smk
|       |     ├── cazy.smk
|       |     └── ...
|       └── Snakefile
├── .git/
├── LICENSE
├── README.md
└── .readthedocs.yaml