🚀 Quick Start: Run the Pipeline

If you're looking for the fastest way to execute the pipeline:

Use the provided run.sh script.

It's preconfigured to handle most parameters, and you only need to update your file paths and step names.

What You Need To Do:

  1. Open run.sh in a text editor.
  2. Update these key variables:
    • WORK_DIR -- your project directory
    • SAMPLES -- path to your sample list file
    • PROJECT_CONFIG -- your YAML config
    • STEPS -- choose which step(s) to run
    • CONDA_ENV -- path to your conda environment
    • Other relevant paths (e.g., reference genome, Kraken DB)
  3. Submit the job to SLURM:
    bash
    sbatch run.sh
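
A minimal sketch of the variable block you edit in run.sh (placeholder paths; the exact layout in your copy of run.sh may differ):

bash
# Sketch only - replace every /path/to/... with your own locations
WORK_DIR="/path/to/project"                      # project directory
SAMPLES="/path/to/samples.txt"                   # sample list file
PROJECT_CONFIG="/path/to/project_config.yaml"    # YAML config
STEPS="qc"                                       # step(s) to run
CONDA_ENV="/path/to/miniconda3/bin/activate metamag"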

â„šī¸Note

Alternatively, if you'd like to run steps in a more customized or incremental way, refer to the detailed instructions below.

SLURM Setup Example

Here's a complete example of a SLURM submission script with all necessary components:

bash
#!/bin/bash
#SBATCH -p ghpc                    # Partition/queue name
#SBATCH -N 1                       # Number of nodes
#SBATCH -n 1                       # Number of tasks
#SBATCH --mem=10G                  # Memory allocation
#SBATCH -t 2:30:00                 # Time limit (HH:MM:SS)
#SBATCH --output=pipeline_meta_%j.out    # Standard output log
#SBATCH --error=pipeline_meta_%j.err     # Standard error log

# CRITICAL: Set your conda environment path (adjust to YOUR installation)
CONDA_ENV="/path/conda/activate metamag"

# Navigate to MetaMAG directory
cd /path/to/MetaMAG-1.0.0

# Activate conda environment
source $CONDA_ENV

# Run the pipeline with specified step
python3 -m MetaMAG.main \
  --project_config plant_project_config.yaml \
  --samples-file samples.txt \
  --batch_size 16 \
  --cpus 1 \
  --memory 20G \
  --time 15-12:00:00 \
  --log_dir ./logs/plant/qc \
  --steps qc

⚠️ Adjust SLURM Parameters Based on Your Step

Different pipeline steps require different resources:

  • QC: Low resources (4 CPUs, 8G memory, 2 hours)
  • Trimming: Medium resources (8 CPUs, 16G memory, 4 hours)
  • Assembly: High resources (32-64 CPUs, 100-200G memory, 1-2 days)
  • Binning: High resources (32 CPUs, 64G memory, 12 hours)
  • GTDB-Tk: Very high memory (64 CPUs, 200G memory, 6 hours)

Understanding Batch Size Parameter

ℹ️ Critical: How Batch Size Works

The --batch_size parameter determines how many samples are processed together in a single job.

  • If you have 20 samples and set --batch_size 20 → 1 job processes all 20 samples
  • If you have 20 samples and set --batch_size 10 → 2 parallel jobs, each processing 10 samples
  • If you have 20 samples and set --batch_size 1 → 20 parallel jobs, each processing 1 sample
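
Before launching, it can help to estimate how many parallel jobs a given --batch_size will produce; a throwaway sketch using ceiling division:

bash
# Sketch: estimate the parallel job count for a given batch size
TOTAL_SAMPLES=$(grep -vc '^#' samples.txt)                      # non-comment lines in the sample list
BATCH_SIZE=10
NUM_JOBS=$(( (TOTAL_SAMPLES + BATCH_SIZE - 1) / BATCH_SIZE ))   # ceiling division
echo "${TOTAL_SAMPLES} samples / batch_size ${BATCH_SIZE} -> ${NUM_JOBS} parallel job(s)"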

Steps That Must Run as Single Batch

The following steps process all data together and should always use a batch_size equal to your total sample count:

Single Batch Steps (set batch_size = total samples)
  • multiqc - Combines all QC reports into one
  • co_assembly - Combines all samples for assembly
  • evaluation - Evaluates all MAGs together with CheckM
  • dRep - Dereplicates all MAGs as a set
  • gtdbtk - Classifies all MAGs in one run
  • all Novel MAG steps - Process novel MAGs as a complete set
  • all Functional Annotation steps - Annotate all MAGs together
  • abundance_estimation - Processes all samples, but within a single workflow
  • all Phylogenetic Tree steps - Build tree from all MAGs
  • all Visualization steps - Generate plots from complete data

Steps That Can Be Parallelized

These steps process individual samples and benefit from smaller batch sizes for parallel execution:

Parallelizable Steps (adjust batch_size for parallelization)
  • qc - Each sample's QC can run independently
  • trimming - Each sample trimmed separately
  • host_removal - Each sample processed independently
  • single_assembly - Each sample assembled separately (use batch_size 1-2 due to high memory)
  • single_binning - Each sample's assembly binned separately
  • single_bin_refinement - Each sample's bins refined separately
  • kraken_abundance - Each sample classified independently (but NOT abundance_estimation)

Recommended Batch Size Strategy

✅ General Guidelines

For a project with N samples:

  • Light steps (QC, trimming): batch_size = N/4 to N/2 (run 2-4 parallel jobs)
  • Memory-intensive steps (assembly, binning): batch_size = 1 to 2 (maximize parallel jobs)
  • Single-batch steps: batch_size = N or higher (run as one job)
  • Cluster limits: Consider your cluster's max job submission limit

⚠️ Memory Consideration

Remember: a lower batch_size means more parallel jobs, and therefore more total memory in use across all jobs at once (for example, 20 samples at batch_size 1 with 100G per assembly job can request up to 2 TB cluster-wide). For memory-intensive steps like assembly, still use batch_size 1-2 even if it means many parallel jobs, but check that your cluster can accommodate the combined allocation.

File Naming Requirements

📁 Input File Naming Conventions

Files must follow one of these naming patterns:

  • {sample}_1.fastq.gz and {sample}_2.fastq.gz
  • {sample}_R1.fastq.gz and {sample}_R2.fastq.gz
  • {sample}_forward.fastq.gz and {sample}_reverse.fastq.gz

Configuration Setup

Project Configuration File (ONLY 3 LINES NEEDED)

yaml
# project_config.yaml - ONLY THESE 3 LINES
input_dir: "/path/to/raw/reads"
output_dir: "/path/to/output"
reference: "/path/to/host/genome.fa"  # Optional for host removal

Sample List Format

text
# samples.txt
Sample_001
Sample_002
Sample_003
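
If your reads follow the first naming pattern above, samples.txt can be generated directly from the input directory; a hedged sketch assuming the {sample}_1.fastq.gz convention:

bash
# Sketch: derive sample names from paired read files (assumes {sample}_1.fastq.gz naming)
INPUT_DIR="/path/to/raw/reads"
for f in "$INPUT_DIR"/*_1.fastq.gz; do
  basename "$f" _1.fastq.gz
done > samples.txt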

Step 1: Quality Control

What it does:
  • Runs initial quality assessment on raw sequencing reads
  • Generates quality reports using FastQC
Key Outputs:
  • Quality control reports
  • Sequence quality metrics
Step 1a: Run FastQC
bash
# CORRECT COMMAND FORMAT
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps qc \
  --batch_size 20 \
  --cpus 4 \
  --memory "8G" \
  --time "0-02:00:00" \
  --log_dir ./logs
Step 1b: Generate MultiQC Report
bash
# No samples file needed for MultiQC
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps multiqc \
  --input_dir $WORK_DIR/QC \
  --batch_size 1 \
  --cpus 2 \
  --memory "4G" \
  --time "0-00:30:00" \
  --log_dir ./logs

Step 2: Preprocessing

Step 2a: Trimming (Fastp)
What it does:
  • Removes low-quality bases
  • Trims adapter sequences
  • Filters out poor-quality reads
Key Outputs:
  • Cleaned read files
  • Trimming statistics

Hardcoded parameters: Q20 quality, 30bp minimum length

bash
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps trimming \
  --batch_size 10 \
  --cpus 8 \
  --memory "16G" \
  --time "0-04:00:00" \
  --log_dir ./logs
Step 2b: Host Removal
What it does:
  • Removes host-derived sequences
  • Maps reads against reference genome
  • Extracts non-host (microbial) reads
Key Outputs:
  • Host-free metagenomic reads
  • Mapping statistics
bash
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps host_removal \
  --reference $REFERENCE \
  --batch_size 10 \
  --cpus 16 \
  --memory "32G" \
  --time "0-06:00:00" \
  --log_dir ./logs

Step 3: Assembly

Step 3a: Single Assembly (IDBA-UD)
What it does:
  • Assembles reads for individual samples
  • Generates contigs using IDBA-UD assembler
Key Outputs:
  • Assembled contigs
  • Assembly statistics
bash
# ALWAYS uses IDBA-UD (no --assembler parameter)
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps single_assembly \
  --batch_size 5 \
  --cpus 32 \
  --memory "100G" \
  --time "1-00:00:00" \
  --log_dir ./logs
Step 3b: Co-Assembly (MEGAHIT)
What it does:
  • Combines reads from multiple samples
  • Generates a unified assembly using MEGAHIT
Key Outputs:
  • Co-assembled contigs
  • Merged assembly statistics

Hardcoded: k-mer 21-141, step 12

bash
# ALWAYS uses MEGAHIT with k-mer 21-141
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps co_assembly \
  --batch_size 50 \
  --cpus 64 \
  --memory "200G" \
  --time "2-00:00:00" \
  --log_dir ./logs

Step 4: Binning & Refinement

Step 4a: Binning
What it does:
  • Clusters contigs into potential genome bins
  • Uses multiple binning algorithms (MetaBAT2, MaxBin2, CONCOCT)
Key Outputs:
  • Initial genome bins
  • Binning statistics
bash
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps single_binning \
  --binning_methods metabat2 maxbin2 concoct \
  --batch_size 10 \
  --cpus 32 \
  --memory "64G" \
  --time "0-12:00:00" \
  --log_dir ./logs
Step 4b: Refinement with DAS Tool
What it does:
  • Refines and merges bins from multiple methods
  • Improves bin quality using DAS Tool
  • Filters bins to process only valid naming patterns
Key Outputs:
  • Refined, high-quality genome bins
  • DAS Tool quality scores
  • Filtered bin directories
bash
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps single_bin_refinement \
  --score_threshold 0.5 \
  --batch_size 10 \
  --cpus 32 \
  --memory "64G" \
  --time "0-06:00:00" \
  --log_dir ./logs

ℹ️ DAS Tool Processing Notes

  • Score threshold: 0.5 (can be adjusted with --score_threshold)
  • Automatically handles missing or empty bin directories
  • Creates TSV mapping files for each binning method
  • Continues pipeline even if DAS Tool fails due to low-quality bins

Step 5: Quality Assessment & Dereplication

Step 5a: Quality Assessment with CheckM
What it does:
  • Runs CheckM lineage workflow to assess genome quality
  • Estimates completeness using lineage-specific marker genes
  • Calculates contamination based on duplicated markers
  • Generates detailed quality metrics and visualization plots
CheckM Workflow Details:
  • Method: lineage_wf (lineage-specific workflow)
  • Marker set: Automatically selected based on taxonomic placement
  • Extension: .fa files
Key Outputs:
  • CheckM Raw Outputs:
    • lineage.ms - Marker lineage information
    • storage/bin_stats_ext.tsv - Raw CheckM statistics
    • checkm_output.log - CheckM execution log
  • Processed Outputs:
    • cleaned_checkm_output.csv - Formatted quality metrics table
    • plots/completeness_contamination_plot.pdf - Bubble plot (size = genome size)
    • plots/genome_quality_plot.pdf - Quality categories with histograms
    • plots/scaffolds_histogram.pdf - Scaffold count distribution

ℹ️ CheckM Processing Notes

  • CheckM lineage_wf is used (not qa or taxonomy_wf)
  • Automatically processes bin_stats_ext.tsv to create cleaned CSV
  • Skips execution if all outputs exist (delete output dir to re-run)
  • Quality categories:
    • Near Complete: ≥90% complete, ≤5% contamination
    • Medium Quality: ≥70% complete, ≤10% contamination
    • Partial: ≥50% complete, ≤4% contamination
Run CheckM on dRep output (default):
bash
# Runs CheckM on dereplicated genomes from dRep
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps evaluation \
  --batch_size 50 \
  --cpus 32 \
  --memory "64G" \
  --time "0-06:00:00" \
  --log_dir ./logs
Run CheckM on custom genome directory:
bash
# Evaluate genomes from a custom directory
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps evaluation \
  --evaluation_input_dir /path/to/custom/genomes \
  --batch_size 50 \
  --cpus 32 \
  --memory "64G" \
  --time "0-06:00:00" \
  --log_dir ./logs
Step 5b: Dereplication with dRep
What it does:
  • Removes redundant MAGs across samples
  • Clusters genomes based on ANI similarity
  • Selects representative genome from each cluster
Key Outputs:
  • dereplicated_genomes/ - Non-redundant MAG set
  • data_tables/genomeInfo.csv - Quality and clustering information
  • passing_genomes.csv - List of genomes passing QC (if recovery mode activated)

ℹ️ dRep Configuration (Hardcoded Parameters)

  • Minimum completeness: 80%
  • Maximum contamination: 10%
  • Strain heterogeneity: 100
  • ANI method: fastANI
  • Secondary clustering ANI: 95%

Automatic Recovery: If dRep fails due to insufficient genomes passing quality thresholds, the pipeline automatically identifies and copies all genomes meeting the thresholds to a 'dereplicated_genomes' directory. A 'passing_genomes.csv' file is generated listing all genomes that passed QC.

Bin Renaming: All bins are automatically renamed with their source prefix (e.g., Sample_001_bin.1.fa for single samples, coassembly_bin.1.fa for co-assembly).

Run CheckM + dRep together:
bash
# Run both CheckM evaluation and dRep dereplication
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps evaluation dRep \
  --batch_size 50 \
  --cpus 32 \
  --memory "64G" \
  --time "0-06:00:00" \
  --log_dir ./logs

Step 6: Taxonomic Classification (GTDB-Tk)

What it does:
  • Assigns taxonomic classifications to MAGs
  • Uses Genome Taxonomy Database (GTDB) for classification
  • Runs classify_wf workflow only (not de novo workflow)
Key Outputs:
  • classify/gtdbtk.bac120.summary.tsv - Bacterial classifications
  • classify/gtdbtk.ar53.summary.tsv - Archaeal classifications
  • Phylogenetic placement information

ℹ️ GTDB-Tk Path Configuration

Input Path (Fixed):

{project_output}/Bin_Refinement/drep/dRep_output/dereplicated_genomes

Output Path (Fixed):

{project_output}/Novel_Mags/gtdbtk

Processing Notes:

  • ANI screening is disabled (--skip_ani_screen) for faster processing
  • Only runs taxonomic classification, not phylogenetic tree inference
  • File extension expected: .fa
  • Automatically creates output directory structure
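
Before submitting GTDB-Tk, it can be worth confirming that the fixed input directory exists and contains genomes; a hedged check assuming $WORK_DIR points at your project output directory:

bash
# Sketch: verify the GTDB-Tk input directory and count .fa genomes
GENOME_DIR="$WORK_DIR/Bin_Refinement/drep/dRep_output/dereplicated_genomes"
ls "$GENOME_DIR"/*.fa 2>/dev/null | wc -l   # should be greater than 0
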
bash
# NO samples file needed
# Paths are automatically determined from project config
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps gtdbtk \
  --batch_size 1 \
  --cpus 64 \
  --memory "200G" \
  --time "0-06:00:00" \
  --log_dir ./logs

⚠️ Memory Requirements

GTDB-Tk requires substantial memory (~200GB) due to the reference database size. Ensure adequate resources are allocated. The tool may fail with insufficient memory without clear error messages.

Step 7: Rumen Reference MAGs Processing (Optional)

ℹ️ For Rumen Microbiome Projects

These steps are specifically designed for rumen microbiome studies. They integrate high-quality reference genomes from major rumen microbiome databases (RGMGC, MGnify, RUG).

Step 7.1: Download Reference MAGs

What it does:
  • Downloads reference MAGs from three major sources:
    • RGMGC: 10,373 genomes from ENA (European Nucleotide Archive)
    • MGnify: 2,729 genomes from cow rumen species catalogue
    • RUG: 4,941 genomes from Rumen Uncultured Genomes dataset
  • Automatically handles download retries and error recovery
  • Processes downloaded files (extracts .gz, standardizes to .fa extension)
  • Total: ~18,043 reference genomes if all downloaded successfully
Requirements:
  • Internet connection for downloading
  • ~50-100GB disk space
  • dataset_file_list.txt and ena_mags_list.tsv files in script directory
Key Outputs:
  • All files standardized with .fa extension in single directory
bash
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps rumen_refmags_download \
  --dataset_dir /path/to/store/reference_mags \
  --batch_size 1 \
  --cpus 4 \
  --memory "16G" \
  --time "0-12:00:00" \
  --log_dir ./logs
ℹ️ Note

The download process checks for existing files and skips already downloaded genomes, making it resumable if interrupted.

Step 7.2: Dereplicate Reference MAGs

What it does:
  • Uses dRep to dereplicate the downloaded reference MAG collection
  • Removes redundant genomes based on ANI thresholds
  • Selects highest quality representative from each cluster
dRep Parameters (Hardcoded):
  • Completeness threshold: 80%
  • Contamination threshold: 10%
  • ANI method: fastANI
  • Secondary clustering: 95% ANI
Key Outputs:
  • dereplicated_genomes/ - Representative genomes
  • data_tables/genomeInfo.csv - Quality and clustering information
bash
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps rumen_refmags_drep \
  --dataset_dir /path/to/downloaded/reference_mags \
  --drep_output_dir /path/to/dereplicated_output \
  --batch_size 1 \
  --cpus 32 \
  --memory "100G" \
  --time "1-00:00:00" \
  --log_dir ./logs

Step 7.3: Classify Reference MAGs with GTDB-Tk

What it does:
  • Runs GTDB-Tk classify_wf on reference MAGs
  • Identifies novel MAGs (without species-level classification)
  • Processes bacterial and archaeal domains separately
  • Can skip GTDB-Tk if already run (--only_novel flag)
Key Outputs:
  • classify/gtdbtk.bac120.summary.tsv - Bacterial classifications
  • classify/gtdbtk.ar53.summary.tsv - Archaeal classifications
  • novel_rumenref_mags/mags/ - Novel MAG files
  • novel_rumenref_mags/taxonomy/ - Novel MAG taxonomy files
bash
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps rumen_refmags_gtdbtk \
  --genome_dir /path/to/dereplicated_genomes \
  --gtdbtk_out_dir /path/to/gtdbtk_output \
  --novel_output_dir /path/to/novel_mags_output \
  --genome_extension ".fa" \
  --batch_size 1 \
  --cpus 64 \
  --memory "200G" \
  --time "0-06:00:00" \
  --log_dir ./logs

# To skip GTDB-Tk if already run (only identify novel MAGs):
# Add --only_novel flag

Step 7.4: Integrate Project MAGs with Reference MAGs

What it does:
  • Combines your project MAGs with reference MAGs
  • Performs co-dereplication of combined set
  • Identifies which project MAGs are truly novel
  • Creates unified non-redundant collection
Key Outputs:
  • Novel_Mags/UniqueProjMags/ - Project-specific unique MAGs
  • Novel_Mags/UniqueRefMags/ - Reference-specific unique MAGs
  • uniqueness_summary.txt - Dereplication statistics
bash
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps rumen_drep \
  --ref_mags_dir /path/to/reference/dereplicated_genomes \
  --batch_size 1 \
  --cpus 32 \
  --memory "100G" \
  --time "0-12:00:00" \
  --log_dir ./logs

Step 8: Novel MAG Processing

Step 8.1: Identify Novel MAGs

What it does:
  • Analyzes GTDB-Tk results to find MAGs without species designation
  • Extracts novel genome candidates from classification
  • Copies novel MAGs to dedicated directory
Key Outputs:
  • Novel_Mags/UniqueMags/ - Candidate novel MAGs
bash
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps identify_novel_mags \
  --custom_gtdbtk_dir /path/to/gtdbtk/results \
  --batch_size 1 \
  --cpus 8 \
  --memory "32G" \
  --time "0-02:00:00" \
  --log_dir ./logs

Step 8.2: Process Novel MAGs (Complete Pipeline)

What it does:
  • Comprehensive pipeline for novel MAG processing
  • Dereplicates against existing repository
  • For rumen data: dereplicates against reference MAGs
  • Adds to MAGs repository
  • Builds/updates Kraken database
Required for Rumen Data:
  • --is_rumen_data flag
  • --rumen_ref_mags_dir (reference MAGs)
  • --rumen_added_mags_dir (previously added MAGs)
Key Outputs:
  • Novel_Mags/filtered_NMAGs/ - Post-repository dereplication
  • Novel_Mags/true_novel_MAGs/ - Final novel MAGs
  • MAGs_Repository/ - Updated repository
  • Kraken_Database/ - Updated Kraken2 database
bash
# For general data:
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps process_novel_mags \
  --kraken_db_path /path/to/kraken_db \
  --merge_mode \
  --batch_size 1 \
  --cpus 64 \
  --memory "128G" \
  --time "0-12:00:00" \
  --log_dir ./logs

# For rumen data:
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps process_novel_mags \
  --is_rumen_data \
  --rumen_ref_mags_dir /path/to/rumen/reference/mags \
  --rumen_added_mags_dir /path/to/rumen/added/mags \
  --kraken_db_path /path/to/kraken_db \
  --merge_mode \
  --batch_size 1 \
  --cpus 64 \
  --memory "128G" \
  --time "0-12:00:00" \
  --log_dir ./logs

⚠️ Merge Mode

--merge_mode: Updates existing Kraken database (recommended)

--no_merge: Creates fresh database (use carefully)

Step 8.3: Add MAGs to Repository

What it does:
  • Adds validated novel MAGs to central repository
  • Maintains collection for future reuse
Key Outputs:
  • MAGs_Repository/ - Updated with new MAGs
bash
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps add_mags_to_repo \
  --batch_size 1 \
  --cpus 8 \
  --memory "16G" \
  --time "0-01:00:00" \
  --log_dir ./logs

Step 8.4: Build Kraken Database

What it does:
  • Creates/updates Kraken2 database with novel MAGs
  • Uses advanced taxonomy resolver for conflicts
  • Builds Bracken database for abundance estimation
Processing Steps:
  1. Placeholder taxa assignment
  2. GTDB to Taxdump conversion
  3. Header renaming for Kraken
  4. Add to library
  5. Build Kraken database
  6. Build Bracken database
  7. Build distribution
Key Outputs:
  • Kraken_Database/taxonomy/ - nodes.dmp, names.dmp, taxid.map
  • Kraken_Database/library/ - MAG sequences
  • Kraken_Database/*.k2d - Database files
  • Bracken database files for abundance estimation
bash
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps build_kraken_db \
  --kraken_db_path /path/to/kraken_db \
  --batch_size 1 \
  --cpus 64 \
  --memory "200G" \
  --time "1-00:00:00" \
  --log_dir ./logs

ℹ️ Required Scripts

The pipeline requires these scripts in the main directory:

  • gtdb_to_taxdump_latest_resolve.py - Advanced taxonomy resolver
  • header_kraken.py - Header renaming script

Recommended Workflow for Rumen Microbiome Data

Download References (7.1) → Dereplicate References (7.2) → Classify References (7.3) → Process Your Samples (Steps 1-6) → Integrate with References (7.4) → Identify Novel MAGs (8.1) → Process Novel MAGs (8.2) → Build Database (8.4) → Continue Pipeline (Steps 9+)

Step 9: Functional Annotation

ℹ️ Comprehensive Functional Analysis Pipeline

The functional annotation pipeline includes three main components: EggNOG annotation for orthology groups, dbCAN for CAZyme identification, and integrated functional analysis with visualization.

Step 9.1: EggNOG Annotation

What it does:
  • Predicts proteins from MAGs using Prodigal (if not already done)
  • Annotates genes with functional information using EggNOG-mapper
  • Uses Diamond for fast alignment (falls back to HMMER if needed)
  • Extracts KO terms, COG categories, and CAZy annotations
  • Creates temporary database in /dev/shm for performance
Key Outputs:
  • eggnog_output/*.emapper.annotations - Raw EggNOG annotations
  • functional_annotations/KOs/ - KEGG Orthology terms
  • functional_annotations/COGs/ - COG categories
  • functional_annotations/CAZy/ - CAZyme annotations
Configuration Requirements:
  • eggnog_mapper path in config
  • eggnog_db_dir path in config
  • prodigal path in config
bash
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps eggnog_annotation \
  --mags_dir /path/to/final_mags \
  --batch_size 1 \
  --cpus 32 \
  --memory "64G" \
  --time "0-12:00:00" \
  --log_dir ./logs

Step 9.2: dbCAN CAZyme Annotation

What it does:
  • Identifies Carbohydrate-Active Enzymes (CAZymes)
  • Predicts proteins if needed (shares with EggNOG)
  • Runs dbCAN2 with HMMER for CAZyme detection
  • Summarizes CAZyme families and categories
  • Creates visualization plots automatically
  • Groups CAZymes by substrate specificity
Key Outputs:
  • dbcan_output/*_dbcan_results/ - dbCAN results per MAG
  • cazyme_annotations/cazyme_counts/ - Count matrices
  • cazyme_annotations/*.xlsx - Excel summaries
  • cazyme_annotations/plots/ - Visualization plots:
    • CAZyme heatmaps
    • Category distributions
    • Substrate-specific groupings
    • Dendrogram clustering
Configuration Requirements:
  • dbcan path in config
  • dbcan_db_dir path in config
  • dbcan_activate conda environment path (optional)
  • prodigal path in config
bash
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps dbcan_annotation \
  --mags_dir /path/to/final_mags \
  --batch_size 1 \
  --cpus 32 \
  --memory "32G" \
  --time "0-06:00:00" \
  --log_dir ./logs

ℹ️ Automatic Visualizations

dbCAN automatically generates multiple plots including heatmaps, category distributions, substrate groupings, and clustering dendrograms. These require matplotlib, seaborn, and pandas to be installed.

Step 9.3: Integrated Functional Analysis

What it does:
  • Integrates results from EggNOG and dbCAN
  • Maps KO terms to KEGG pathway hierarchy
  • Creates comprehensive functional profiles
  • Generates advanced visualizations
  • Produces HTML summary report
  • Optional species-based analysis
Key Outputs:
  • functional_analysis/ - Main analysis directory with:
    • files/ko/ - KO term matrices
    • files/cog/ - COG category matrices
    • files/kegg/ - KEGG pathway mappings
    • plots/ - Various visualization types:
      • COG main category distributions
      • KEGG Sankey diagrams
      • Network visualizations
      • Hierarchical clustered heatmaps
      • Circular plots
      • Treemaps and bubble plots
    • reports/functional_analysis_report.html - Interactive HTML report
Required Files:
  • ko.txt - KEGG database file (optional but recommended)
  • EggNOG annotation results from Step 9.1
bash
# Basic functional analysis
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps functional_analysis \
  --batch_size 1 \
  --cpus 8 \
  --memory "32G" \
  --time "0-02:00:00" \
  --log_dir ./logs

# With KEGG database file for pathway mapping
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps functional_analysis \
  --kegg_db_file /path/to/ko.txt \
  --batch_size 1 \
  --cpus 8 \
  --memory "32G" \
  --time "0-02:00:00" \
  --log_dir ./logs

⚠️ KEGG Database File

The ko.txt file enables KEGG pathway hierarchy mapping. Without it, the analysis will skip pathway-level visualizations but still process COG and KO annotations.

Step 9.4: Advanced Visualizations

Prerequisites:
  • ✅ Must complete Step 9.1 (EggNOG annotation) first
  • ✅ Must complete Step 9.3 (Functional Analysis) first
  • Optional: Step 9.2 (dbCAN) for CAZyme visualizations
What it does:
  • Creates comprehensive publication-quality visualizations
  • Generates two complete sets: "All MAGs" and "Novel MAGs Only"
  • Produces 8+ visualization types automatically:
    • CAZyme heatmaps (clustered and phylum-grouped)
    • COG functional Sankey diagrams
    • KEGG pathway Sankey diagrams
    • KEGG metabolism detailed flows
    • Functional PCA analyses
    • Network visualizations with pie charts
    • Integrated multi-omics views
  • Automatically handles MAG naming convention differences
  • Limits some visualizations to 30-40 MAGs for clarity
Key Outputs:
  • Advanced_visualizations{suffix}/ - Main directory with all MAGs
  • Advanced_visualizations{suffix}/Novel_MAGs_Only/ - Novel MAGs subset
  • Output formats: PDF, PNG, SVG, and interactive HTML
  • Word-optimized versions with larger fonts (*_WORD_COMPACT.png)
Required Python Libraries:
  • matplotlib, seaborn, pandas, numpy
  • scipy, scikit-learn
  • plotly (optional - for interactive plots)
  • networkx (optional - for network analysis)
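
If any of these libraries are missing from the metamag environment, they can usually be installed with pip or conda; a hedged example using the package names listed above:

bash
# Sketch: add the plotting/analysis libraries to the active environment
pip install matplotlib seaborn pandas numpy scipy scikit-learn plotly networkx
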
bash
# Run advanced visualizations
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps advanced_visualizations \
  --batch_size 1 \
  --cpus 8 \
  --memory "32G" \
  --time "0-04:00:00" \
  --log_dir ./logs

# With output suffix for organization
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps advanced_visualizations \
  --viz_output_suffix "_final" \
  --batch_size 1 \
  --cpus 8 \
  --memory "32G" \
  --time "0-04:00:00" \
  --log_dir ./logs

ℹ️ Processing Notes

  • Runs all visualizations automatically in two phases
  • Phase 1: Analyzes all MAGs in the project
  • Phase 2: Creates separate visualizations for novel MAGs only
  • Use --viz_output_suffix to organize multiple runs
  • Large projects may take 2-4 hours to complete

Running Complete Functional Annotation Pipeline

bash
# Run all functional annotation steps sequentially
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps eggnog_annotation dbcan_annotation functional_analysis \
  --mags_dir /path/to/final_mags \
  --kegg_db_file /path/to/ko.txt \
  --batch_size 1 \
  --cpus 32 \
  --memory "64G" \
  --time "0-18:00:00" \
  --log_dir ./logs

Step 10: Abundance Estimation

ℹ️ Comprehensive Abundance Analysis Pipeline

This step performs taxonomic classification and abundance estimation using your custom Kraken2 database (with novel MAGs) and Bracken for accurate abundance quantification.

Prerequisites

Required Before Running:
  • ✅ Host removal completed (Step 2b)
  • ✅ Kraken2 database built (Step 8.4)
  • ✅ Bracken database files exist (built with Step 8.4)

Option 1: Single Sample Kraken Classification

What it does:
  • Runs Kraken2 classification on individual samples
  • Generates per-sample classification reports
  • Simple classification without abundance refinement
Key Outputs:
  • Kraken_Abundance/{sample}_kraken.txt - Raw classifications
  • Kraken_Abundance/{sample}_kreport.txt - Kraken reports
bash
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps kraken_abundance \
  --kraken_db_path /path/to/kraken_db \
  --batch_size 10 \
  --cpus 16 \
  --memory "32G" \
  --time "0-04:00:00" \
  --log_dir ./logs

Option 2: Comprehensive Abundance Estimation (Recommended)

What it does:
  • Runs Kraken2 classification on all samples
  • Refines abundances using Bracken at multiple taxonomic levels
  • Combines results into abundance matrices
  • Generates comprehensive visualizations
  • Creates interactive HTML dashboard
  • Supports smart checkpointing (skips completed steps)
Key Features:
  • Smart Checkpointing: Automatically skips already completed steps
  • Multiple Taxonomic Levels: Analyzes at Species, Genus, etc.
  • Metadata Integration: Group comparisons if metadata provided
  • Automatic Visualizations: 8+ plot types generated
Key Outputs:
  • Kraken_Abundance/ - Raw Kraken2 outputs
  • Bracken_Abundance_{level}/ - Bracken estimates per taxonomic level
  • Merged_Bracken_Outputs/ - Combined abundance matrices
    • merged_bracken_S.txt - Species abundances
    • merged_bracken_G.txt - Genus abundances
  • Abundance_Plots/ - Visualizations
    • Top taxa barplots
    • Abundance heatmaps with clustering
    • PCA analysis plots
    • Alpha diversity metrics (Shannon, Simpson, Richness)
    • Stacked composition plots
    • Group comparison boxplots (if metadata provided)
    • Beta diversity analysis (if metadata provided)
  • abundance_summary.html - Interactive dashboard
Basic Usage (without metadata):
bash
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps abundance_estimation \
  --kraken_db_path /path/to/kraken_db \
  --batch_size 50 \
  --cpus 32 \
  --memory "64G" \
  --time "0-06:00:00" \
  --log_dir ./logs
Advanced Usage (with metadata for group comparisons):
bash
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps abundance_estimation \
  --kraken_db_path /path/to/kraken_db \
  --taxonomic_levels S G F \
  --read_length 150 \
  --threshold 10 \
  --metadata_file /path/to/metadata.csv \
  --batch_size 50 \
  --cpus 32 \
  --memory "64G" \
  --time "0-06:00:00" \
  --log_dir ./logs
Metadata File Format:
csv
Sample,Treatment
Sample_001,Control
Sample_002,Control
Sample_003,Treatment_A
Sample_004,Treatment_A
Sample_005,Treatment_B

⚠️ Important Notes

  • Kraken Database Path: Must point to the database built in Step 8.4
  • Bracken Requirements: Database must have kmer distribution files (database{read_length}mers.kmer_distrib)
  • Input Files: Uses host-removed reads from Host_Removal directory
  • Taxonomic Levels: S=Species, G=Genus, F=Family, O=Order, C=Class, P=Phylum
  • Force Rerun: Delete output directories to force re-execution if needed
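
A quick hedged check that the Bracken distribution matching your --read_length is present before submitting (file name pattern as noted above):

bash
# Sketch: confirm the Bracken k-mer distribution exists for the chosen read length
KRAKEN_DB="/path/to/kraken_db"
READ_LENGTH=150
ls "${KRAKEN_DB}/database${READ_LENGTH}mers.kmer_distrib" \
  || echo "Missing distribution file - rebuild with Step 8.4 (build_kraken_db)"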

Parameters Explained

Configurable Parameters:
  • --kraken_db_path (Required): Path to Kraken2 database
  • --taxonomic_levels: List of levels to analyze (default: S G)
  • --read_length: Read length for Bracken (default: 150)
  • --threshold: Minimum reads for Bracken classification (default: 10)
  • --metadata_file: CSV with Sample and Treatment columns (optional)
Resource Recommendations:
  • CPUs: 16-32 (Kraken2 is highly parallel)
  • Memory: 32-64GB (depends on database size)
  • Time: 4-6 hours for 50 samples

Visualization Examples

Generated Visualizations Include:
  1. Top Taxa Barplot: Shows most abundant species/genera
  2. Abundance Heatmap: Hierarchical clustering of samples and taxa
  3. PCA Analysis: Sample relationships based on composition
  4. Diversity Metrics: Shannon, Simpson, and Richness indices
  5. Stacked Composition: Relative abundances per sample
  6. Group Comparisons: Statistical comparisons between treatments (if metadata)
  7. Beta Diversity: Within vs between group distances (if metadata)

Step 11: Phylogenetic Tree Generation and Visualization

ℹ️ Optional Advanced Analysis

Generate custom phylogenetic trees and create publication-quality visualizations with integrated taxonomic and functional annotations.

Step 11a: Generate Phylogenetic Tree

What it does:
  • Builds custom phylogenetic tree using GTDB-Tk de_novo_wf
  • Uses only user MAGs (excludes GTDB reference genomes)
  • Requires outgroup taxon for proper tree rooting
  • Creates standardized tree files for visualization
Requirements:
  • Must complete GTDB-Tk classification (Step 6) first
  • Requires dereplicated MAGs from dRep
  • Outgroup taxon must be specified
Key Outputs:
  • Phylogeny/gtdbtk/taxonomy2/gtdbtk.tree - Main tree file
  • Phylogeny/gtdbtk/taxonomy2/trees/ - All tree files
  • custom_taxonomy.txt - Taxonomy mapping file
bash
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps mags_tree \
  --mags_dir $WORK_DIR/Bin_Refinement/drep/dRep_output/dereplicated_genomes \
  --outgroup_taxon "p__Firmicutes" \
  --batch_size 1 \
  --cpus 32 \
  --memory "100G" \
  --time "0-06:00:00" \
  --log_dir ./logs

⚠️ Outgroup Selection

Choose an appropriate outgroup taxon based on your data:

  • p__Firmicutes - Common for Bacteroidetes-rich samples
  • c__Bacilli - More specific class-level outgroup
  • p__Proteobacteria - For Firmicutes-dominated samples

Step 11b: Basic Tree Visualization

What it does:
  • Creates basic circular phylogenetic tree visualization
  • Adds taxonomic information as colored rings
  • Generates publication-ready PDF output
Key Outputs:
  • tree_visualization.pdf - Basic circular tree
bash
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps tree_visualization \
  --taxonomy_source $WORK_DIR/Novel_Mags/gtdbtk/gtdbtk.bac120.summary.tsv \
  --batch_size 1 \
  --cpus 8 \
  --memory "32G" \
  --time "0-02:00:00" \
  --log_dir ./logs

Step 11c: Enhanced Tree Visualization with Functional Annotations

What it does:
  • Creates multiple publication-quality tree visualizations
  • Integrates taxonomic, functional, and novel MAG information
  • Generates 4 different visualization types automatically
  • Adds functional annotation as pie charts or rings
Visualization Types Created:
  • Simplified Taxonomic (_simplified_taxonomic.pdf)
    • Circular tree with phylum and genus rings only
    • Novel MAG highlighting with red stars
  • Rectangular Tree (_rectangular_tree.pdf)
    • Rectangular layout with colored taxonomy strips
    • Enhanced fonts for publication quality
  • Multi-functional (_multi_functional.pdf)
    • Circular tree with CAZyme, COG, and KEGG pie charts
    • Multiple annotation rings
  • Enhanced Beautiful (_enhanced_beautiful.pdf)
    • Single functional annotation focus
    • CAZyme category pie charts
Key Outputs:
  • Phylogeny/gtdbtk/taxonomy2/enhanced_visualizations/ - All visualization files
  • enhanced_visualizations_finalrec/ - Rectangular and simplified trees
  • enhanced_visualizations_pie/ - Pie chart visualizations
bash
# Full enhanced visualization with all annotations
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps enhanced_tree_visualization \
  --taxonomy_source $WORK_DIR/Novel_Mags/gtdbtk/gtdbtk.bac120.summary.tsv \
  --novel_mags_file $WORK_DIR/Novel_Mags/novel_mags_list.txt \
  --cazyme_annotations $WORK_DIR/Functional_Annotation/cazyme_summary.xlsx \
  --cog_annotations $WORK_DIR/Functional_Annotation/cog_summary.xlsx \
  --kegg_annotations $WORK_DIR/Functional_Annotation/kegg_summary.xlsx \
  --annotation_type auto \
  --batch_size 1 \
  --cpus 8 \
  --memory "32G" \
  --time "0-02:00:00" \
  --log_dir ./logs

# Simplified version with only CAZyme annotations
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps enhanced_tree_visualization \
  --taxonomy_source $WORK_DIR/Novel_Mags/gtdbtk/gtdbtk.bac120.summary.tsv \
  --cazyme_annotations $WORK_DIR/Functional_Annotation/cazyme_summary.xlsx \
  --batch_size 1 \
  --cpus 8 \
  --memory "32G" \
  --time "0-02:00:00" \
  --log_dir ./logs

ℹ️ Optional Parameters

  • --novel_mags_file: Text file with novel MAG IDs (one per line)
  • --functional_annotations: Path to annotation directory (auto-detected)
  • --annotation_type: auto, eggnog, kegg, or dbcan
  • Can provide any combination of cazyme, cog, and kegg annotations
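
One hedged way to produce the novel MAG ID list, assuming identify_novel_mags copied the novel MAGs to Novel_Mags/UniqueMags/ and that MAG IDs match the .fa basenames:

bash
# Sketch: build novel_mags_list.txt (one MAG ID per line)
ls "$WORK_DIR"/Novel_Mags/UniqueMags/*.fa \
  | xargs -n1 basename \
  | sed 's/\.fa$//' > "$WORK_DIR"/Novel_Mags/novel_mags_list.txt
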
🔧 R Environment Requirements

The visualization steps require R with specific packages:

  • Required: ape, RColorBrewer, jsonlite
  • Optional (enhanced): ggtree, ggplot2

Install R packages:

R
install.packages(c("ape", "RColorBrewer", "jsonlite"))
# For enhanced visualizations (optional):
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ggtree")

💡 Best Practices

  • Always run GTDB-Tk classification before tree generation
  • Use dereplicated MAGs for cleaner trees
  • Choose outgroup taxon from a distantly related phylum
  • Run functional annotation steps before enhanced visualization
  • Create novel_mags_list.txt from identify_novel_mags output
  • Use higher memory (32-64GB) for large trees (>100 MAGs)
bash
# Optional: Phylogenetic tree generation (Step 11a)
echo "Generating phylogenetic tree..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps mags_tree \
  --mags_dir $WORK_DIR/Bin_Refinement/drep/dRep_output/dereplicated_genomes \
  --outgroup_taxon "p__Firmicutes" \
  --batch_size 1 \
  --cpus 32 \
  --memory "100G" \
  --time "0-06:00:00" \
  --log_dir $LOG_DIR

echo "Creating enhanced tree visualizations..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps enhanced_tree_visualization \
  --taxonomy_source $WORK_DIR/Novel_Mags/gtdbtk/gtdbtk.bac120.summary.tsv \
  --novel_mags_file $WORK_DIR/Novel_Mags/novel_mags_list.txt \
  --cazyme_annotations $WORK_DIR/Functional_Annotation/cazyme_summary.xlsx \
  --batch_size 1 \
  --cpus 8 \
  --memory "32G" \
  --time "0-02:00:00" \
  --log_dir $LOG_DIR

Complete Pipeline Script

ℹ️ Comprehensive Pipeline Script

This script includes all pipeline steps with proper parameters. Uncomment the sections you need and adjust paths/parameters for your project.

bash
#!/bin/bash
#SBATCH -p ghpc
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --mem=200G
#SBATCH -t 15-00:00:00
#SBATCH --output=pipeline_complete_%j.out
#SBATCH --error=pipeline_complete_%j.err

# ============================================================================
# Complete MetaMAG Explorer Pipeline
# Version: 1.0.0
# ============================================================================

# CRITICAL: Set environment
CONDA_ENV="/your/path/miniconda3/bin/activate metamag"
METAMAG_DIR="/path/to/MetaMAG-1.0.0"
cd $METAMAG_DIR
source $CONDA_ENV

# === Core Variables (MODIFY THESE) ===
WORK_DIR="/path/to/project"
SAMPLES="/path/to/samples.txt"
PROJECT_CONFIG="/path/to/project_config.yaml"
REFERENCE="/path/to/host_genome.fa"
LOG_DIR="$WORK_DIR/logs"
mkdir -p $LOG_DIR

# === Database Paths ===
KRAKEN_DB="/path/to/kraken_db"
KEGG_DB="/path/to/ko.txt"
GTDBTK_DB="/path/to/gtdb/release"

# === Optional: Rumen-specific paths ===
RUMEN_REF_DIR="/path/to/rumen/reference_mags"
RUMEN_ADDED_DIR="/path/to/rumen/added_mags"
DATASET_DIR="/path/to/store/reference_mags"

echo "=========================================="
echo "Starting MetaMAG Explorer Pipeline"
echo "Project: $WORK_DIR"
echo "Date: $(date)"
echo "=========================================="

# ============================================================================
# PHASE 1: QUALITY CONTROL & PREPROCESSING
# ============================================================================

# Step 1.1: Quality Control
echo "[$(date)] Step 1.1: Running QC..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps qc \
  --batch_size 20 \
  --cpus 4 \
  --memory "8G" \
  --time "0-02:00:00" \
  --log_dir $LOG_DIR

# Step 1.2: MultiQC Report
echo "[$(date)] Step 1.2: Running MultiQC..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps multiqc \
  --input_dir $WORK_DIR/QC \
  --batch_size 1 \
  --cpus 2 \
  --memory "4G" \
  --time "0-00:30:00" \
  --log_dir $LOG_DIR

# Step 2.1: Trimming
echo "[$(date)] Step 2.1: Running trimming..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps trimming \
  --batch_size 10 \
  --cpus 8 \
  --memory "16G" \
  --time "0-04:00:00" \
  --log_dir $LOG_DIR

# Step 2.2: Host Removal
echo "[$(date)] Step 2.2: Running host removal..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps host_removal \
  --reference $REFERENCE \
  --batch_size 10 \
  --cpus 16 \
  --memory "32G" \
  --time "0-06:00:00" \
  --log_dir $LOG_DIR

# ============================================================================
# PHASE 2: ASSEMBLY
# ============================================================================

# Step 3.1: Single Assembly (IDBA-UD)
echo "[$(date)] Step 3.1: Running single assembly..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps single_assembly \
  --batch_size 5 \
  --cpus 32 \
  --memory "100G" \
  --time "1-00:00:00" \
  --log_dir $LOG_DIR

# Step 3.2: Co-Assembly (MEGAHIT)
echo "[$(date)] Step 3.2: Running co-assembly..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps co_assembly \
  --batch_size 50 \
  --cpus 64 \
  --memory "200G" \
  --time "2-00:00:00" \
  --log_dir $LOG_DIR

# ============================================================================
# PHASE 3: BINNING & REFINEMENT
# ============================================================================

# Step 4.1: Binning
echo "[$(date)] Step 4.1: Running binning..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps single_binning \
  --binning_methods metabat2 maxbin2 concoct \
  --batch_size 10 \
  --cpus 32 \
  --memory "64G" \
  --time "0-12:00:00" \
  --log_dir $LOG_DIR

# Step 4.2: Bin Refinement with DAS Tool
echo "[$(date)] Step 4.2: Running bin refinement..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps single_bin_refinement \
  --score_threshold 0.5 \
  --batch_size 10 \
  --cpus 32 \
  --memory "64G" \
  --time "0-06:00:00" \
  --log_dir $LOG_DIR

# ============================================================================
# PHASE 4: QUALITY ASSESSMENT & DEREPLICATION
# ============================================================================

# Step 5.1: Quality Assessment with CheckM
echo "[$(date)] Step 5.1: Running CheckM evaluation..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps evaluation \
  --batch_size 50 \
  --cpus 32 \
  --memory "64G" \
  --time "0-06:00:00" \
  --log_dir $LOG_DIR

# Step 5.2: Dereplication with dRep
echo "[$(date)] Step 5.2: Running dRep..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps dRep \
  --batch_size 50 \
  --cpus 32 \
  --memory "64G" \
  --time "0-06:00:00" \
  --log_dir $LOG_DIR

# ============================================================================
# PHASE 5: TAXONOMIC CLASSIFICATION
# ============================================================================

# Step 6: GTDB-Tk Classification
echo "[$(date)] Step 6: Running GTDB-Tk..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps gtdbtk \
  --batch_size 1 \
  --cpus 64 \
  --memory "200G" \
  --time "0-06:00:00" \
  --log_dir $LOG_DIR

# ============================================================================
# PHASE 6: RUMEN REFERENCE MAGs (OPTIONAL - for rumen microbiome projects)
# ============================================================================

# # Step 7.1: Download Rumen Reference MAGs
# echo "[$(date)] Step 7.1: Downloading rumen reference MAGs..."
# python3 -m MetaMAG.main \
#   --project_config $PROJECT_CONFIG \
#   --steps rumen_refmags_download \
#   --dataset_dir $DATASET_DIR \
#   --batch_size 1 \
#   --cpus 4 \
#   --memory "16G" \
#   --time "0-12:00:00" \
#   --log_dir $LOG_DIR

# # Step 7.2: Dereplicate Reference MAGs
# echo "[$(date)] Step 7.2: Dereplicating reference MAGs..."
# python3 -m MetaMAG.main \
#   --project_config $PROJECT_CONFIG \
#   --steps rumen_refmags_drep \
#   --dataset_dir $DATASET_DIR \
#   --drep_output_dir $RUMEN_REF_DIR/drep_output \
#   --batch_size 1 \
#   --cpus 32 \
#   --memory "100G" \
#   --time "1-00:00:00" \
#   --log_dir $LOG_DIR

# # Step 7.3: Classify Reference MAGs
# echo "[$(date)] Step 7.3: Classifying reference MAGs..."
# python3 -m MetaMAG.main \
#   --project_config $PROJECT_CONFIG \
#   --steps rumen_refmags_gtdbtk \
#   --genome_dir $RUMEN_REF_DIR/drep_output/dereplicated_genomes \
#   --gtdbtk_out_dir $RUMEN_REF_DIR/gtdbtk_output \
#   --novel_output_dir $RUMEN_REF_DIR/novel_mags \
#   --genome_extension ".fa" \
#   --batch_size 1 \
#   --cpus 64 \
#   --memory "200G" \
#   --time "0-06:00:00" \
#   --log_dir $LOG_DIR

# # Step 7.4: Integrate with Project MAGs
# echo "[$(date)] Step 7.4: Integrating with reference MAGs..."
# python3 -m MetaMAG.main \
#   --project_config $PROJECT_CONFIG \
#   --samples-file $SAMPLES \
#   --steps rumen_drep \
#   --ref_mags_dir $RUMEN_REF_DIR/drep_output/dereplicated_genomes \
#   --batch_size 1 \
#   --cpus 32 \
#   --memory "100G" \
#   --time "0-12:00:00" \
#   --log_dir $LOG_DIR

# ============================================================================
# PHASE 7: NOVEL MAG PROCESSING
# ============================================================================

# Step 8.1: Identify Novel MAGs
echo "[$(date)] Step 8.1: Identifying novel MAGs..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps identify_novel_mags \
  --custom_gtdbtk_dir $WORK_DIR/Novel_Mags/gtdbtk \
  --batch_size 1 \
  --cpus 8 \
  --memory "32G" \
  --time "0-02:00:00" \
  --log_dir $LOG_DIR

# Step 8.2: Process Novel MAGs (General data)
echo "[$(date)] Step 8.2: Processing novel MAGs..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps process_novel_mags \
  --kraken_db_path $KRAKEN_DB \
  --merge_mode \
  --batch_size 1 \
  --cpus 64 \
  --memory "128G" \
  --time "0-12:00:00" \
  --log_dir $LOG_DIR

# # For rumen data, use this instead:
# python3 -m MetaMAG.main \
#   --project_config $PROJECT_CONFIG \
#   --steps process_novel_mags \
#   --is_rumen_data \
#   --rumen_ref_mags_dir $RUMEN_REF_DIR \
#   --rumen_added_mags_dir $RUMEN_ADDED_DIR \
#   --kraken_db_path $KRAKEN_DB \
#   --merge_mode \
#   --batch_size 1 \
#   --cpus 64 \
#   --memory "128G" \
#   --time "0-12:00:00" \
#   --log_dir $LOG_DIR

# Step 8.3: Add MAGs to Repository
echo "[$(date)] Step 8.3: Adding MAGs to repository..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps add_mags_to_repo \
  --batch_size 1 \
  --cpus 8 \
  --memory "16G" \
  --time "0-01:00:00" \
  --log_dir $LOG_DIR

# Step 8.4: Build/Update Kraken Database
echo "[$(date)] Step 8.4: Building Kraken database..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps build_kraken_db \
  --kraken_db_path $KRAKEN_DB \
  --batch_size 1 \
  --cpus 64 \
  --memory "200G" \
  --time "1-00:00:00" \
  --log_dir $LOG_DIR

# ============================================================================
# PHASE 8: FUNCTIONAL ANNOTATION
# ============================================================================

# Step 9.1: EggNOG Annotation
echo "[$(date)] Step 9.1: Running EggNOG annotation..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps eggnog_annotation \
  --mags_dir $WORK_DIR/Bin_Refinement/drep/dRep_output/dereplicated_genomes \
  --batch_size 1 \
  --cpus 32 \
  --memory "64G" \
  --time "0-12:00:00" \
  --log_dir $LOG_DIR

# Step 9.2: dbCAN CAZyme Annotation
echo "[$(date)] Step 9.2: Running dbCAN annotation..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps dbcan_annotation \
  --mags_dir $WORK_DIR/Bin_Refinement/drep/dRep_output/dereplicated_genomes \
  --batch_size 1 \
  --cpus 32 \
  --memory "32G" \
  --time "0-06:00:00" \
  --log_dir $LOG_DIR

# Step 9.3: Integrated Functional Analysis
echo "[$(date)] Step 9.3: Running functional analysis..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps functional_analysis \
  --kegg_db_file $KEGG_DB \
  --batch_size 1 \
  --cpus 8 \
  --memory "32G" \
  --time "0-02:00:00" \
  --log_dir $LOG_DIR

# Step 9.4: Advanced Visualizations
echo "[$(date)] Step 9.4: Creating advanced visualizations..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps advanced_visualizations \
  --viz_output_suffix "_final" \
  --batch_size 1 \
  --cpus 8 \
  --memory "32G" \
  --time "0-04:00:00" \
  --log_dir $LOG_DIR

# ============================================================================
# PHASE 9: ABUNDANCE ESTIMATION
# ============================================================================

# Step 10: Comprehensive Abundance Estimation
echo "[$(date)] Step 10: Running abundance estimation..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps abundance_estimation \
  --kraken_db_path $KRAKEN_DB \
  --taxonomic_levels S G F \
  --read_length 150 \
  --threshold 10 \
  --batch_size 50 \
  --cpus 32 \
  --memory "64G" \
  --time "0-06:00:00" \
  --log_dir $LOG_DIR

# # With metadata for group comparisons:
# python3 -m MetaMAG.main \
#   --project_config $PROJECT_CONFIG \
#   --samples-file $SAMPLES \
#   --steps abundance_estimation \
#   --kraken_db_path $KRAKEN_DB \
#   --taxonomic_levels S G F \
#   --metadata_file $WORK_DIR/metadata.csv \
#   --batch_size 50 \
#   --cpus 32 \
#   --memory "64G" \
#   --time "0-06:00:00" \
#   --log_dir $LOG_DIR

# ============================================================================
# PHASE 10: PHYLOGENETIC ANALYSIS (OPTIONAL)
# ============================================================================

# Step 11.1: Generate Phylogenetic Tree
echo "[$(date)] Step 11.1: Generating phylogenetic tree..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps mags_tree \
  --mags_dir $WORK_DIR/Bin_Refinement/drep/dRep_output/dereplicated_genomes \
  --outgroup_taxon "p__Firmicutes" \
  --batch_size 1 \
  --cpus 32 \
  --memory "100G" \
  --time "0-06:00:00" \
  --log_dir $LOG_DIR

# Step 11.2: Basic Tree Visualization
echo "[$(date)] Step 11.2: Creating basic tree visualization..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps tree_visualization \
  --taxonomy_source $WORK_DIR/Novel_Mags/gtdbtk/gtdbtk.bac120.summary.tsv \
  --batch_size 1 \
  --cpus 8 \
  --memory "32G" \
  --time "0-02:00:00" \
  --log_dir $LOG_DIR

# Step 11.3: Enhanced Tree Visualization with Annotations
echo "[$(date)] Step 11.3: Creating enhanced tree visualizations..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps enhanced_tree_visualization \
  --taxonomy_source $WORK_DIR/Novel_Mags/gtdbtk/gtdbtk.bac120.summary.tsv \
  --novel_mags_file $WORK_DIR/Novel_Mags/novel_mags_list.txt \
  --cazyme_annotations $WORK_DIR/Functional_Annotation/cazyme_summary.xlsx \
  --cog_annotations $WORK_DIR/Functional_Annotation/cog_summary.xlsx \
  --kegg_annotations $WORK_DIR/Functional_Annotation/kegg_summary.xlsx \
  --annotation_type auto \
  --batch_size 1 \
  --cpus 8 \
  --memory "32G" \
  --time "0-02:00:00" \
  --log_dir $LOG_DIR

# ============================================================================
# PIPELINE COMPLETE
# ============================================================================

echo "=========================================="
echo "MetaMAG Explorer Pipeline Complete!"
echo "End time: $(date)"
echo "Output directory: $WORK_DIR"
echo ""
echo "Key outputs:"
echo "  - QC reports: $WORK_DIR/QC"
echo "  - Assemblies: $WORK_DIR/Assembly"
echo "  - MAGs: $WORK_DIR/Bin_Refinement/drep/dRep_output/dereplicated_genomes"
echo "  - Taxonomy: $WORK_DIR/Novel_Mags/gtdbtk"
echo "  - Annotations: $WORK_DIR/Functional_Annotation"
echo "  - Abundances: $WORK_DIR/Kraken_Abundance"
echo "  - Visualizations: $WORK_DIR/Advanced_visualizations_final"
echo "=========================================="

✅ Script Usage Tips

  • Comment out steps you don't need (use # at the beginning of lines)
  • Uncomment optional sections like rumen processing if needed
  • Adjust SLURM parameters (#SBATCH lines) based on your cluster
  • Modify resource allocations (cpus, memory, time) per step as needed
  • Check that all database paths are correctly set before running
  • Consider running phases separately for better monitoring

💡 Recommended Workflow Execution

  1. Phase 1-2: QC, Preprocessing, Assembly (1-2 days)
  2. Phase 3-4: Binning, Quality Assessment (1 day)
  3. Phase 5-7: Taxonomy, Novel MAGs (1 day)
  4. Phase 8: Functional Annotation (1 day)
  5. Phase 9-10: Abundance, Phylogenetics (1 day)

Total estimated time: 5-7 days for a complete run with 50-100 samples

Troubleshooting

💡 Common Troubleshooting Steps

  • Check log files in $LOG_DIR
  • Verify input data quality
  • Ensure all dependencies are installed
  • Adjust computational resources as needed
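
When a step fails, a simple hedged starting point is to scan the most recent logs and SLURM error files for obvious failures:

bash
# Sketch: surface recent errors from pipeline and SLURM logs
grep -riE "error|failed|traceback" "$LOG_DIR" 2>/dev/null | tail -n 50
ls -t *.err 2>/dev/null | head -n 1 | xargs -r tail -n 50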