🚀 Quick Start: Run the Pipeline

If you're looking for the fastest way to execute the pipeline:

Use the provided run.sh script.

It's preconfigured to handle most parameters, and you only need to update your file paths and step names.

What You Need To Do:

  1. Open run.sh in a text editor.
  2. Update these key variables:
    • WORK_DIR -- your project directory
    • SAMPLES -- path to your sample list file
    • PROJECT_CONFIG -- your YAML config
    • STEPS -- choose which step(s) to run
    • CONDA_ENV -- path to your conda environment
    • Other relevant paths (e.g., reference genome, Kraken DB)
  3. Submit the job to SLURM:
    bash
    sbatch run.sh
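
A minimal sketch of the variable block you edit in run.sh (placeholder paths; the exact layout in your copy of run.sh may differ):

bash
# Sketch only - replace every /path/to/... with your own locations
WORK_DIR="/path/to/project"                      # project directory
SAMPLES="/path/to/samples.txt"                   # sample list file
PROJECT_CONFIG="/path/to/project_config.yaml"    # YAML config
STEPS="qc"                                       # step(s) to run
CONDA_ENV="/path/to/miniconda3/bin/activate metamag"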

â„šī¸Note

Alternatively, if you'd like to run steps in a more customized or incremental way, refer to the detailed instructions below.

SLURM Setup Example

Here's a complete example of a SLURM submission script with all necessary components:

bash
#!/bin/bash
#SBATCH -p ghpc                    # Partition/queue name
#SBATCH -N 1                       # Number of nodes
#SBATCH -n 1                       # Number of tasks
#SBATCH --mem=10G                  # Memory allocation
#SBATCH -t 2:30:00                 # Time limit (HH:MM:SS)
#SBATCH --output=pipeline_meta_%j.out    # Standard output log
#SBATCH --error=pipeline_meta_%j.err     # Standard error log

# CRITICAL: Set your conda environment path (adjust to YOUR installation)
CONDA_ENV="/path/conda/activate metamag"

# Navigate to MetaMAG directory
cd /path/to/MetaMAG-1.0.0

# Activate conda environment
source $CONDA_ENV

# Run the pipeline with specified step
python3 -m MetaMAG.main \
  --project_config plant_project_config.yaml \
  --samples-file samples.txt \
  --batch_size 16 \
  --cpus 1 \
  --memory 20G \
  --time 15-12:00:00 \
  --log_dir ./logs/plant/qc \
  --steps qc

⚠️ Adjust SLURM Parameters Based on Your Step

Different pipeline steps require different resources:

  • QC: Low resources (4 CPUs, 8G memory, 2 hours)
  • Trimming: Medium resources (8 CPUs, 16G memory, 4 hours)
  • Assembly: High resources (32-64 CPUs, 100-200G memory, 1-2 days)
  • Binning: High resources (32 CPUs, 64G memory, 12 hours)
  • GTDB-Tk: Very high memory (64 CPUs, 200G memory, 6 hours)

Understanding Batch Size Parameter

ℹ️ Critical: How Batch Size Works

The --batch_size parameter determines how many samples are processed together in a single job.

  • If you have 20 samples and set --batch_size 20 → 1 job processes all 20 samples
  • If you have 20 samples and set --batch_size 10 → 2 parallel jobs, each processing 10 samples
  • If you have 20 samples and set --batch_size 1 → 20 parallel jobs, each processing 1 sample
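
Before launching, it can help to estimate how many parallel jobs a given --batch_size will produce; a throwaway sketch using ceiling division:

bash
# Sketch: estimate the parallel job count for a given batch size
TOTAL_SAMPLES=$(grep -vc '^#' samples.txt)                      # non-comment lines in the sample list
BATCH_SIZE=10
NUM_JOBS=$(( (TOTAL_SAMPLES + BATCH_SIZE - 1) / BATCH_SIZE ))   # ceiling division
echo "${TOTAL_SAMPLES} samples / batch_size ${BATCH_SIZE} -> ${NUM_JOBS} parallel job(s)"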

Steps That Must Run as Single Batch

The following steps process all data together and should always use a batch_size equal to your total sample count:

Single Batch Steps (set batch_size = total samples)
  • multiqc - Combines all QC reports into one
  • co_assembly - Combines all samples for assembly
  • evaluation - Evaluates all MAGs together with CheckM
  • dRep - Dereplicates all MAGs as a set
  • gtdbtk - Classifies all MAGs in one run
  • all Novel MAG steps - Process novel MAGs as a complete set
  • all Functional Annotation steps - Annotate all MAGs together
  • abundance_estimation - Processes all samples, but within a single workflow
  • all Phylogenetic Tree steps - Build tree from all MAGs
  • all Visualization steps - Generate plots from complete data

Steps That Can Be Parallelized

These steps process individual samples and benefit from smaller batch sizes for parallel execution:

Parallelizable Steps (adjust batch_size for parallelization)
  • qc - Each sample's QC can run independently
  • trimming - Each sample trimmed separately
  • host_removal - Each sample processed independently
  • single_assembly - Each sample assembled separately (use batch_size 1-2 due to high memory)
  • single_binning - Each sample's assembly binned separately
  • single_bin_refinement - Each sample's bins refined separately
  • kraken_abundance - Each sample classified independently (but NOT abundance_estimation)

Recommended Batch Size Strategy

✅ General Guidelines

For a project with N samples:

  • Light steps (QC, trimming): batch_size = N/4 to N/2 (run 2-4 parallel jobs)
  • Memory-intensive steps (assembly, binning): batch_size = 1 to 2 (maximize parallel jobs)
  • Single-batch steps: batch_size = N or higher (run as one job)
  • Cluster limits: Consider your cluster's max job submission limit

⚠️ Memory Consideration

Remember: a lower batch_size means more parallel jobs, and therefore more total memory in use across all jobs at once (for example, 20 samples at batch_size 1 with 100G per assembly job can request up to 2 TB cluster-wide). For memory-intensive steps like assembly, still use batch_size 1-2 even if it means many parallel jobs, but check that your cluster can accommodate the combined allocation.

File Naming Requirements

📁 Input File Naming Conventions

Files must follow one of these naming patterns:

  • {sample}_1.fastq.gz and {sample}_2.fastq.gz
  • {sample}_R1.fastq.gz and {sample}_R2.fastq.gz
  • {sample}_forward.fastq.gz and {sample}_reverse.fastq.gz

Configuration Setup

Project Configuration File (ONLY 3 LINES NEEDED)

yaml
# project_config.yaml - ONLY THESE 3 LINES
input_dir: "/path/to/raw/reads"
output_dir: "/path/to/output"
reference: "/path/to/host/genome.fa"  # Optional for host removal

Sample List Format

text
# samples.txt
Sample_001
Sample_002
Sample_003
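
If your reads follow the first naming pattern above, samples.txt can be generated directly from the input directory; a hedged sketch assuming the {sample}_1.fastq.gz convention:

bash
# Sketch: derive sample names from paired read files (assumes {sample}_1.fastq.gz naming)
INPUT_DIR="/path/to/raw/reads"
for f in "$INPUT_DIR"/*_1.fastq.gz; do
  basename "$f" _1.fastq.gz
done > samples.txt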

Step 1: Quality Control

What it does:
  • Runs initial quality assessment on raw sequencing reads
  • Generates quality reports using FastQC
Key Outputs:
  • Quality control reports
  • Sequence quality metrics
Step 1a: Run FastQC
bash
# CORRECT COMMAND FORMAT
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps qc \
  --batch_size 20 \
  --cpus 4 \
  --memory "8G" \
  --time "0-02:00:00" \
  --log_dir ./logs
Step 1b: Generate MultiQC Report
bash
# No samples file needed for MultiQC
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps multiqc \
  --input_dir $WORK_DIR/QC \
  --batch_size 1 \
  --cpus 2 \
  --memory "4G" \
  --time "0-00:30:00" \
  --log_dir ./logs

Step 2: Preprocessing

Step 2a: Trimming (Fastp)
What it does:
  • Removes low-quality bases
  • Trims adapter sequences
  • Filters out poor-quality reads
Key Outputs:
  • Cleaned read files
  • Trimming statistics

Hardcoded parameters: Q20 quality, 30bp minimum length

bash
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps trimming \
  --batch_size 10 \
  --cpus 8 \
  --memory "16G" \
  --time "0-04:00:00" \
  --log_dir ./logs
Step 2b: Host Removal
What it does:
  • Removes host-derived sequences
  • Maps reads against reference genome
  • Extracts non-host (microbial) reads
Key Outputs:
  • Host-free metagenomic reads
  • Mapping statistics
bash
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps host_removal \
  --reference $REFERENCE \
  --batch_size 10 \
  --cpus 16 \
  --memory "32G" \
  --time "0-06:00:00" \
  --log_dir ./logs

Step 3: Assembly

Step 3a: Single Assembly (IDBA-UD)
What it does:
  • Assembles reads for individual samples
  • Generates contigs using IDBA-UD assembler
Key Outputs:
  • Assembled contigs
  • Assembly statistics
bash
# ALWAYS uses IDBA-UD (no --assembler parameter)
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps single_assembly \
  --batch_size 5 \
  --cpus 32 \
  --memory "100G" \
  --time "1-00:00:00" \
  --log_dir ./logs
Step 3b: Co-Assembly (MEGAHIT)
What it does:
  • Combines reads from multiple samples
  • Generates a unified assembly using MEGAHIT
Key Outputs:
  • Co-assembled contigs
  • Merged assembly statistics

Hardcoded: k-mer 21-141, step 12

bash
# ALWAYS uses MEGAHIT with k-mer 21-141
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps co_assembly \
  --batch_size 50 \
  --cpus 64 \
  --memory "200G" \
  --time "2-00:00:00" \
  --log_dir ./logs

Step 4: Binning & Refinement

Step 4a: Binning
What it does:
  • Clusters contigs into potential genome bins
  • Uses multiple binning algorithms (MetaBAT2, MaxBin2, CONCOCT)
Key Outputs:
  • Initial genome bins
  • Binning statistics
bash
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps single_binning \
  --binning_methods metabat2 maxbin2 concoct \
  --batch_size 10 \
  --cpus 32 \
  --memory "64G" \
  --time "0-12:00:00" \
  --log_dir ./logs
Step 4b: Refinement with DAS Tool
What it does:
  • Refines and merges bins from multiple methods
  • Improves bin quality using DAS Tool
  • Filters bins to process only valid naming patterns
Key Outputs:
  • Refined, high-quality genome bins
  • DAS Tool quality scores
  • Filtered bin directories
bash
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps single_bin_refinement \
  --score_threshold 0.5 \
  --batch_size 10 \
  --cpus 32 \
  --memory "64G" \
  --time "0-06:00:00" \
  --log_dir ./logs

ℹ️ DAS Tool Processing Notes

  • Score threshold: 0.5 (can be adjusted with --score_threshold)
  • Automatically handles missing or empty bin directories
  • Creates TSV mapping files for each binning method
  • Continues pipeline even if DAS Tool fails due to low-quality bins

Step 5: Quality Assessment & Dereplication

Step 5a: Quality Assessment with CheckM
What it does:
  • Runs CheckM lineage workflow to assess genome quality
  • Estimates completeness using lineage-specific marker genes
  • Calculates contamination based on duplicated markers
  • Generates detailed quality metrics and visualization plots
CheckM Workflow Details:
  • Method: lineage_wf (lineage-specific workflow)
  • Marker set: Automatically selected based on taxonomic placement
  • Extension: .fa files
Key Outputs:
  • CheckM Raw Outputs:
    • lineage.ms - Marker lineage information
    • storage/bin_stats_ext.tsv - Raw CheckM statistics
    • checkm_output.log - CheckM execution log
  • Processed Outputs:
    • cleaned_checkm_output.csv - Formatted quality metrics table
    • plots/completeness_contamination_plot.pdf - Bubble plot (size = genome size)
    • plots/genome_quality_plot.pdf - Quality categories with histograms
    • plots/scaffolds_histogram.pdf - Scaffold count distribution

ℹ️ CheckM Processing Notes

  • CheckM lineage_wf is used (not qa or taxonomy_wf)
  • Automatically processes bin_stats_ext.tsv to create cleaned CSV
  • Skips execution if all outputs exist (delete output dir to re-run)
  • Quality categories:
    • Near Complete: ≥90% complete, ≤5% contamination
    • Medium Quality: ≥70% complete, ≤10% contamination
    • Partial: ≥50% complete, ≤4% contamination
Run CheckM on dRep output (default):
bash
# Runs CheckM on dereplicated genomes from dRep
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps evaluation \
  --batch_size 50 \
  --cpus 32 \
  --memory "64G" \
  --time "0-06:00:00" \
  --log_dir ./logs
Run CheckM on custom genome directory:
bash
# Evaluate genomes from a custom directory
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps evaluation \
  --evaluation_input_dir /path/to/custom/genomes \
  --batch_size 50 \
  --cpus 32 \
  --memory "64G" \
  --time "0-06:00:00" \
  --log_dir ./logs
Step 5b: Dereplication with dRep
What it does:
  • Removes redundant MAGs across samples
  • Clusters genomes based on ANI similarity
  • Selects representative genome from each cluster
Key Outputs:
  • dereplicated_genomes/ - Non-redundant MAG set
  • data_tables/genomeInfo.csv - Quality and clustering information
  • passing_genomes.csv - List of genomes passing QC (if recovery mode activated)

ℹ️ dRep Configuration (Hardcoded Parameters)

  • Minimum completeness: 80%
  • Maximum contamination: 10%
  • Strain heterogeneity: 100
  • ANI method: fastANI
  • Secondary clustering ANI: 95%

Automatic Recovery: If dRep fails due to insufficient genomes passing quality thresholds, the pipeline automatically identifies and copies all genomes meeting the thresholds to a 'dereplicated_genomes' directory. A 'passing_genomes.csv' file is generated listing all genomes that passed QC.

Bin Renaming: All bins are automatically renamed with their source prefix (e.g., Sample_001_bin.1.fa for single samples, coassembly_bin.1.fa for co-assembly).

Run CheckM + dRep together:
bash
# Run both CheckM evaluation and dRep dereplication
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps evaluation dRep \
  --batch_size 50 \
  --cpus 32 \
  --memory "64G" \
  --time "0-06:00:00" \
  --log_dir ./logs

Step 6: Taxonomic Classification (GTDB-Tk)

What it does:
  • Assigns taxonomic classifications to MAGs
  • Uses Genome Taxonomy Database (GTDB) for classification
  • Runs classify_wf workflow only (not de novo workflow)
Key Outputs:
  • classify/gtdbtk.bac120.summary.tsv - Bacterial classifications
  • classify/gtdbtk.ar53.summary.tsv - Archaeal classifications
  • Phylogenetic placement information

ℹ️ GTDB-Tk Path Configuration

Input Path (Fixed):

{project_output}/Bin_Refinement/drep/dRep_output/dereplicated_genomes

Output Path (Fixed):

{project_output}/Novel_Mags/gtdbtk

Processing Notes:

  • ANI screening is disabled (--skip_ani_screen) for faster processing
  • Only runs taxonomic classification, not phylogenetic tree inference
  • File extension expected: .fa
  • Automatically creates output directory structure
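
Before submitting GTDB-Tk, it can be worth confirming that the fixed input directory exists and contains genomes; a hedged check assuming $WORK_DIR points at your project output directory:

bash
# Sketch: verify the GTDB-Tk input directory and count .fa genomes
GENOME_DIR="$WORK_DIR/Bin_Refinement/drep/dRep_output/dereplicated_genomes"
ls "$GENOME_DIR"/*.fa 2>/dev/null | wc -l   # should be greater than 0
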
bash
# NO samples file needed
# Paths are automatically determined from project config
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps gtdbtk \
  --batch_size 1 \
  --cpus 64 \
  --memory "200G" \
  --time "0-06:00:00" \
  --log_dir ./logs

⚠️ Memory Requirements

GTDB-Tk requires substantial memory (~200GB) due to the reference database size. Ensure adequate resources are allocated. The tool may fail with insufficient memory without clear error messages.

Step 7: Rumen Reference MAGs Processing (Optional)

ℹ️ For Rumen Microbiome Projects

These steps are specifically designed for rumen microbiome studies. They integrate high-quality reference genomes from major rumen microbiome databases (RGMGC, MGnify, RUG).

Step 7.1: Download Reference MAGs

What it does:
  • Downloads reference MAGs from three major sources:
    • RGMGC: 10,373 genomes from ENA (European Nucleotide Archive)
    • MGnify: 2,729 genomes from cow rumen species catalogue
    • RUG: 4,941 genomes from Rumen Uncultured Genomes dataset
  • Automatically handles download retries and error recovery
  • Processes downloaded files (extracts .gz, standardizes to .fa extension)
  • Total: ~18,043 reference genomes if all downloaded successfully
Requirements:
  • Internet connection for downloading
  • ~50-100GB disk space
  • dataset_file_list.txt and ena_mags_list.tsv files in script directory
Key Outputs:
  • All files standardized with .fa extension in single directory
bash
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps rumen_refmags_download \
  --dataset_dir /path/to/store/reference_mags \
  --batch_size 1 \
  --cpus 4 \
  --memory "16G" \
  --time "0-12:00:00" \
  --log_dir ./logs
ℹ️ Note

The download process checks for existing files and skips already downloaded genomes, making it resumable if interrupted.

Step 7.2: Dereplicate Reference MAGs

What it does:
  • Uses dRep to dereplicate the downloaded reference MAG collection
  • Removes redundant genomes based on ANI thresholds
  • Selects highest quality representative from each cluster
dRep Parameters (Hardcoded):
  • Completeness threshold: 80%
  • Contamination threshold: 10%
  • ANI method: fastANI
  • Secondary clustering: 95% ANI
Key Outputs:
  • dereplicated_genomes/ - Representative genomes
  • data_tables/genomeInfo.csv - Quality and clustering information
bash
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps rumen_refmags_drep \
  --dataset_dir /path/to/downloaded/reference_mags \
  --drep_output_dir /path/to/dereplicated_output \
  --batch_size 1 \
  --cpus 32 \
  --memory "100G" \
  --time "1-00:00:00" \
  --log_dir ./logs

Step 7.3: Classify Reference MAGs with GTDB-Tk

What it does:
  • Runs GTDB-Tk classify_wf on reference MAGs
  • Identifies novel MAGs (without species-level classification)
  • Processes bacterial and archaeal domains separately
  • Can skip GTDB-Tk if already run (--only_novel flag)
Key Outputs:
  • classify/gtdbtk.bac120.summary.tsv - Bacterial classifications
  • classify/gtdbtk.ar53.summary.tsv - Archaeal classifications
  • novel_rumenref_mags/mags/ - Novel MAG files
  • novel_rumenref_mags/taxonomy/ - Novel MAG taxonomy files
bash
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps rumen_refmags_gtdbtk \
  --genome_dir /path/to/dereplicated_genomes \
  --gtdbtk_out_dir /path/to/gtdbtk_output \
  --novel_output_dir /path/to/novel_mags_output \
  --genome_extension ".fa" \
  --batch_size 1 \
  --cpus 64 \
  --memory "200G" \
  --time "0-06:00:00" \
  --log_dir ./logs

# To skip GTDB-Tk if already run (only identify novel MAGs):
# Add --only_novel flag

Step 7.4: Integrate Project MAGs with Reference MAGs

What it does:
  • Combines your project MAGs with reference MAGs
  • Performs co-dereplication of combined set
  • Identifies which project MAGs are truly novel
  • Creates unified non-redundant collection
Key Outputs:
  • Novel_Mags/UniqueProjMags/ - Project-specific unique MAGs
  • Novel_Mags/UniqueRefMags/ - Reference-specific unique MAGs
  • uniqueness_summary.txt - Dereplication statistics
bash
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps rumen_drep \
  --ref_mags_dir /path/to/reference/dereplicated_genomes \
  --batch_size 1 \
  --cpus 32 \
  --memory "100G" \
  --time "0-12:00:00" \
  --log_dir ./logs

Step 8: Novel MAG Processing

Step 8.1: Identify Novel MAGs

What it does:
  • Analyzes GTDB-Tk results to find MAGs without species designation
  • Extracts novel genome candidates from classification
  • Copies novel MAGs to dedicated directory
Key Outputs:
  • Novel_Mags/UniqueMags/ - Candidate novel MAGs
bash
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps identify_novel_mags \
  --custom_gtdbtk_dir /path/to/gtdbtk/results \
  --batch_size 1 \
  --cpus 8 \
  --memory "32G" \
  --time "0-02:00:00" \
  --log_dir ./logs

Step 8.2: Process Novel MAGs (Complete Pipeline)

What it does:
  • Comprehensive pipeline for novel MAG processing
  • Dereplicates against existing repository
  • For rumen data: dereplicates against reference MAGs
  • Adds to MAGs repository
  • Builds/updates Kraken database
Required for Rumen Data:
  • --is_rumen_data flag
  • --rumen_ref_mags_dir (reference MAGs)
  • --rumen_added_mags_dir (previously added MAGs)
Key Outputs:
  • Novel_Mags/filtered_NMAGs/ - Post-repository dereplication
  • Novel_Mags/true_novel_MAGs/ - Final novel MAGs
  • MAGs_Repository/ - Updated repository
  • Kraken_Database/ - Updated Kraken2 database
bash
# For general data:
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps process_novel_mags \
  --kraken_db_path /path/to/kraken_db \
  --merge_mode \
  --batch_size 1 \
  --cpus 64 \
  --memory "128G" \
  --time "0-12:00:00" \
  --log_dir ./logs

# For rumen data:
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps process_novel_mags \
  --is_rumen_data \
  --rumen_ref_mags_dir /path/to/rumen/reference/mags \
  --rumen_added_mags_dir /path/to/rumen/added/mags \
  --kraken_db_path /path/to/kraken_db \
  --merge_mode \
  --batch_size 1 \
  --cpus 64 \
  --memory "128G" \
  --time "0-12:00:00" \
  --log_dir ./logs

⚠️ Merge Mode

--merge_mode: Updates existing Kraken database (recommended)

--no_merge: Creates fresh database (use carefully)

Step 8.3: Add MAGs to Repository

What it does:
  • Adds validated novel MAGs to central repository
  • Maintains collection for future reuse
Key Outputs:
  • MAGs_Repository/ - Updated with new MAGs
bash
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps add_mags_to_repo \
  --batch_size 1 \
  --cpus 8 \
  --memory "16G" \
  --time "0-01:00:00" \
  --log_dir ./logs

Step 8.4: Build Kraken Database

What it does:
  • Creates/updates Kraken2 database with novel MAGs
  • Uses advanced taxonomy resolver for conflicts
  • Builds Bracken database for abundance estimation
Processing Steps:
  1. Placeholder taxa assignment
  2. GTDB to Taxdump conversion
  3. Header renaming for Kraken
  4. Add to library
  5. Build Kraken database
  6. Build Bracken database
  7. Build distribution
Key Outputs:
  • Kraken_Database/taxonomy/ - nodes.dmp, names.dmp, taxid.map
  • Kraken_Database/library/ - MAG sequences
  • Kraken_Database/*.k2d - Database files
  • Bracken database files for abundance estimation
bash
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps build_kraken_db \
  --kraken_db_path /path/to/kraken_db \
  --batch_size 1 \
  --cpus 64 \
  --memory "200G" \
  --time "1-00:00:00" \
  --log_dir ./logs

ℹ️ Required Scripts

The pipeline requires these scripts in the main directory:

  • gtdb_to_taxdump_latest_resolve.py - Advanced taxonomy resolver
  • header_kraken.py - Header renaming script

Recommended Workflow for Rumen Microbiome Data

Download References (7.1) → Dereplicate References (7.2) → Classify References (7.3) → Process Your Samples (Steps 1-6) → Integrate with References (7.4) → Identify Novel MAGs (8.1) → Process Novel MAGs (8.2) → Build Database (8.4) → Continue Pipeline (Steps 9+)

Step 9: Functional Annotation

ℹ️ Comprehensive Functional Analysis Pipeline

The functional annotation pipeline includes three main components: EggNOG annotation for orthology groups, dbCAN for CAZyme identification, and integrated functional analysis with visualization.

Step 9.1: EggNOG Annotation

What it does:
  • Predicts proteins from MAGs using Prodigal (if not already done)
  • Annotates genes with functional information using EggNOG-mapper
  • Uses Diamond for fast alignment (falls back to HMMER if needed)
  • Extracts KO terms, COG categories, and CAZy annotations
  • Creates temporary database in /dev/shm for performance
Key Outputs:
  • eggnog_output/*.emapper.annotations - Raw EggNOG annotations
  • functional_annotations/KOs/ - KEGG Orthology terms
  • functional_annotations/COGs/ - COG categories
  • functional_annotations/CAZy/ - CAZyme annotations
Configuration Requirements:
  • eggnog_mapper path in config
  • eggnog_db_dir path in config
  • prodigal path in config
bash
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps eggnog_annotation \
  --mags_dir /path/to/final_mags \
  --batch_size 1 \
  --cpus 32 \
  --memory "64G" \
  --time "0-12:00:00" \
  --log_dir ./logs

Step 9.2: dbCAN CAZyme Annotation

What it does:
  • Identifies Carbohydrate-Active Enzymes (CAZymes)
  • Predicts proteins if needed (shares with EggNOG)
  • Runs dbCAN2 with HMMER for CAZyme detection
  • Summarizes CAZyme families and categories
  • Creates visualization plots automatically
  • Groups CAZymes by substrate specificity
Key Outputs:
  • dbcan_output/*_dbcan_results/ - dbCAN results per MAG
  • cazyme_annotations/cazyme_counts/ - Count matrices
  • cazyme_annotations/*.xlsx - Excel summaries
  • cazyme_annotations/plots/ - Visualization plots:
    • CAZyme heatmaps
    • Category distributions
    • Substrate-specific groupings
    • Dendrogram clustering
Configuration Requirements:
  • dbcan path in config
  • dbcan_db_dir path in config
  • dbcan_activate conda environment path (optional)
  • prodigal path in config
bash
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps dbcan_annotation \
  --mags_dir /path/to/final_mags \
  --batch_size 1 \
  --cpus 32 \
  --memory "32G" \
  --time "0-06:00:00" \
  --log_dir ./logs

ℹ️ Automatic Visualizations

dbCAN automatically generates multiple plots including heatmaps, category distributions, substrate groupings, and clustering dendrograms. These require matplotlib, seaborn, and pandas to be installed.

Step 9.3: Integrated Functional Analysis

What it does:
  • Integrates results from EggNOG and dbCAN
  • Maps KO terms to KEGG pathway hierarchy
  • Creates comprehensive functional profiles
  • Generates advanced visualizations
  • Produces HTML summary report
  • Optional species-based analysis
Key Outputs:
  • functional_analysis/ - Main analysis directory with:
    • files/ko/ - KO term matrices
    • files/cog/ - COG category matrices
    • files/kegg/ - KEGG pathway mappings
    • plots/ - Various visualization types:
      • COG main category distributions
      • KEGG Sankey diagrams
      • Network visualizations
      • Hierarchical clustered heatmaps
      • Circular plots
      • Treemaps and bubble plots
    • reports/functional_analysis_report.html - Interactive HTML report
Required Files:
  • ko.txt - KEGG database file (optional but recommended)
  • EggNOG annotation results from Step 9.1
bash
# Basic functional analysis
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps functional_analysis \
  --batch_size 1 \
  --cpus 8 \
  --memory "32G" \
  --time "0-02:00:00" \
  --log_dir ./logs

# With KEGG database file for pathway mapping
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps functional_analysis \
  --kegg_db_file /path/to/ko.txt \
  --batch_size 1 \
  --cpus 8 \
  --memory "32G" \
  --time "0-02:00:00" \
  --log_dir ./logs

⚠️ KEGG Database File

The ko.txt file enables KEGG pathway hierarchy mapping. Without it, the analysis will skip pathway-level visualizations but still process COG and KO annotations.

Step 9.4: Advanced Visualizations

Prerequisites:
  • ✅ Must complete Step 9.1 (EggNOG annotation) first
  • ✅ Must complete Step 9.3 (Functional Analysis) first
  • Optional: Step 9.2 (dbCAN) for CAZyme visualizations
What it does:
  • Creates comprehensive publication-quality visualizations
  • Generates two complete sets: "All MAGs" and "Novel MAGs Only"
  • Produces 8+ visualization types automatically:
    • CAZyme heatmaps (clustered and phylum-grouped)
    • COG functional Sankey diagrams
    • KEGG pathway Sankey diagrams
    • KEGG metabolism detailed flows
    • Functional PCA analyses
    • Network visualizations with pie charts
    • Integrated multi-omics views
  • Automatically handles MAG naming convention differences
  • Limits some visualizations to 30-40 MAGs for clarity
Key Outputs:
  • Advanced_visualizations{suffix}/ - Main directory with all MAGs
  • Advanced_visualizations{suffix}/Novel_MAGs_Only/ - Novel MAGs subset
  • Output formats: PDF, PNG, SVG, and interactive HTML
  • Word-optimized versions with larger fonts (*_WORD_COMPACT.png)
Required Python Libraries:
  • matplotlib, seaborn, pandas, numpy
  • scipy, scikit-learn
  • plotly (optional - for interactive plots)
  • networkx (optional - for network analysis)
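
If any of these libraries are missing from the metamag environment, they can usually be installed with pip or conda; a hedged example using the package names listed above:

bash
# Sketch: add the plotting/analysis libraries to the active environment
pip install matplotlib seaborn pandas numpy scipy scikit-learn plotly networkx
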
bash
# Run advanced visualizations
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps advanced_visualizations \
  --batch_size 1 \
  --cpus 8 \
  --memory "32G" \
  --time "0-04:00:00" \
  --log_dir ./logs

# With output suffix for organization
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps advanced_visualizations \
  --viz_output_suffix "_final" \
  --batch_size 1 \
  --cpus 8 \
  --memory "32G" \
  --time "0-04:00:00" \
  --log_dir ./logs

ℹ️ Processing Notes

  • Runs all visualizations automatically in two phases
  • Phase 1: Analyzes all MAGs in the project
  • Phase 2: Creates separate visualizations for novel MAGs only
  • Use --viz_output_suffix to organize multiple runs
  • Large projects may take 2-4 hours to complete

Running Complete Functional Annotation Pipeline

bash
# Run all functional annotation steps sequentially
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps eggnog_annotation dbcan_annotation functional_analysis \
  --mags_dir /path/to/final_mags \
  --kegg_db_file /path/to/ko.txt \
  --batch_size 1 \
  --cpus 32 \
  --memory "64G" \
  --time "0-18:00:00" \
  --log_dir ./logs

Step 10: Abundance Estimation

ℹ️ Comprehensive Abundance Analysis Pipeline

This step performs taxonomic classification and abundance estimation using your custom Kraken2 database (with novel MAGs) and Bracken for accurate abundance quantification.

Prerequisites

Required Before Running:
  • ✅ Host removal completed (Step 2b)
  • ✅ Kraken2 database built (Step 8.4)
  • ✅ Bracken database files exist (built with Step 8.4)

Option 1: Single Sample Kraken Classification

What it does:
  • Runs Kraken2 classification on individual samples
  • Generates per-sample classification reports
  • Simple classification without abundance refinement
Key Outputs:
  • Kraken_Abundance/{sample}_kraken.txt - Raw classifications
  • Kraken_Abundance/{sample}_kreport.txt - Kraken reports
bash
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps kraken_abundance \
  --kraken_db_path /path/to/kraken_db \
  --batch_size 10 \
  --cpus 16 \
  --memory "32G" \
  --time "0-04:00:00" \
  --log_dir ./logs

Option 2: Comprehensive Abundance Estimation (Recommended)

What it does:
  • Runs Kraken2 classification on all samples
  • Refines abundances using Bracken at multiple taxonomic levels
  • Combines results into abundance matrices
  • Generates comprehensive visualizations
  • Creates interactive HTML dashboard
  • Supports smart checkpointing (skips completed steps)
Key Features:
  • Smart Checkpointing: Automatically skips already completed steps
  • Multiple Taxonomic Levels: Analyzes at Species, Genus, etc.
  • Metadata Integration: Group comparisons if metadata provided
  • Automatic Visualizations: 8+ plot types generated
Key Outputs:
  • Kraken_Abundance/ - Raw Kraken2 outputs
  • Bracken_Abundance_{level}/ - Bracken estimates per taxonomic level
  • Merged_Bracken_Outputs/ - Combined abundance matrices
    • merged_bracken_S.txt - Species abundances
    • merged_bracken_G.txt - Genus abundances
  • Abundance_Plots/ - Visualizations
    • Top taxa barplots
    • Abundance heatmaps with clustering
    • PCA analysis plots
    • Alpha diversity metrics (Shannon, Simpson, Richness)
    • Stacked composition plots
    • Group comparison boxplots (if metadata provided)
    • Beta diversity analysis (if metadata provided)
  • abundance_summary.html - Interactive dashboard
Basic Usage (without metadata):
bash
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps abundance_estimation \
  --kraken_db_path /path/to/kraken_db \
  --batch_size 50 \
  --cpus 32 \
  --memory "64G" \
  --time "0-06:00:00" \
  --log_dir ./logs
Advanced Usage (with metadata for group comparisons):
bash
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps abundance_estimation \
  --kraken_db_path /path/to/kraken_db \
  --taxonomic_levels S G F \
  --read_length 150 \
  --threshold 10 \
  --metadata_file /path/to/metadata.csv \
  --batch_size 50 \
  --cpus 32 \
  --memory "64G" \
  --time "0-06:00:00" \
  --log_dir ./logs
Metadata File Format:
csv
Sample,Treatment
Sample_001,Control
Sample_002,Control
Sample_003,Treatment_A
Sample_004,Treatment_A
Sample_005,Treatment_B

⚠️ Important Notes

  • Kraken Database Path: Must point to the database built in Step 8.4
  • Bracken Requirements: Database must have kmer distribution files (database{read_length}mers.kmer_distrib)
  • Input Files: Uses host-removed reads from Host_Removal directory
  • Taxonomic Levels: S=Species, G=Genus, F=Family, O=Order, C=Class, P=Phylum
  • Force Rerun: Delete output directories to force re-execution if needed
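
A quick hedged check that the Bracken distribution matching your --read_length is present before submitting (file name pattern as noted above):

bash
# Sketch: confirm the Bracken k-mer distribution exists for the chosen read length
KRAKEN_DB="/path/to/kraken_db"
READ_LENGTH=150
ls "${KRAKEN_DB}/database${READ_LENGTH}mers.kmer_distrib" \
  || echo "Missing distribution file - rebuild with Step 8.4 (build_kraken_db)"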

Parameters Explained

Configurable Parameters:
  • --kraken_db_path (Required): Path to Kraken2 database
  • --taxonomic_levels: List of levels to analyze (default: S G)
  • --read_length: Read length for Bracken (default: 150)
  • --threshold: Minimum reads for Bracken classification (default: 10)
  • --metadata_file: CSV with Sample and Treatment columns (optional)
Resource Recommendations:
  • CPUs: 16-32 (Kraken2 is highly parallel)
  • Memory: 32-64GB (depends on database size)
  • Time: 4-6 hours for 50 samples

Visualization Examples

Generated Visualizations Include:
  1. Top Taxa Barplot: Shows most abundant species/genera
  2. Abundance Heatmap: Hierarchical clustering of samples and taxa
  3. PCA Analysis: Sample relationships based on composition
  4. Diversity Metrics: Shannon, Simpson, and Richness indices
  5. Stacked Composition: Relative abundances per sample
  6. Group Comparisons: Statistical comparisons between treatments (if metadata)
  7. Beta Diversity: Within vs between group distances (if metadata)

Step 11: Phylogenetic Tree Generation and Visualization

ℹ️ Optional Advanced Analysis

Generate custom phylogenetic trees and create publication-quality visualizations with integrated taxonomic and functional annotations.

Step 11a: Generate Phylogenetic Tree

What it does:
  • Builds custom phylogenetic tree using GTDB-Tk de_novo_wf
  • Uses only user MAGs (excludes GTDB reference genomes)
  • Requires outgroup taxon for proper tree rooting
  • Creates standardized tree files for visualization
Requirements:
  • Must complete GTDB-Tk classification (Step 6) first
  • Requires dereplicated MAGs from dRep
  • Outgroup taxon must be specified
Key Outputs:
  • Phylogeny/gtdbtk/taxonomy2/gtdbtk.tree - Main tree file
  • Phylogeny/gtdbtk/taxonomy2/trees/ - All tree files
  • custom_taxonomy.txt - Taxonomy mapping file
bash
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps mags_tree \
  --mags_dir $WORK_DIR/Bin_Refinement/drep/dRep_output/dereplicated_genomes \
  --outgroup_taxon "p__Firmicutes" \
  --batch_size 1 \
  --cpus 32 \
  --memory "100G" \
  --time "0-06:00:00" \
  --log_dir ./logs

⚠️ Outgroup Selection

Choose an appropriate outgroup taxon based on your data:

  • p__Firmicutes - Common for Bacteroidetes-rich samples
  • c__Bacilli - More specific class-level outgroup
  • p__Proteobacteria - For Firmicutes-dominated samples

Step 11b: Basic Tree Visualization

What it does:
  • Creates basic circular phylogenetic tree visualization
  • Adds taxonomic information as colored rings
  • Generates publication-ready PDF output
Key Outputs:
  • tree_visualization.pdf - Basic circular tree
bash
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps tree_visualization \
  --taxonomy_source $WORK_DIR/Novel_Mags/gtdbtk/gtdbtk.bac120.summary.tsv \
  --batch_size 1 \
  --cpus 8 \
  --memory "32G" \
  --time "0-02:00:00" \
  --log_dir ./logs

Step 11c: Enhanced Tree Visualization with Functional Annotations

What it does:
  • Creates multiple publication-quality tree visualizations
  • Integrates taxonomic, functional, and novel MAG information
  • Generates 4 different visualization types automatically
  • Adds functional annotation as pie charts or rings
Visualization Types Created:
  • Simplified Taxonomic (_simplified_taxonomic.pdf)
    • Circular tree with phylum and genus rings only
    • Novel MAG highlighting with red stars
  • Rectangular Tree (_rectangular_tree.pdf)
    • Rectangular layout with colored taxonomy strips
    • Enhanced fonts for publication quality
  • Multi-functional (_multi_functional.pdf)
    • Circular tree with CAZyme, COG, and KEGG pie charts
    • Multiple annotation rings
  • Enhanced Beautiful (_enhanced_beautiful.pdf)
    • Single functional annotation focus
    • CAZyme category pie charts
Key Outputs:
  • Phylogeny/gtdbtk/taxonomy2/enhanced_visualizations/ - All visualization files
  • enhanced_visualizations_finalrec/ - Rectangular and simplified trees
  • enhanced_visualizations_pie/ - Pie chart visualizations
bash
# Full enhanced visualization with all annotations
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps enhanced_tree_visualization \
  --taxonomy_source $WORK_DIR/Novel_Mags/gtdbtk/gtdbtk.bac120.summary.tsv \
  --novel_mags_file $WORK_DIR/Novel_Mags/novel_mags_list.txt \
  --cazyme_annotations $WORK_DIR/Functional_Annotation/cazyme_summary.xlsx \
  --cog_annotations $WORK_DIR/Functional_Annotation/cog_summary.xlsx \
  --kegg_annotations $WORK_DIR/Functional_Annotation/kegg_summary.xlsx \
  --annotation_type auto \
  --batch_size 1 \
  --cpus 8 \
  --memory "32G" \
  --time "0-02:00:00" \
  --log_dir ./logs

# Simplified version with only CAZyme annotations
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps enhanced_tree_visualization \
  --taxonomy_source $WORK_DIR/Novel_Mags/gtdbtk/gtdbtk.bac120.summary.tsv \
  --cazyme_annotations $WORK_DIR/Functional_Annotation/cazyme_summary.xlsx \
  --batch_size 1 \
  --cpus 8 \
  --memory "32G" \
  --time "0-02:00:00" \
  --log_dir ./logs

ℹ️ Optional Parameters

  • --novel_mags_file: Text file with novel MAG IDs (one per line)
  • --functional_annotations: Path to annotation directory (auto-detected)
  • --annotation_type: auto, eggnog, kegg, or dbcan
  • Can provide any combination of cazyme, cog, and kegg annotations
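
One hedged way to produce the novel MAG ID list, assuming identify_novel_mags copied the novel MAGs to Novel_Mags/UniqueMags/ and that MAG IDs match the .fa basenames:

bash
# Sketch: build novel_mags_list.txt (one MAG ID per line)
ls "$WORK_DIR"/Novel_Mags/UniqueMags/*.fa \
  | xargs -n1 basename \
  | sed 's/\.fa$//' > "$WORK_DIR"/Novel_Mags/novel_mags_list.txt
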
🔧 R Environment Requirements

The visualization steps require R with specific packages:

  • Required: ape, RColorBrewer, jsonlite
  • Optional (enhanced): ggtree, ggplot2

Install R packages:

R
install.packages(c("ape", "RColorBrewer", "jsonlite"))
# For enhanced visualizations (optional):
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ggtree")

💡 Best Practices

  • Always run GTDB-Tk classification before tree generation
  • Use dereplicated MAGs for cleaner trees
  • Choose outgroup taxon from a distantly related phylum
  • Run functional annotation steps before enhanced visualization
  • Create novel_mags_list.txt from identify_novel_mags output
  • Use higher memory (32-64GB) for large trees (>100 MAGs)
bash
# Optional: Phylogenetic tree generation (Step 11a)
echo "Generating phylogenetic tree..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps mags_tree \
  --mags_dir $WORK_DIR/Bin_Refinement/drep/dRep_output/dereplicated_genomes \
  --outgroup_taxon "p__Firmicutes" \
  --batch_size 1 \
  --cpus 32 \
  --memory "100G" \
  --time "0-06:00:00" \
  --log_dir $LOG_DIR

echo "Creating enhanced tree visualizations..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps enhanced_tree_visualization \
  --taxonomy_source $WORK_DIR/Novel_Mags/gtdbtk/gtdbtk.bac120.summary.tsv \
  --novel_mags_file $WORK_DIR/Novel_Mags/novel_mags_list.txt \
  --cazyme_annotations $WORK_DIR/Functional_Annotation/cazyme_summary.xlsx \
  --batch_size 1 \
  --cpus 8 \
  --memory "32G" \
  --time "0-02:00:00" \
  --log_dir $LOG_DIR

Complete Pipeline Script

ℹ️ Comprehensive Pipeline Script

This script includes all pipeline steps with proper parameters. Uncomment the sections you need and adjust paths/parameters for your project.

bash
#!/bin/bash
#SBATCH -p ghpc
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --mem=200G
#SBATCH -t 15-00:00:00
#SBATCH --output=pipeline_complete_%j.out
#SBATCH --error=pipeline_complete_%j.err

# ============================================================================
# Complete MetaMAG Explorer Pipeline
# Version: 1.0.0
# ============================================================================

# CRITICAL: Set environment
CONDA_ENV="/your/path/miniconda3/bin/activate metamag"
METAMAG_DIR="/path/to/MetaMAG-1.0.0"
cd $METAMAG_DIR
source $CONDA_ENV

# === Core Variables (MODIFY THESE) ===
WORK_DIR="/path/to/project"
SAMPLES="/path/to/samples.txt"
PROJECT_CONFIG="/path/to/project_config.yaml"
REFERENCE="/path/to/host_genome.fa"
LOG_DIR="$WORK_DIR/logs"
mkdir -p $LOG_DIR

# === Database Paths ===
KRAKEN_DB="/path/to/kraken_db"
KEGG_DB="/path/to/ko.txt"
GTDBTK_DB="/path/to/gtdb/release"

# === Optional: Rumen-specific paths ===
RUMEN_REF_DIR="/path/to/rumen/reference_mags"
RUMEN_ADDED_DIR="/path/to/rumen/added_mags"
DATASET_DIR="/path/to/store/reference_mags"

echo "=========================================="
echo "Starting MetaMAG Explorer Pipeline"
echo "Project: $WORK_DIR"
echo "Date: $(date)"
echo "=========================================="

# ============================================================================
# PHASE 1: QUALITY CONTROL & PREPROCESSING
# ============================================================================

# Step 1.1: Quality Control
echo "[$(date)] Step 1.1: Running QC..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps qc \
  --batch_size 20 \
  --cpus 4 \
  --memory "8G" \
  --time "0-02:00:00" \
  --log_dir $LOG_DIR

# Step 1.2: MultiQC Report
echo "[$(date)] Step 1.2: Running MultiQC..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps multiqc \
  --input_dir $WORK_DIR/QC \
  --batch_size 1 \
  --cpus 2 \
  --memory "4G" \
  --time "0-00:30:00" \
  --log_dir $LOG_DIR

# Step 2.1: Trimming
echo "[$(date)] Step 2.1: Running trimming..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps trimming \
  --batch_size 10 \
  --cpus 8 \
  --memory "16G" \
  --time "0-04:00:00" \
  --log_dir $LOG_DIR

# Step 2.2: Host Removal
echo "[$(date)] Step 2.2: Running host removal..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps host_removal \
  --reference $REFERENCE \
  --batch_size 10 \
  --cpus 16 \
  --memory "32G" \
  --time "0-06:00:00" \
  --log_dir $LOG_DIR

# ============================================================================
# PHASE 2: ASSEMBLY
# ============================================================================

# Step 3.1: Single Assembly (IDBA-UD)
echo "[$(date)] Step 3.1: Running single assembly..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps single_assembly \
  --batch_size 5 \
  --cpus 32 \
  --memory "100G" \
  --time "1-00:00:00" \
  --log_dir $LOG_DIR

# Step 3.2: Co-Assembly (MEGAHIT)
echo "[$(date)] Step 3.2: Running co-assembly..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps co_assembly \
  --batch_size 50 \
  --cpus 64 \
  --memory "200G" \
  --time "2-00:00:00" \
  --log_dir $LOG_DIR

# ============================================================================
# PHASE 3: BINNING & REFINEMENT
# ============================================================================

# Step 4.1: Binning
echo "[$(date)] Step 4.1: Running binning..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps single_binning \
  --binning_methods metabat2 maxbin2 concoct \
  --batch_size 10 \
  --cpus 32 \
  --memory "64G" \
  --time "0-12:00:00" \
  --log_dir $LOG_DIR

# Step 4.2: Bin Refinement with DAS Tool
echo "[$(date)] Step 4.2: Running bin refinement..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps single_bin_refinement \
  --score_threshold 0.5 \
  --batch_size 10 \
  --cpus 32 \
  --memory "64G" \
  --time "0-06:00:00" \
  --log_dir $LOG_DIR

# ============================================================================
# PHASE 4: QUALITY ASSESSMENT & DEREPLICATION
# ============================================================================

# Step 5.1: Quality Assessment with CheckM
echo "[$(date)] Step 5.1: Running CheckM evaluation..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps evaluation \
  --batch_size 50 \
  --cpus 32 \
  --memory "64G" \
  --time "0-06:00:00" \
  --log_dir $LOG_DIR

# Step 5.2: Dereplication with dRep
echo "[$(date)] Step 5.2: Running dRep..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps dRep \
  --batch_size 50 \
  --cpus 32 \
  --memory "64G" \
  --time "0-06:00:00" \
  --log_dir $LOG_DIR

# ============================================================================
# PHASE 5: TAXONOMIC CLASSIFICATION
# ============================================================================

# Step 6: GTDB-Tk Classification
echo "[$(date)] Step 6: Running GTDB-Tk..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps gtdbtk \
  --batch_size 1 \
  --cpus 64 \
  --memory "200G" \
  --time "0-06:00:00" \
  --log_dir $LOG_DIR

# ============================================================================
# PHASE 6: RUMEN REFERENCE MAGs (OPTIONAL - for rumen microbiome projects)
# ============================================================================

# # Step 7.1: Download Rumen Reference MAGs
# echo "[$(date)] Step 7.1: Downloading rumen reference MAGs..."
# python3 -m MetaMAG.main \
#   --project_config $PROJECT_CONFIG \
#   --steps rumen_refmags_download \
#   --dataset_dir $DATASET_DIR \
#   --batch_size 1 \
#   --cpus 4 \
#   --memory "16G" \
#   --time "0-12:00:00" \
#   --log_dir $LOG_DIR

# # Step 7.2: Dereplicate Reference MAGs
# echo "[$(date)] Step 7.2: Dereplicating reference MAGs..."
# python3 -m MetaMAG.main \
#   --project_config $PROJECT_CONFIG \
#   --steps rumen_refmags_drep \
#   --dataset_dir $DATASET_DIR \
#   --drep_output_dir $RUMEN_REF_DIR/drep_output \
#   --batch_size 1 \
#   --cpus 32 \
#   --memory "100G" \
#   --time "1-00:00:00" \
#   --log_dir $LOG_DIR

# # Step 7.3: Classify Reference MAGs
# echo "[$(date)] Step 7.3: Classifying reference MAGs..."
# python3 -m MetaMAG.main \
#   --project_config $PROJECT_CONFIG \
#   --steps rumen_refmags_gtdbtk \
#   --genome_dir $RUMEN_REF_DIR/drep_output/dereplicated_genomes \
#   --gtdbtk_out_dir $RUMEN_REF_DIR/gtdbtk_output \
#   --novel_output_dir $RUMEN_REF_DIR/novel_mags \
#   --genome_extension ".fa" \
#   --batch_size 1 \
#   --cpus 64 \
#   --memory "200G" \
#   --time "0-06:00:00" \
#   --log_dir $LOG_DIR

# # Step 7.4: Integrate with Project MAGs
# echo "[$(date)] Step 7.4: Integrating with reference MAGs..."
# python3 -m MetaMAG.main \
#   --project_config $PROJECT_CONFIG \
#   --samples-file $SAMPLES \
#   --steps rumen_drep \
#   --ref_mags_dir $RUMEN_REF_DIR/drep_output/dereplicated_genomes \
#   --batch_size 1 \
#   --cpus 32 \
#   --memory "100G" \
#   --time "0-12:00:00" \
#   --log_dir $LOG_DIR

# ============================================================================
# PHASE 7: NOVEL MAG PROCESSING
# ============================================================================

# Step 8.1: Identify Novel MAGs
echo "[$(date)] Step 8.1: Identifying novel MAGs..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps identify_novel_mags \
  --custom_gtdbtk_dir $WORK_DIR/Novel_Mags/gtdbtk \
  --batch_size 1 \
  --cpus 8 \
  --memory "32G" \
  --time "0-02:00:00" \
  --log_dir $LOG_DIR

# Step 8.2: Process Novel MAGs (General data)
echo "[$(date)] Step 8.2: Processing novel MAGs..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps process_novel_mags \
  --kraken_db_path $KRAKEN_DB \
  --merge_mode \
  --batch_size 1 \
  --cpus 64 \
  --memory "128G" \
  --time "0-12:00:00" \
  --log_dir $LOG_DIR

# # For rumen data, use this instead:
# python3 -m MetaMAG.main \
#   --project_config $PROJECT_CONFIG \
#   --steps process_novel_mags \
#   --is_rumen_data \
#   --rumen_ref_mags_dir $RUMEN_REF_DIR \
#   --rumen_added_mags_dir $RUMEN_ADDED_DIR \
#   --kraken_db_path $KRAKEN_DB \
#   --merge_mode \
#   --batch_size 1 \
#   --cpus 64 \
#   --memory "128G" \
#   --time "0-12:00:00" \
#   --log_dir $LOG_DIR

# Step 8.3: Add MAGs to Repository
echo "[$(date)] Step 8.3: Adding MAGs to repository..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps add_mags_to_repo \
  --batch_size 1 \
  --cpus 8 \
  --memory "16G" \
  --time "0-01:00:00" \
  --log_dir $LOG_DIR

# Step 8.4: Build/Update Kraken Database
echo "[$(date)] Step 8.4: Building Kraken database..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps build_kraken_db \
  --kraken_db_path $KRAKEN_DB \
  --batch_size 1 \
  --cpus 64 \
  --memory "200G" \
  --time "1-00:00:00" \
  --log_dir $LOG_DIR

# ============================================================================
# PHASE 8: FUNCTIONAL ANNOTATION
# ============================================================================

# Step 9.1: EggNOG Annotation
echo "[$(date)] Step 9.1: Running EggNOG annotation..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps eggnog_annotation \
  --mags_dir $WORK_DIR/Bin_Refinement/drep/dRep_output/dereplicated_genomes \
  --batch_size 1 \
  --cpus 32 \
  --memory "64G" \
  --time "0-12:00:00" \
  --log_dir $LOG_DIR

# Step 9.2: dbCAN CAZyme Annotation
echo "[$(date)] Step 9.2: Running dbCAN annotation..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps dbcan_annotation \
  --mags_dir $WORK_DIR/Bin_Refinement/drep/dRep_output/dereplicated_genomes \
  --batch_size 1 \
  --cpus 32 \
  --memory "32G" \
  --time "0-06:00:00" \
  --log_dir $LOG_DIR

# Step 9.3: Integrated Functional Analysis
echo "[$(date)] Step 9.3: Running functional analysis..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps functional_analysis \
  --kegg_db_file $KEGG_DB \
  --batch_size 1 \
  --cpus 8 \
  --memory "32G" \
  --time "0-02:00:00" \
  --log_dir $LOG_DIR

# Step 9.4: Advanced Visualizations
echo "[$(date)] Step 9.4: Creating advanced visualizations..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps advanced_visualizations \
  --viz_output_suffix "_final" \
  --batch_size 1 \
  --cpus 8 \
  --memory "32G" \
  --time "0-04:00:00" \
  --log_dir $LOG_DIR

# ============================================================================
# PHASE 9: ABUNDANCE ESTIMATION
# ============================================================================

# Step 10: Comprehensive Abundance Estimation
echo "[$(date)] Step 10: Running abundance estimation..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --samples-file $SAMPLES \
  --steps abundance_estimation \
  --kraken_db_path $KRAKEN_DB \
  --taxonomic_levels S G F \
  --read_length 150 \
  --threshold 10 \
  --batch_size 50 \
  --cpus 32 \
  --memory "64G" \
  --time "0-06:00:00" \
  --log_dir $LOG_DIR

# # With metadata for group comparisons:
# python3 -m MetaMAG.main \
#   --project_config $PROJECT_CONFIG \
#   --samples-file $SAMPLES \
#   --steps abundance_estimation \
#   --kraken_db_path $KRAKEN_DB \
#   --taxonomic_levels S G F \
#   --metadata_file $WORK_DIR/metadata.csv \
#   --batch_size 50 \
#   --cpus 32 \
#   --memory "64G" \
#   --time "0-06:00:00" \
#   --log_dir $LOG_DIR

# ============================================================================
# PHASE 10: PHYLOGENETIC ANALYSIS (OPTIONAL)
# ============================================================================

# Step 11.1: Generate Phylogenetic Tree
echo "[$(date)] Step 11.1: Generating phylogenetic tree..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps mags_tree \
  --mags_dir $WORK_DIR/Bin_Refinement/drep/dRep_output/dereplicated_genomes \
  --outgroup_taxon "p__Firmicutes" \
  --batch_size 1 \
  --cpus 32 \
  --memory "100G" \
  --time "0-06:00:00" \
  --log_dir $LOG_DIR

# Step 11.2: Basic Tree Visualization
echo "[$(date)] Step 11.2: Creating basic tree visualization..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps tree_visualization \
  --taxonomy_source $WORK_DIR/Novel_Mags/gtdbtk/gtdbtk.bac120.summary.tsv \
  --batch_size 1 \
  --cpus 8 \
  --memory "32G" \
  --time "0-02:00:00" \
  --log_dir $LOG_DIR

# Step 11.3: Enhanced Tree Visualization with Annotations
echo "[$(date)] Step 11.3: Creating enhanced tree visualizations..."
python3 -m MetaMAG.main \
  --project_config $PROJECT_CONFIG \
  --steps enhanced_tree_visualization \
  --taxonomy_source $WORK_DIR/Novel_Mags/gtdbtk/gtdbtk.bac120.summary.tsv \
  --novel_mags_file $WORK_DIR/Novel_Mags/novel_mags_list.txt \
  --cazyme_annotations $WORK_DIR/Functional_Annotation/cazyme_summary.xlsx \
  --cog_annotations $WORK_DIR/Functional_Annotation/cog_summary.xlsx \
  --kegg_annotations $WORK_DIR/Functional_Annotation/kegg_summary.xlsx \
  --annotation_type auto \
  --batch_size 1 \
  --cpus 8 \
  --memory "32G" \
  --time "0-02:00:00" \
  --log_dir $LOG_DIR

# ============================================================================
# PIPELINE COMPLETE
# ============================================================================

echo "=========================================="
echo "MetaMAG Explorer Pipeline Complete!"
echo "End time: $(date)"
echo "Output directory: $WORK_DIR"
echo ""
echo "Key outputs:"
echo "  - QC reports: $WORK_DIR/QC"
echo "  - Assemblies: $WORK_DIR/Assembly"
echo "  - MAGs: $WORK_DIR/Bin_Refinement/drep/dRep_output/dereplicated_genomes"
echo "  - Taxonomy: $WORK_DIR/Novel_Mags/gtdbtk"
echo "  - Annotations: $WORK_DIR/Functional_Annotation"
echo "  - Abundances: $WORK_DIR/Kraken_Abundance"
echo "  - Visualizations: $WORK_DIR/Advanced_visualizations_final"
echo "=========================================="

✅ Script Usage Tips

  • Comment out steps you don't need (use # at the beginning of lines)
  • Uncomment optional sections like rumen processing if needed
  • Adjust SLURM parameters (#SBATCH lines) based on your cluster
  • Modify resource allocations (cpus, memory, time) per step as needed
  • Check that all database paths are correctly set before running
  • Consider running phases separately for better monitoring

💡 Recommended Workflow Execution

  1. Phase 1-2: QC, Preprocessing, Assembly (1-2 days)
  2. Phase 3-4: Binning, Quality Assessment (1 day)
  3. Phase 5-7: Taxonomy, Novel MAGs (1 day)
  4. Phase 8: Functional Annotation (1 day)
  5. Phase 9-10: Abundance, Phylogenetics (1 day)

Total estimated time: 5-7 days for a complete run with 50-100 samples

Troubleshooting

💡 Common Troubleshooting Steps

  • Check log files in $LOG_DIR
  • Verify input data quality
  • Ensure all dependencies are installed
  • Adjust computational resources as needed
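
When a step fails, a simple hedged starting point is to scan the most recent logs and SLURM error files for obvious failures:

bash
# Sketch: surface recent errors from pipeline and SLURM logs
grep -riE "error|failed|traceback" "$LOG_DIR" 2>/dev/null | tail -n 50
ls -t *.err 2>/dev/null | head -n 1 | xargs -r tail -n 50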