Quick Start: Run the Pipeline
If you're looking for the fastest way to execute the pipeline, use the provided run.sh script.
It is preconfigured to handle most parameters; you only need to update your file paths and step names.
What You Need To Do:
- Open run.sh in a text editor.
- Update these key variables (a sketch of this block appears after these steps):
- WORK_DIR -- your project directory
- SAMPLES -- path to your sample list file
- PROJECT_CONFIG -- your YAML config
- STEPS -- choose which step(s) to run
- CONDA_ENV -- path to your conda environment
- Other relevant paths (e.g., reference genome, Kraken DB)
- Submit the job to SLURM:
```bash
sbatch run.sh
```
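For orientation, the user-editable block near the top of run.sh will look roughly like the following. This is a sketch based on the variable list above and the full pipeline script later in this guide; confirm the exact variable names and defaults against your own copy of run.sh.

```bash
# Sketch of the user-editable block in run.sh (names/paths are illustrative).
WORK_DIR="/path/to/project"                     # your project directory
SAMPLES="/path/to/samples.txt"                  # sample list file
PROJECT_CONFIG="/path/to/project_config.yaml"   # your YAML config
STEPS="qc"                                      # step(s) to run, e.g. "qc" or "trimming host_removal"
CONDA_ENV="/path/to/conda/bin/activate metamag" # conda activation path
REFERENCE="/path/to/host_genome.fa"             # host reference (for host_removal)
KRAKEN_DB="/path/to/kraken_db"                  # Kraken2 database path
```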
Note
Alternatively, if you'd like to run steps in a more customized or incremental way, refer to the detailed instructions below.
SLURM Setup Example
Here's a complete example of a SLURM submission script with all necessary components:
```bash
#!/bin/bash
#SBATCH -p ghpc                          # Partition/queue name
#SBATCH -N 1                             # Number of nodes
#SBATCH -n 1                             # Number of tasks
#SBATCH --mem=10G                        # Memory allocation
#SBATCH -t 2:30:00                       # Time limit (HH:MM:SS)
#SBATCH --output=pipeline_meta_%j.out    # Standard output log
#SBATCH --error=pipeline_meta_%j.err     # Standard error log

# CRITICAL: Set your conda environment path (adjust to YOUR installation)
CONDA_ENV="/path/conda/activate metamag"

# Navigate to MetaMAG directory
cd /path/MetaMAG-1.0.0

# Activate conda environment
source $CONDA_ENV

# Run the pipeline with specified step
python3 -m MetaMAG.main \
    --project_config plant_project_config.yaml \
    --samples-file samples.txt \
    --batch_size 16 \
    --cpus 1 \
    --memory 20G \
    --time 15-12:00:00 \
    --log_dir ./logs/plant/qc \
    --steps qc
```
Adjust SLURM Parameters Based on Your Step
Different pipeline steps require different resources:
- QC: Low resources (4 CPUs, 8G memory, 2 hours)
- Trimming: Medium resources (8 CPUs, 16G memory, 4 hours)
- Assembly: High resources (32-64 CPUs, 100-200G memory, 1-2 days)
- Binning: High resources (32 CPUs, 64G memory, 12 hours)
- GTDB-Tk: Very high memory (64 CPUs, 200G memory, 6 hours)
Understanding Batch Size Parameter
Critical: How Batch Size Works
The --batch_size parameter determines how many samples are processed together in a single job; the sketch after these examples shows how to estimate the resulting job count.
- If you have 20 samples and set --batch_size 20 → 1 job processes all 20 samples
- If you have 20 samples and set --batch_size 10 → 2 parallel jobs, each processing 10 samples
- If you have 20 samples and set --batch_size 1 → 20 parallel jobs, each processing 1 sample
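In other words, the number of jobs is the sample count divided by the batch size, rounded up. A minimal shell sketch of that arithmetic (variable names here are illustrative, not pipeline options):

```bash
# Illustrative only: ceiling division to estimate how many parallel jobs
# a given batch size will produce for your sample list.
N_SAMPLES=$(wc -l < samples.txt)   # assumes one sample name per line
BATCH_SIZE=10
N_JOBS=$(( (N_SAMPLES + BATCH_SIZE - 1) / BATCH_SIZE ))
echo "Samples: $N_SAMPLES  Batch size: $BATCH_SIZE  Parallel jobs: $N_JOBS"
```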
Steps That Must Run as Single Batch
The following steps process all data together and should always use a batch_size equal to your total sample count:
Single Batch Steps (set batch_size = total samples)
- multiqc - Combines all QC reports into one
- co_assembly - Combines all samples for assembly
- evaluation - Evaluates all MAGs together with CheckM
- dRep - Dereplicates all MAGs as a set
- gtdbtk - Classifies all MAGs in one run
- all Novel MAG steps - Process novel MAGs as a complete set
- all Functional Annotation steps - Annotate all MAGs together
- abundance_estimation - Processes all samples in a single workflow
- all Phylogenetic Tree steps - Build tree from all MAGs
- all Visualization steps - Generate plots from complete data
Steps That Can Be Parallelized
These steps process individual samples and benefit from smaller batch sizes for parallel execution:
Parallelizable Steps (adjust batch_size for parallelization)
- qc - Each sample's QC can run independently
- trimming - Each sample trimmed separately
- host_removal - Each sample processed independently
- single_assembly - Each sample assembled separately (use batch_size 1-2 due to high memory)
- single_binning - Each sample's assembly binned separately
- single_bin_refinement - Each sample's bins refined separately
- kraken_abundance - Each sample classified independently (but NOT abundance_estimation)
Recommended Batch Size Strategy
General Guidelines
For a project with N samples:
- Light steps (QC, trimming): batch_size = N/4 to N/2 (run 2-4 parallel jobs)
- Memory-intensive steps (assembly, binning): batch_size = 1 to 2 (maximize parallel jobs)
- Single-batch steps: batch_size = N or higher (run as one job)
- Cluster limits: Consider your cluster's max job submission limit
Memory Consideration
Remember: Lower batch_size means more parallel jobs but also more total memory usage across all jobs. For memory-intensive steps like assembly, use batch_size 1-2 even if it means many parallel jobs.
File Naming Requirements
Input File Naming Conventions
Files must follow these naming patterns:
- {sample}_1.fastq.gz and {sample}_2.fastq.gz
- {sample}_R1.fastq.gz and {sample}_R2.fastq.gz
- {sample}_forward.fastq.gz and {sample}_reverse.fastq.gz
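Before launching, it can save time to confirm that every sample in your list has a matching read pair. A minimal sketch, assuming the _1/_2 pattern (adjust the suffixes for _R1/_R2 or _forward/_reverse) and that INPUT_DIR matches the input_dir in your config:

```bash
# Illustrative check: report samples whose paired read files are missing,
# assuming the {sample}_1.fastq.gz / {sample}_2.fastq.gz convention.
INPUT_DIR=/path/to/raw/reads
while read -r sample; do
    [ -s "$INPUT_DIR/${sample}_1.fastq.gz" ] && [ -s "$INPUT_DIR/${sample}_2.fastq.gz" ] \
        || echo "Missing reads for: $sample"
done < samples.txt
```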
Configuration Setup
Project Configuration File (ONLY 3 LINES NEEDED)
```yaml
# project_config.yaml - ONLY THESE 3 LINES
input_dir: "/path/to/raw/reads"
output_dir: "/path/to/output"
reference: "/path/to/host/genome.fa"   # Optional for host removal
```
Sample List Format
```text
# samples.txt
Sample_001
Sample_002
Sample_003
```
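If your reads already follow one of the naming patterns above, the sample list can be generated rather than typed by hand. A sketch assuming the _1.fastq.gz suffix (swap the suffix for _R1 or _forward layouts):

```bash
# Illustrative only: derive sample names from forward-read filenames.
INPUT_DIR=/path/to/raw/reads
ls "$INPUT_DIR"/*_1.fastq.gz \
    | xargs -n1 basename \
    | sed 's/_1\.fastq\.gz$//' \
    > samples.txt
```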
Step 1: Quality Control
What it does:
- Runs initial quality assessment on raw sequencing reads
- Generates quality reports using FastQC
Key Outputs:
- Quality control reports
- Sequence quality metrics
Step 1a: Run FastQC
```bash
# CORRECT COMMAND FORMAT
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --samples-file $SAMPLES \
    --steps qc \
    --batch_size 20 \
    --cpus 4 \
    --memory "8G" \
    --time "0-02:00:00" \
    --log_dir ./logs
```
Step 1b: Generate MultiQC Report
```bash
# No samples file needed for MultiQC
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --steps multiqc \
    --input_dir $WORK_DIR/QC \
    --batch_size 1 \
    --cpus 2 \
    --memory "4G" \
    --time "0-00:30:00" \
    --log_dir ./logs
```
Step 2: Preprocessing
Step 2a: Trimming (Fastp)
What it does:
- Removes low-quality bases
- Trims adapter sequences
- Filters out poor-quality reads
Key Outputs:
- Cleaned read files
- Trimming statistics
Hardcoded parameters: Q20 quality, 30bp minimum length
```bash
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --samples-file $SAMPLES \
    --steps trimming \
    --batch_size 10 \
    --cpus 8 \
    --memory "16G" \
    --time "0-04:00:00" \
    --log_dir ./logs
```
Step 2b: Host Removal
What it does:
- Removes host-derived sequences
- Maps reads against reference genome
- Extracts non-host (microbial) reads
Key Outputs:
- Host-free metagenomic reads
- Mapping statistics
```bash
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --samples-file $SAMPLES \
    --steps host_removal \
    --reference $REFERENCE \
    --batch_size 10 \
    --cpus 16 \
    --memory "32G" \
    --time "0-06:00:00" \
    --log_dir ./logs
```
Step 3: Assembly
Step 3a: Single Assembly (IDBA-UD)
What it does:
- Assembles reads for individual samples
- Generates contigs using IDBA-UD assembler
Key Outputs:
- Assembled contigs
- Assembly statistics
```bash
# ALWAYS uses IDBA-UD (no --assembler parameter)
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --samples-file $SAMPLES \
    --steps single_assembly \
    --batch_size 5 \
    --cpus 32 \
    --memory "100G" \
    --time "1-00:00:00" \
    --log_dir ./logs
```
Step 3b: Co-Assembly (MEGAHIT)
What it does:
- Combines reads from multiple samples
- Generates a unified assembly using MEGAHIT
Key Outputs:
- Co-assembled contigs
- Merged assembly statistics
Hardcoded: k-mer 21-141, step 12
```bash
# ALWAYS uses MEGAHIT with k-mer 21-141
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --samples-file $SAMPLES \
    --steps co_assembly \
    --batch_size 50 \
    --cpus 64 \
    --memory "200G" \
    --time "2-00:00:00" \
    --log_dir ./logs
```
Step 4: Binning & Refinement
Step 4a: Binning
What it does:
- Clusters contigs into potential genome bins
- Uses multiple binning algorithms (MetaBAT2, MaxBin2, CONCOCT)
Key Outputs:
- Initial genome bins
- Binning statistics
```bash
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --samples-file $SAMPLES \
    --steps single_binning \
    --binning_methods metabat2 maxbin2 concoct \
    --batch_size 10 \
    --cpus 32 \
    --memory "64G" \
    --time "0-12:00:00" \
    --log_dir ./logs
```
Step 4b: Refinement with DAS Tool
What it does:
- Refines and merges bins from multiple methods
- Improves bin quality using DAS Tool
- Filters bins to process only valid naming patterns
Key Outputs:
- Refined, high-quality genome bins
- DAS Tool quality scores
- Filtered bin directories
```bash
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --samples-file $SAMPLES \
    --steps single_bin_refinement \
    --score_threshold 0.5 \
    --batch_size 10 \
    --cpus 32 \
    --memory "64G" \
    --time "0-06:00:00" \
    --log_dir ./logs
```
DAS Tool Processing Notes
- Score threshold: 0.5 (can be adjusted with --score_threshold)
- Automatically handles missing or empty bin directories
- Creates TSV mapping files for each binning method
- Continues pipeline even if DAS Tool fails due to low-quality bins
Step 5: Quality Assessment & Dereplication
Step 5a: Quality Assessment with CheckM
What it does:
- Runs CheckM lineage workflow to assess genome quality
- Estimates completeness using lineage-specific marker genes
- Calculates contamination based on duplicated markers
- Generates detailed quality metrics and visualization plots
CheckM Workflow Details:
- Method: lineage_wf (lineage-specific workflow)
- Marker set: Automatically selected based on taxonomic placement
- Extension: .fa files
Key Outputs:
- CheckM Raw Outputs:
- lineage.ms - Marker lineage information
- storage/bin_stats_ext.tsv - Raw CheckM statistics
- checkm_output.log - CheckM execution log
- Processed Outputs:
- cleaned_checkm_output.csv - Formatted quality metrics table
- plots/completeness_contamination_plot.pdf - Bubble plot (size = genome size)
- plots/genome_quality_plot.pdf - Quality categories with histograms
- plots/scaffolds_histogram.pdf - Scaffold count distribution
CheckM Processing Notes
- CheckM lineage_wf is used (not qa or taxonomy_wf)
- Automatically processes bin_stats_ext.tsv to create cleaned CSV
- Skips execution if all outputs exist (delete output dir to re-run)
- Quality categories:
- Near Complete: ≥90% complete, ≤5% contamination
- Medium Quality: ≥70% complete, ≤10% contamination
- Partial: ≥50% complete, ≤4% contamination
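If you want to bucket MAGs into these categories yourself, a short awk pass over the cleaned CheckM table works. The sketch below assumes the bin name, completeness, and contamination are in columns 1, 2, and 3 of cleaned_checkm_output.csv; check your file's header first and adjust the column numbers if needed.

```bash
# Illustrative only: tag each bin with a quality category.
# Column positions ($1 = bin, $2 = completeness, $3 = contamination) are an
# assumption -- verify against the header of your cleaned_checkm_output.csv.
# Checks run from loosest to strictest so a later match overrides an earlier one.
awk -F',' 'NR > 1 {
    q = "Other"
    if ($2 >= 50 && $3 <= 4)  q = "Partial"
    if ($2 >= 70 && $3 <= 10) q = "Medium Quality"
    if ($2 >= 90 && $3 <= 5)  q = "Near Complete"
    print $1, q
}' cleaned_checkm_output.csv
```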
Run CheckM on dRep output (default):
```bash
# Runs CheckM on dereplicated genomes from dRep
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --samples-file $SAMPLES \
    --steps evaluation \
    --batch_size 50 \
    --cpus 32 \
    --memory "64G" \
    --time "0-06:00:00" \
    --log_dir ./logs
```
Run CheckM on custom genome directory:
```bash
# Evaluate genomes from a custom directory
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --samples-file $SAMPLES \
    --steps evaluation \
    --evaluation_input_dir /path/to/custom/genomes \
    --batch_size 50 \
    --cpus 32 \
    --memory "64G" \
    --time "0-06:00:00" \
    --log_dir ./logs
```
Step 5b: Dereplication with dRep
What it does:
- Removes redundant MAGs across samples
- Clusters genomes based on ANI similarity
- Selects representative genome from each cluster
Key Outputs:
- dereplicated_genomes/ - Non-redundant MAG set
- data_tables/genomeInfo.csv - Quality and clustering information
- passing_genomes.csv - List of genomes passing QC (if recovery mode activated)
dRep Configuration (Hardcoded Parameters)
- Minimum completeness: 80%
- Maximum contamination: 10%
- Strain heterogeneity: 100
- ANI method: fastANI
- Secondary clustering ANI: 95%
Automatic Recovery: If dRep fails due to insufficient genomes passing quality thresholds, the pipeline automatically identifies and copies all genomes meeting the thresholds to a 'dereplicated_genomes' directory. A 'passing_genomes.csv' file is generated listing all genomes that passed QC.
Bin Renaming: All bins are automatically renamed with their source prefix (e.g., Sample_001_bin.1.fa for single samples, coassembly_bin.1.fa for co-assembly).
Run CheckM + dRep together:
```bash
# Run both CheckM evaluation and dRep dereplication
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --samples-file $SAMPLES \
    --steps evaluation dRep \
    --batch_size 50 \
    --cpus 32 \
    --memory "64G" \
    --time "0-06:00:00" \
    --log_dir ./logs
```
Step 6: Taxonomic Classification (GTDB-Tk)
What it does:
- Assigns taxonomic classifications to MAGs
- Uses Genome Taxonomy Database (GTDB) for classification
- Runs classify_wf workflow only (not de novo workflow)
Key Outputs:
- classify/gtdbtk.bac120.summary.tsv - Bacterial classifications
- classify/gtdbtk.ar53.summary.tsv - Archaeal classifications
- Phylogenetic placement information
GTDB-Tk Path Configuration
Input Path (Fixed):
{project_output}/Bin_Refinement/drep/dRep_output/dereplicated_genomes
Output Path (Fixed):
{project_output}/Novel_Mags/gtdbtk
Processing Notes:
- ANI screening is disabled (--skip_ani_screen) for faster processing
- Only runs taxonomic classification, not phylogenetic tree inference
- File extension expected: .fa
- Automatically creates output directory structure
```bash
# NO samples file needed
# Paths are automatically determined from project config
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --steps gtdbtk \
    --batch_size 1 \
    --cpus 64 \
    --memory "200G" \
    --time "0-06:00:00" \
    --log_dir ./logs
```
Memory Requirements
GTDB-Tk requires substantial memory (~200GB) due to the reference database size. Ensure adequate resources are allocated. The tool may fail with insufficient memory without clear error messages.
Step 7: Rumen Reference MAGs Processing (Optional)
For Rumen Microbiome Projects
These steps are specifically designed for rumen microbiome studies. They integrate high-quality reference genomes from major rumen microbiome databases (RGMGC, MGnify, RUG).
Step 7.1: Download Reference MAGs
What it does:
- Downloads reference MAGs from three major sources:
- RGMGC: 10,373 genomes from ENA (European Nucleotide Archive)
- MGnify: 2,729 genomes from cow rumen species catalogue
- RUG: 4,941 genomes from Rumen Uncultured Genomes dataset
- Automatically handles download retries and error recovery
- Processes downloaded files (extracts .gz, standardizes to .fa extension)
- Total: ~18,043 reference genomes if all downloaded successfully
Requirements:
- Internet connection for downloading
- ~50-100GB disk space
- dataset_file_list.txt and ena_mags_list.tsv files in script directory
Key Outputs:
- All files standardized with .fa extension in single directory
```bash
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --steps rumen_refmags_download \
    --dataset_dir /path/to/store/reference_mags \
    --batch_size 1 \
    --cpus 4 \
    --memory "16G" \
    --time "0-12:00:00" \
    --log_dir ./logs
```
The download process checks for existing files and skips already downloaded genomes, making it resumable if interrupted.
Step 7.2: Dereplicate Reference MAGs
What it does:
- Uses dRep to dereplicate the downloaded reference MAG collection
- Removes redundant genomes based on ANI thresholds
- Selects highest quality representative from each cluster
dRep Parameters (Hardcoded):
- Completeness threshold: 80%
- Contamination threshold: 10%
- ANI method: fastANI
- Secondary clustering: 95% ANI
Key Outputs:
- dereplicated_genomes/ - Representative genomes
- data_tables/genomeInfo.csv - Quality and clustering information
```bash
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --steps rumen_refmags_drep \
    --dataset_dir /path/to/downloaded/reference_mags \
    --drep_output_dir /path/to/dereplicated_output \
    --batch_size 1 \
    --cpus 32 \
    --memory "100G" \
    --time "1-00:00:00" \
    --log_dir ./logs
```
Step 7.3: Classify Reference MAGs with GTDB-Tk
What it does:
- Runs GTDB-Tk classify_wf on reference MAGs
- Identifies novel MAGs (without species-level classification)
- Processes bacterial and archaeal domains separately
- Can skip GTDB-Tk if already run (--only_novel flag)
Key Outputs:
- classify/gtdbtk.bac120.summary.tsv - Bacterial classifications
- classify/gtdbtk.ar53.summary.tsv - Archaeal classifications
- novel_rumenref_mags/mags/ - Novel MAG files
- novel_rumenref_mags/taxonomy/ - Novel MAG taxonomy files
```bash
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --steps rumen_refmags_gtdbtk \
    --genome_dir /path/to/dereplicated_genomes \
    --gtdbtk_out_dir /path/to/gtdbtk_output \
    --novel_output_dir /path/to/novel_mags_output \
    --genome_extension ".fa" \
    --batch_size 1 \
    --cpus 64 \
    --memory "200G" \
    --time "0-06:00:00" \
    --log_dir ./logs

# To skip GTDB-Tk if already run (only identify novel MAGs):
# Add --only_novel flag
```
Step 7.4: Integrate Project MAGs with Reference MAGs
What it does:
- Combines your project MAGs with reference MAGs
- Performs co-dereplication of combined set
- Identifies which project MAGs are truly novel
- Creates unified non-redundant collection
Key Outputs:
- Novel_Mags/UniqueProjMags/ - Project-specific unique MAGs
- Novel_Mags/UniqueRefMags/ - Reference-specific unique MAGs
- uniqueness_summary.txt - Dereplication statistics
```bash
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --samples-file $SAMPLES \
    --steps rumen_drep \
    --ref_mags_dir /path/to/reference/dereplicated_genomes \
    --batch_size 1 \
    --cpus 32 \
    --memory "100G" \
    --time "0-12:00:00" \
    --log_dir ./logs
```
Step 8: Novel MAG Processing
Step 8.1: Identify Novel MAGs
What it does:
- Analyzes GTDB-Tk results to find MAGs without species designation
- Extracts novel genome candidates from classification
- Copies novel MAGs to dedicated directory
Key Outputs:
- Novel_Mags/UniqueMags/ - Candidate novel MAGs
```bash
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --steps identify_novel_mags \
    --custom_gtdbtk_dir /path/to/gtdbtk/results \
    --batch_size 1 \
    --cpus 8 \
    --memory "32G" \
    --time "0-02:00:00" \
    --log_dir ./logs
```
Step 8.2: Process Novel MAGs (Complete Pipeline)
What it does:
- Comprehensive pipeline for novel MAG processing
- Dereplicates against existing repository
- For rumen data: dereplicates against reference MAGs
- Adds to MAGs repository
- Builds/updates Kraken database
Required for Rumen Data:
- --is_rumen_data flag
- --rumen_ref_mags_dir (reference MAGs)
- --rumen_added_mags_dir (previously added MAGs)
Key Outputs:
- Novel_Mags/filtered_NMAGs/ - Post-repository dereplication
- Novel_Mags/true_novel_MAGs/ - Final novel MAGs
- MAGs_Repository/ - Updated repository
- Kraken_Database/ - Updated Kraken2 database
```bash
# For general data:
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --steps process_novel_mags \
    --kraken_db_path /path/to/kraken_db \
    --merge_mode \
    --batch_size 1 \
    --cpus 64 \
    --memory "128G" \
    --time "0-12:00:00" \
    --log_dir ./logs

# For rumen data:
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --steps process_novel_mags \
    --is_rumen_data \
    --rumen_ref_mags_dir /path/to/rumen/reference/mags \
    --rumen_added_mags_dir /path/to/rumen/added/mags \
    --kraken_db_path /path/to/kraken_db \
    --merge_mode \
    --batch_size 1 \
    --cpus 64 \
    --memory "128G" \
    --time "0-12:00:00" \
    --log_dir ./logs
```
Merge Mode
--merge_mode: Updates existing Kraken database (recommended)
--no_merge: Creates fresh database (use carefully)
Step 8.3: Add MAGs to Repository
What it does:
- Adds validated novel MAGs to central repository
- Maintains collection for future reuse
Key Outputs:
- MAGs_Repository/ - Updated with new MAGs
```bash
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --steps add_mags_to_repo \
    --batch_size 1 \
    --cpus 8 \
    --memory "16G" \
    --time "0-01:00:00" \
    --log_dir ./logs
```
Step 8.4: Build Kraken Database
What it does:
- Creates/updates Kraken2 database with novel MAGs
- Uses advanced taxonomy resolver for conflicts
- Builds Bracken database for abundance estimation
Processing Steps:
- Placeholder taxa assignment
- GTDB to Taxdump conversion
- Header renaming for Kraken
- Add to library
- Build Kraken database
- Build Bracken database
- Build distribution
Key Outputs:
- Kraken_Database/taxonomy/ - nodes.dmp, names.dmp, taxid.map
- Kraken_Database/library/ - MAG sequences
- Kraken_Database/*.k2d - Database files
- Bracken database files for abundance estimation
```bash
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --steps build_kraken_db \
    --kraken_db_path /path/to/kraken_db \
    --batch_size 1 \
    --cpus 64 \
    --memory "200G" \
    --time "1-00:00:00" \
    --log_dir ./logs
```
Required Scripts
The pipeline requires these scripts in the main directory:
- gtdb_to_taxdump_latest_resolve.py - Advanced taxonomy resolver
- header_kraken.py - Header renaming script
Recommended Workflow for Rumen Microbiome Data
Download References (7.1) → Dereplicate References (7.2) → Classify References (7.3) → Process Your Samples (Steps 1-6) → Integrate with References (7.4) → Identify Novel MAGs (8.1) → Process Novel MAGs (8.2) → Build Database (8.4) → Continue Pipeline (Steps 9+)
Step 9: Functional Annotation
Comprehensive Functional Analysis Pipeline
The functional annotation pipeline includes three main components: EggNOG annotation for orthology groups, dbCAN for CAZyme identification, and integrated functional analysis with visualization.
Step 9.1: EggNOG Annotation
What it does:
- Predicts proteins from MAGs using Prodigal (if not already done)
- Annotates genes with functional information using EggNOG-mapper
- Uses Diamond for fast alignment (falls back to HMMER if needed)
- Extracts KO terms, COG categories, and CAZy annotations
- Creates temporary database in /dev/shm for performance
Key Outputs:
- eggnog_output/*.emapper.annotations - Raw EggNOG annotations
- functional_annotations/KOs/ - KEGG Orthology terms
- functional_annotations/COGs/ - COG categories
- functional_annotations/CAZy/ - CAZyme annotations
Configuration Requirements:
- eggnog_mapper path in config
- eggnog_db_dir path in config
- prodigal path in config
```bash
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --steps eggnog_annotation \
    --mags_dir /path/to/final_mags \
    --batch_size 1 \
    --cpus 32 \
    --memory "64G" \
    --time "0-12:00:00" \
    --log_dir ./logs
```
Step 9.2: dbCAN CAZyme Annotation
What it does:
- Identifies Carbohydrate-Active Enzymes (CAZymes)
- Predicts proteins if needed (shares with EggNOG)
- Runs dbCAN2 with HMMER for CAZyme detection
- Summarizes CAZyme families and categories
- Creates visualization plots automatically
- Groups CAZymes by substrate specificity
Key Outputs:
- dbcan_output/*_dbcan_results/ - dbCAN results per MAG
- cazyme_annotations/cazyme_counts/ - Count matrices
- cazyme_annotations/*.xlsx - Excel summaries
- cazyme_annotations/plots/ - Visualization plots:
- CAZyme heatmaps
- Category distributions
- Substrate-specific groupings
- Dendrogram clustering
Configuration Requirements:
- dbcan path in config
- dbcan_db_dir path in config
- dbcan_activate conda environment path (optional)
- prodigal path in config
```bash
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --steps dbcan_annotation \
    --mags_dir /path/to/final_mags \
    --batch_size 1 \
    --cpus 32 \
    --memory "32G" \
    --time "0-06:00:00" \
    --log_dir ./logs
```
Automatic Visualizations
dbCAN automatically generates multiple plots including heatmaps, category distributions, substrate groupings, and clustering dendrograms. These require matplotlib, seaborn, and pandas to be installed.
Step 9.3: Integrated Functional Analysis
What it does:
- Integrates results from EggNOG and dbCAN
- Maps KO terms to KEGG pathway hierarchy
- Creates comprehensive functional profiles
- Generates advanced visualizations
- Produces HTML summary report
- Optional species-based analysis
Key Outputs:
- functional_analysis/ - Main analysis directory with:
- files/ko/ - KO term matrices
- files/cog/ - COG category matrices
- files/kegg/ - KEGG pathway mappings
- plots/ - Various visualization types:
- COG main category distributions
- KEGG Sankey diagrams
- Network visualizations
- Hierarchical clustered heatmaps
- Circular plots
- Treemaps and bubble plots
- reports/functional_analysis_report.html - Interactive HTML report
Required Files:
- ko.txt - KEGG database file (optional but recommended)
- EggNOG annotation results from Step 9.1
```bash
# Basic functional analysis
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --steps functional_analysis \
    --batch_size 1 \
    --cpus 8 \
    --memory "32G" \
    --time "0-02:00:00" \
    --log_dir ./logs

# With KEGG database file for pathway mapping
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --steps functional_analysis \
    --kegg_db_file /path/to/ko.txt \
    --batch_size 1 \
    --cpus 8 \
    --memory "32G" \
    --time "0-02:00:00" \
    --log_dir ./logs
```
KEGG Database File
The ko.txt file enables KEGG pathway hierarchy mapping. Without it, the analysis will skip pathway-level visualizations but still process COG and KO annotations.
Step 9.4: Advanced Visualizations
Prerequisites:
- Must complete Step 9.1 (EggNOG annotation) first
- Must complete Step 9.3 (Functional Analysis) first
- Optional: Step 9.2 (dbCAN) for CAZyme visualizations
What it does:
- Creates comprehensive publication-quality visualizations
- Generates two complete sets: "All MAGs" and "Novel MAGs Only"
- Produces 8+ visualization types automatically:
- CAZyme heatmaps (clustered and phylum-grouped)
- COG functional Sankey diagrams
- KEGG pathway Sankey diagrams
- KEGG metabolism detailed flows
- Functional PCA analyses
- Network visualizations with pie charts
- Integrated multi-omics views
- Automatically handles MAG naming convention differences
- Limits some visualizations to 30-40 MAGs for clarity
Key Outputs:
- Advanced_visualizations{suffix}/ - Main directory with all MAGs
- Advanced_visualizations{suffix}/Novel_MAGs_Only/ - Novel MAGs subset
- Output formats: PDF, PNG, SVG, and interactive HTML
- Word-optimized versions with larger fonts (*_WORD_COMPACT.png)
Required Python Libraries:
- matplotlib, seaborn, pandas, numpy
- scipy, scikit-learn
- plotly (optional - for interactive plots)
- networkx (optional - for network analysis)
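These are all standard PyPI packages, so if any are missing from your environment an install along the following lines should cover them (run inside the pipeline's conda environment):

```bash
# Install the plotting/analysis libraries listed above into the active environment.
pip install matplotlib seaborn pandas numpy scipy scikit-learn plotly networkx
```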
```bash
# Run advanced visualizations
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --steps advanced_visualizations \
    --batch_size 1 \
    --cpus 8 \
    --memory "32G" \
    --time "0-04:00:00" \
    --log_dir ./logs

# With output suffix for organization
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --steps advanced_visualizations \
    --viz_output_suffix "_final" \
    --batch_size 1 \
    --cpus 8 \
    --memory "32G" \
    --time "0-04:00:00" \
    --log_dir ./logs
```
Processing Notes
- Runs all visualizations automatically in two phases
- Phase 1: Analyzes all MAGs in the project
- Phase 2: Creates separate visualizations for novel MAGs only
- Use --viz_output_suffix to organize multiple runs
- Large projects may take 2-4 hours to complete
Running Complete Functional Annotation Pipeline
```bash
# Run all functional annotation steps sequentially
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --steps eggnog_annotation dbcan_annotation functional_analysis \
    --mags_dir /path/to/final_mags \
    --kegg_db_file /path/to/ko.txt \
    --batch_size 1 \
    --cpus 32 \
    --memory "64G" \
    --time "0-18:00:00" \
    --log_dir ./logs
```
Step 10: Abundance Estimation
Comprehensive Abundance Analysis Pipeline
This step performs taxonomic classification and abundance estimation using your custom Kraken2 database (with novel MAGs) and Bracken for accurate abundance quantification.
Prerequisites
Required Before Running:
- Host removal completed (Step 2b)
- Kraken2 database built (Step 8.4)
- Bracken database files exist (built with Step 8.4)
Option 1: Single Sample Kraken Classification
What it does:
- Runs Kraken2 classification on individual samples
- Generates per-sample classification reports
- Simple classification without abundance refinement
Key Outputs:
- Kraken_Abundance/{sample}_kraken.txt - Raw classifications
- Kraken_Abundance/{sample}_kreport.txt - Kraken reports
```bash
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --samples-file $SAMPLES \
    --steps kraken_abundance \
    --kraken_db_path /path/to/kraken_db \
    --batch_size 10 \
    --cpus 16 \
    --memory "32G" \
    --time "0-04:00:00" \
    --log_dir ./logs
```
Option 2: Comprehensive Abundance Estimation (Recommended)
What it does:
- Runs Kraken2 classification on all samples
- Refines abundances using Bracken at multiple taxonomic levels
- Combines results into abundance matrices
- Generates comprehensive visualizations
- Creates interactive HTML dashboard
- Supports smart checkpointing (skips completed steps)
Key Features:
- Smart Checkpointing: Automatically skips already completed steps
- Multiple Taxonomic Levels: Analyzes at Species, Genus, etc.
- Metadata Integration: Group comparisons if metadata provided
- Automatic Visualizations: 8+ plot types generated
Key Outputs:
- Kraken_Abundance/ - Raw Kraken2 outputs
- Bracken_Abundance_{level}/ - Bracken estimates per taxonomic level
- Merged_Bracken_Outputs/ - Combined abundance matrices
- merged_bracken_S.txt - Species abundances
- merged_bracken_G.txt - Genus abundances
- Abundance_Plots/ - Visualizations
- Top taxa barplots
- Abundance heatmaps with clustering
- PCA analysis plots
- Alpha diversity metrics (Shannon, Simpson, Richness)
- Stacked composition plots
- Group comparison boxplots (if metadata provided)
- Beta diversity analysis (if metadata provided)
- abundance_summary.html - Interactive dashboard
Basic Usage (without metadata):
```bash
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --samples-file $SAMPLES \
    --steps abundance_estimation \
    --kraken_db_path /path/to/kraken_db \
    --batch_size 50 \
    --cpus 32 \
    --memory "64G" \
    --time "0-06:00:00" \
    --log_dir ./logs
```
Advanced Usage (with metadata for group comparisons):
```bash
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --samples-file $SAMPLES \
    --steps abundance_estimation \
    --kraken_db_path /path/to/kraken_db \
    --taxonomic_levels S G F \
    --read_length 150 \
    --threshold 10 \
    --metadata_file /path/to/metadata.csv \
    --batch_size 50 \
    --cpus 32 \
    --memory "64G" \
    --time "0-06:00:00" \
    --log_dir ./logs
```
Metadata File Format:
```text
Sample,Treatment
Sample_001,Control
Sample_002,Control
Sample_003,Treatment_A
Sample_004,Treatment_A
Sample_005,Treatment_B
```
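Group comparisons quietly lose samples whose IDs don't match between the metadata and the sample list, so it is worth checking coverage before running. A sketch assuming the CSV layout shown above:

```bash
# Illustrative only: list samples present in samples.txt but absent from metadata.csv.
comm -23 \
    <(sort samples.txt) \
    <(tail -n +2 metadata.csv | cut -d',' -f1 | sort)
```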
Important Notes
- Kraken Database Path: Must point to the database built in Step 8.4
- Bracken Requirements: Database must have kmer distribution files (database{read_length}mers.kmer_distrib)
- Input Files: Uses host-removed reads from Host_Removal directory
- Taxonomic Levels: S=Species, G=Genus, F=Family, O=Order, C=Class, P=Phylum
- Force Rerun: Delete output directories to force re-execution if needed
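A quick pre-flight check that the database from Step 8.4 is usable by both Kraken2 and Bracken might look like the following; the filename follows the database{read_length}mers.kmer_distrib pattern noted above, with the read length you plan to pass via --read_length. Treat the exact listing as illustrative.

```bash
# Illustrative pre-flight check before abundance estimation.
KRAKEN_DB=/path/to/kraken_db
READ_LEN=150
ls "$KRAKEN_DB"/*.k2d                                   # Kraken2 database files
ls "$KRAKEN_DB/database${READ_LEN}mers.kmer_distrib"    # Bracken kmer distribution
```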
Parameters Explained
Configurable Parameters:
- --kraken_db_path (Required): Path to Kraken2 database
- --taxonomic_levels: List of levels to analyze (default: S G)
- --read_length: Read length for Bracken (default: 150)
- --threshold: Minimum reads for Bracken classification (default: 10)
- --metadata_file: CSV with Sample and Treatment columns (optional)
Resource Recommendations:
- CPUs: 16-32 (Kraken2 is highly parallel)
- Memory: 32-64GB (depends on database size)
- Time: 4-6 hours for 50 samples
Visualization Examples
Generated Visualizations Include:
- Top Taxa Barplot: Shows most abundant species/genera
- Abundance Heatmap: Hierarchical clustering of samples and taxa
- PCA Analysis: Sample relationships based on composition
- Diversity Metrics: Shannon, Simpson, and Richness indices
- Stacked Composition: Relative abundances per sample
- Group Comparisons: Statistical comparisons between treatments (if metadata)
- Beta Diversity: Within vs between group distances (if metadata)
Step 11: Phylogenetic Tree Generation and Visualization
Optional Advanced Analysis
Generate custom phylogenetic trees and create publication-quality visualizations with integrated taxonomic and functional annotations.
Step 11a: Generate Phylogenetic Tree
What it does:
- Builds custom phylogenetic tree using GTDB-Tk de_novo_wf
- Uses only user MAGs (excludes GTDB reference genomes)
- Requires outgroup taxon for proper tree rooting
- Creates standardized tree files for visualization
Requirements:
- Must complete GTDB-Tk classification (Step 6) first
- Requires dereplicated MAGs from dRep
- Outgroup taxon must be specified
Key Outputs:
- Phylogeny/gtdbtk/taxonomy2/gtdbtk.tree - Main tree file
- Phylogeny/gtdbtk/taxonomy2/trees/ - All tree files
- custom_taxonomy.txt - Taxonomy mapping file
```bash
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --steps mags_tree \
    --mags_dir $WORK_DIR/Bin_Refinement/drep/dRep_output/dereplicated_genomes \
    --outgroup_taxon "p__Firmicutes" \
    --batch_size 1 \
    --cpus 32 \
    --memory "100G" \
    --time "0-06:00:00" \
    --log_dir ./logs
```
Outgroup Selection
Choose an appropriate outgroup taxon based on your data:
- p__Firmicutes - Common for Bacteroidetes-rich samples
- c__Bacilli - More specific class-level outgroup
- p__Proteobacteria - For Firmicutes-dominated samples
Step 11b: Basic Tree Visualization
What it does:
- Creates basic circular phylogenetic tree visualization
- Adds taxonomic information as colored rings
- Generates publication-ready PDF output
Key Outputs:
- tree_visualization.pdf - Basic circular tree
```bash
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --steps tree_visualization \
    --taxonomy_source $WORK_DIR/Novel_Mags/gtdbtk/gtdbtk.bac120.summary.tsv \
    --batch_size 1 \
    --cpus 8 \
    --memory "32G" \
    --time "0-02:00:00" \
    --log_dir ./logs
```
Step 11c: Enhanced Tree Visualization with Functional Annotations
What it does:
- Creates multiple publication-quality tree visualizations
- Integrates taxonomic, functional, and novel MAG information
- Generates 4 different visualization types automatically
- Adds functional annotation as pie charts or rings
Visualization Types Created:
- Simplified Taxonomic (_simplified_taxonomic.pdf)
- Circular tree with phylum and genus rings only
- Novel MAG highlighting with red stars
- Rectangular Tree (_rectangular_tree.pdf)
- Rectangular layout with colored taxonomy strips
- Enhanced fonts for publication quality
- Multi-functional (_multi_functional.pdf)
- Circular tree with CAZyme, COG, and KEGG pie charts
- Multiple annotation rings
- Enhanced Beautiful (_enhanced_beautiful.pdf)
- Single functional annotation focus
- CAZyme category pie charts
Key Outputs:
- Phylogeny/gtdbtk/taxonomy2/enhanced_visualizations/ - All visualization files
- enhanced_visualizations_finalrec/ - Rectangular and simplified trees
- enhanced_visualizations_pie/ - Pie chart visualizations
```bash
# Full enhanced visualization with all annotations
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --steps enhanced_tree_visualization \
    --taxonomy_source $WORK_DIR/Novel_Mags/gtdbtk/gtdbtk.bac120.summary.tsv \
    --novel_mags_file $WORK_DIR/Novel_Mags/novel_mags_list.txt \
    --cazyme_annotations $WORK_DIR/Functional_Annotation/cazyme_summary.xlsx \
    --cog_annotations $WORK_DIR/Functional_Annotation/cog_summary.xlsx \
    --kegg_annotations $WORK_DIR/Functional_Annotation/kegg_summary.xlsx \
    --annotation_type auto \
    --batch_size 1 \
    --cpus 8 \
    --memory "32G" \
    --time "0-02:00:00" \
    --log_dir ./logs

# Simplified version with only CAZyme annotations
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --steps enhanced_tree_visualization \
    --taxonomy_source $WORK_DIR/Novel_Mags/gtdbtk/gtdbtk.bac120.summary.tsv \
    --cazyme_annotations $WORK_DIR/Functional_Annotation/cazyme_summary.xlsx \
    --batch_size 1 \
    --cpus 8 \
    --memory "32G" \
    --time "0-02:00:00" \
    --log_dir ./logs
```
Optional Parameters
- --novel_mags_file: Text file with novel MAG IDs (one per line)
- --functional_annotations: Path to annotation directory (auto-detected)
- --annotation_type: auto, eggnog, kegg, or dbcan
- Can provide any combination of cazyme, cog, and kegg annotations
R Environment Requirements
The visualization steps require R with specific packages:
- Required: ape, RColorBrewer, jsonlite
- Optional (enhanced): ggtree, ggplot2
Install R packages:
```r
install.packages(c("ape", "RColorBrewer", "jsonlite"))

# For enhanced visualizations (optional):
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ggtree")
```
Best Practices
- Always run GTDB-Tk classification before tree generation
- Use dereplicated MAGs for cleaner trees
- Choose outgroup taxon from a distantly related phylum
- Run functional annotation steps before enhanced visualization
- Create novel_mags_list.txt from identify_novel_mags output (see the sketch after this list)
- Use higher memory (32-64GB) for large trees (>100 MAGs)
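One way to build that list is to take the file names from the identify_novel_mags output directory. The sketch below assumes the Novel_Mags/UniqueMags/ location from Step 8.1 and .fa extensions; whether the IDs should keep the extension depends on your MAG naming, so adjust the sed step if needed.

```bash
# Illustrative only: derive novel MAG IDs from the identify_novel_mags output.
# Drops the .fa extension; keep it if your MAG IDs include the extension.
ls $WORK_DIR/Novel_Mags/UniqueMags/*.fa \
    | xargs -n1 basename \
    | sed 's/\.fa$//' \
    > $WORK_DIR/Novel_Mags/novel_mags_list.txt
```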
```bash
# Step 6.5: Phylogenetic Tree (Optional)
echo "Generating phylogenetic tree..."
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --steps mags_tree \
    --mags_dir $WORK_DIR/Bin_Refinement/drep/dRep_output/dereplicated_genomes \
    --outgroup_taxon "p__Firmicutes" \
    --batch_size 1 \
    --cpus 32 \
    --memory "100G" \
    --time "0-06:00:00" \
    --log_dir $LOG_DIR

echo "Creating enhanced tree visualizations..."
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --steps enhanced_tree_visualization \
    --taxonomy_source $WORK_DIR/Novel_Mags/gtdbtk/gtdbtk.bac120.summary.tsv \
    --novel_mags_file $WORK_DIR/Novel_Mags/novel_mags_list.txt \
    --cazyme_annotations $WORK_DIR/Functional_Annotation/cazyme_summary.xlsx \
    --batch_size 1 \
    --cpus 8 \
    --memory "32G" \
    --time "0-02:00:00" \
    --log_dir $LOG_DIR
```
Complete Pipeline Script
Comprehensive Pipeline Script
This script includes all pipeline steps with proper parameters. Uncomment the sections you need and adjust paths/parameters for your project.
```bash
#!/bin/bash
#SBATCH -p ghpc
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --mem=200G
#SBATCH -t 15-00:00:00
#SBATCH --output=pipeline_complete_%j.out
#SBATCH --error=pipeline_complete_%j.err

# ============================================================================
# Complete MetaMAG Explorer Pipeline
# Version: 1.0.0
# ============================================================================

# CRITICAL: Set environment
CONDA_ENV="/your/path/miniconda3/bin/activate metamag"
METAMAG_DIR="/path/to/MetaMAG-1.0.0"
cd $METAMAG_DIR
source $CONDA_ENV

# === Core Variables (MODIFY THESE) ===
WORK_DIR="/path/to/project"
SAMPLES="/path/to/samples.txt"
PROJECT_CONFIG="/path/to/project_config.yaml"
REFERENCE="/path/to/host_genome.fa"
LOG_DIR="$WORK_DIR/logs"
mkdir -p $LOG_DIR

# === Database Paths ===
KRAKEN_DB="/path/to/kraken_db"
KEGG_DB="/path/to/ko.txt"
GTDBTK_DB="/path/to/gtdb/release"

# === Optional: Rumen-specific paths ===
RUMEN_REF_DIR="/path/to/rumen/reference_mags"
RUMEN_ADDED_DIR="/path/to/rumen/added_mags"
DATASET_DIR="/path/to/store/reference_mags"

echo "=========================================="
echo "Starting MetaMAG Explorer Pipeline"
echo "Project: $WORK_DIR"
echo "Date: $(date)"
echo "=========================================="

# ============================================================================
# PHASE 1: QUALITY CONTROL & PREPROCESSING
# ============================================================================

# Step 1.1: Quality Control
echo "[$(date)] Step 1.1: Running QC..."
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --samples-file $SAMPLES \
    --steps qc \
    --batch_size 20 \
    --cpus 4 \
    --memory "8G" \
    --time "0-02:00:00" \
    --log_dir $LOG_DIR

# Step 1.2: MultiQC Report
echo "[$(date)] Step 1.2: Running MultiQC..."
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --steps multiqc \
    --input_dir $WORK_DIR/QC \
    --batch_size 1 \
    --cpus 2 \
    --memory "4G" \
    --time "0-00:30:00" \
    --log_dir $LOG_DIR

# Step 2.1: Trimming
echo "[$(date)] Step 2.1: Running trimming..."
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --samples-file $SAMPLES \
    --steps trimming \
    --batch_size 10 \
    --cpus 8 \
    --memory "16G" \
    --time "0-04:00:00" \
    --log_dir $LOG_DIR

# Step 2.2: Host Removal
echo "[$(date)] Step 2.2: Running host removal..."
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --samples-file $SAMPLES \
    --steps host_removal \
    --reference $REFERENCE \
    --batch_size 10 \
    --cpus 16 \
    --memory "32G" \
    --time "0-06:00:00" \
    --log_dir $LOG_DIR

# ============================================================================
# PHASE 2: ASSEMBLY
# ============================================================================

# Step 3.1: Single Assembly (IDBA-UD)
echo "[$(date)] Step 3.1: Running single assembly..."
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --samples-file $SAMPLES \
    --steps single_assembly \
    --batch_size 5 \
    --cpus 32 \
    --memory "100G" \
    --time "1-00:00:00" \
    --log_dir $LOG_DIR

# Step 3.2: Co-Assembly (MEGAHIT)
echo "[$(date)] Step 3.2: Running co-assembly..."
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --samples-file $SAMPLES \
    --steps co_assembly \
    --batch_size 50 \
    --cpus 64 \
    --memory "200G" \
    --time "2-00:00:00" \
    --log_dir $LOG_DIR

# ============================================================================
# PHASE 3: BINNING & REFINEMENT
# ============================================================================

# Step 4.1: Binning
echo "[$(date)] Step 4.1: Running binning..."
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --samples-file $SAMPLES \
    --steps single_binning \
    --binning_methods metabat2 maxbin2 concoct \
    --batch_size 10 \
    --cpus 32 \
    --memory "64G" \
    --time "0-12:00:00" \
    --log_dir $LOG_DIR

# Step 4.2: Bin Refinement with DAS Tool
echo "[$(date)] Step 4.2: Running bin refinement..."
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --samples-file $SAMPLES \
    --steps single_bin_refinement \
    --score_threshold 0.5 \
    --batch_size 10 \
    --cpus 32 \
    --memory "64G" \
    --time "0-06:00:00" \
    --log_dir $LOG_DIR

# ============================================================================
# PHASE 4: QUALITY ASSESSMENT & DEREPLICATION
# ============================================================================

# Step 5.1: Quality Assessment with CheckM
echo "[$(date)] Step 5.1: Running CheckM evaluation..."
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --samples-file $SAMPLES \
    --steps evaluation \
    --batch_size 50 \
    --cpus 32 \
    --memory "64G" \
    --time "0-06:00:00" \
    --log_dir $LOG_DIR

# Step 5.2: Dereplication with dRep
echo "[$(date)] Step 5.2: Running dRep..."
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --samples-file $SAMPLES \
    --steps dRep \
    --batch_size 50 \
    --cpus 32 \
    --memory "64G" \
    --time "0-06:00:00" \
    --log_dir $LOG_DIR

# ============================================================================
# PHASE 5: TAXONOMIC CLASSIFICATION
# ============================================================================

# Step 6: GTDB-Tk Classification
echo "[$(date)] Step 6: Running GTDB-Tk..."
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --steps gtdbtk \
    --batch_size 1 \
    --cpus 64 \
    --memory "200G" \
    --time "0-06:00:00" \
    --log_dir $LOG_DIR

# ============================================================================
# PHASE 6: RUMEN REFERENCE MAGs (OPTIONAL - for rumen microbiome projects)
# ============================================================================

# # Step 7.1: Download Rumen Reference MAGs
# echo "[$(date)] Step 7.1: Downloading rumen reference MAGs..."
# python3 -m MetaMAG.main \
#     --project_config $PROJECT_CONFIG \
#     --steps rumen_refmags_download \
#     --dataset_dir $DATASET_DIR \
#     --batch_size 1 \
#     --cpus 4 \
#     --memory "16G" \
#     --time "0-12:00:00" \
#     --log_dir $LOG_DIR

# # Step 7.2: Dereplicate Reference MAGs
# echo "[$(date)] Step 7.2: Dereplicating reference MAGs..."
# python3 -m MetaMAG.main \
#     --project_config $PROJECT_CONFIG \
#     --steps rumen_refmags_drep \
#     --dataset_dir $DATASET_DIR \
#     --drep_output_dir $RUMEN_REF_DIR/drep_output \
#     --batch_size 1 \
#     --cpus 32 \
#     --memory "100G" \
#     --time "1-00:00:00" \
#     --log_dir $LOG_DIR

# # Step 7.3: Classify Reference MAGs
# echo "[$(date)] Step 7.3: Classifying reference MAGs..."
# python3 -m MetaMAG.main \
#     --project_config $PROJECT_CONFIG \
#     --steps rumen_refmags_gtdbtk \
#     --genome_dir $RUMEN_REF_DIR/drep_output/dereplicated_genomes \
#     --gtdbtk_out_dir $RUMEN_REF_DIR/gtdbtk_output \
#     --novel_output_dir $RUMEN_REF_DIR/novel_mags \
#     --genome_extension ".fa" \
#     --batch_size 1 \
#     --cpus 64 \
#     --memory "200G" \
#     --time "0-06:00:00" \
#     --log_dir $LOG_DIR

# # Step 7.4: Integrate with Project MAGs
# echo "[$(date)] Step 7.4: Integrating with reference MAGs..."
# python3 -m MetaMAG.main \
#     --project_config $PROJECT_CONFIG \
#     --samples-file $SAMPLES \
#     --steps rumen_drep \
#     --ref_mags_dir $RUMEN_REF_DIR/drep_output/dereplicated_genomes \
#     --batch_size 1 \
#     --cpus 32 \
#     --memory "100G" \
#     --time "0-12:00:00" \
#     --log_dir $LOG_DIR

# ============================================================================
# PHASE 7: NOVEL MAG PROCESSING
# ============================================================================

# Step 8.1: Identify Novel MAGs
echo "[$(date)] Step 8.1: Identifying novel MAGs..."
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --steps identify_novel_mags \
    --custom_gtdbtk_dir $WORK_DIR/Novel_Mags/gtdbtk \
    --batch_size 1 \
    --cpus 8 \
    --memory "32G" \
    --time "0-02:00:00" \
    --log_dir $LOG_DIR

# Step 8.2: Process Novel MAGs (General data)
echo "[$(date)] Step 8.2: Processing novel MAGs..."
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --steps process_novel_mags \
    --kraken_db_path $KRAKEN_DB \
    --merge_mode \
    --batch_size 1 \
    --cpus 64 \
    --memory "128G" \
    --time "0-12:00:00" \
    --log_dir $LOG_DIR

# # For rumen data, use this instead:
# python3 -m MetaMAG.main \
#     --project_config $PROJECT_CONFIG \
#     --steps process_novel_mags \
#     --is_rumen_data \
#     --rumen_ref_mags_dir $RUMEN_REF_DIR \
#     --rumen_added_mags_dir $RUMEN_ADDED_DIR \
#     --kraken_db_path $KRAKEN_DB \
#     --merge_mode \
#     --batch_size 1 \
#     --cpus 64 \
#     --memory "128G" \
#     --time "0-12:00:00" \
#     --log_dir $LOG_DIR

# Step 8.3: Add MAGs to Repository
echo "[$(date)] Step 8.3: Adding MAGs to repository..."
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --steps add_mags_to_repo \
    --batch_size 1 \
    --cpus 8 \
    --memory "16G" \
    --time "0-01:00:00" \
    --log_dir $LOG_DIR

# Step 8.4: Build/Update Kraken Database
echo "[$(date)] Step 8.4: Building Kraken database..."
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --steps build_kraken_db \
    --kraken_db_path $KRAKEN_DB \
    --batch_size 1 \
    --cpus 64 \
    --memory "200G" \
    --time "1-00:00:00" \
    --log_dir $LOG_DIR

# ============================================================================
# PHASE 8: FUNCTIONAL ANNOTATION
# ============================================================================

# Step 9.1: EggNOG Annotation
echo "[$(date)] Step 9.1: Running EggNOG annotation..."
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --steps eggnog_annotation \
    --mags_dir $WORK_DIR/Bin_Refinement/drep/dRep_output/dereplicated_genomes \
    --batch_size 1 \
    --cpus 32 \
    --memory "64G" \
    --time "0-12:00:00" \
    --log_dir $LOG_DIR

# Step 9.2: dbCAN CAZyme Annotation
echo "[$(date)] Step 9.2: Running dbCAN annotation..."
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --steps dbcan_annotation \
    --mags_dir $WORK_DIR/Bin_Refinement/drep/dRep_output/dereplicated_genomes \
    --batch_size 1 \
    --cpus 32 \
    --memory "32G" \
    --time "0-06:00:00" \
    --log_dir $LOG_DIR

# Step 9.3: Integrated Functional Analysis
echo "[$(date)] Step 9.3: Running functional analysis..."
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --steps functional_analysis \
    --kegg_db_file $KEGG_DB \
    --batch_size 1 \
    --cpus 8 \
    --memory "32G" \
    --time "0-02:00:00" \
    --log_dir $LOG_DIR

# Step 9.4: Advanced Visualizations
echo "[$(date)] Step 9.4: Creating advanced visualizations..."
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --steps advanced_visualizations \
    --viz_output_suffix "_final" \
    --batch_size 1 \
    --cpus 8 \
    --memory "32G" \
    --time "0-04:00:00" \
    --log_dir $LOG_DIR

# ============================================================================
# PHASE 9: ABUNDANCE ESTIMATION
# ============================================================================

# Step 10: Comprehensive Abundance Estimation
echo "[$(date)] Step 10: Running abundance estimation..."
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --samples-file $SAMPLES \
    --steps abundance_estimation \
    --kraken_db_path $KRAKEN_DB \
    --taxonomic_levels S G F \
    --read_length 150 \
    --threshold 10 \
    --batch_size 50 \
    --cpus 32 \
    --memory "64G" \
    --time "0-06:00:00" \
    --log_dir $LOG_DIR

# # With metadata for group comparisons:
# python3 -m MetaMAG.main \
#     --project_config $PROJECT_CONFIG \
#     --samples-file $SAMPLES \
#     --steps abundance_estimation \
#     --kraken_db_path $KRAKEN_DB \
#     --taxonomic_levels S G F \
#     --metadata_file $WORK_DIR/metadata.csv \
#     --batch_size 50 \
#     --cpus 32 \
#     --memory "64G" \
#     --time "0-06:00:00" \
#     --log_dir $LOG_DIR

# ============================================================================
# PHASE 10: PHYLOGENETIC ANALYSIS (OPTIONAL)
# ============================================================================

# Step 11.1: Generate Phylogenetic Tree
echo "[$(date)] Step 11.1: Generating phylogenetic tree..."
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --steps mags_tree \
    --mags_dir $WORK_DIR/Bin_Refinement/drep/dRep_output/dereplicated_genomes \
    --outgroup_taxon "p__Firmicutes" \
    --batch_size 1 \
    --cpus 32 \
    --memory "100G" \
    --time "0-06:00:00" \
    --log_dir $LOG_DIR

# Step 11.2: Basic Tree Visualization
echo "[$(date)] Step 11.2: Creating basic tree visualization..."
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --steps tree_visualization \
    --taxonomy_source $WORK_DIR/Novel_Mags/gtdbtk/gtdbtk.bac120.summary.tsv \
    --batch_size 1 \
    --cpus 8 \
    --memory "32G" \
    --time "0-02:00:00" \
    --log_dir $LOG_DIR

# Step 11.3: Enhanced Tree Visualization with Annotations
echo "[$(date)] Step 11.3: Creating enhanced tree visualizations..."
python3 -m MetaMAG.main \
    --project_config $PROJECT_CONFIG \
    --steps enhanced_tree_visualization \
    --taxonomy_source $WORK_DIR/Novel_Mags/gtdbtk/gtdbtk.bac120.summary.tsv \
    --novel_mags_file $WORK_DIR/Novel_Mags/novel_mags_list.txt \
    --cazyme_annotations $WORK_DIR/Functional_Annotation/cazyme_summary.xlsx \
    --cog_annotations $WORK_DIR/Functional_Annotation/cog_summary.xlsx \
    --kegg_annotations $WORK_DIR/Functional_Annotation/kegg_summary.xlsx \
    --annotation_type auto \
    --batch_size 1 \
    --cpus 8 \
    --memory "32G" \
    --time "0-02:00:00" \
    --log_dir $LOG_DIR

# ============================================================================
# PIPELINE COMPLETE
# ============================================================================

echo "=========================================="
echo "MetaMAG Explorer Pipeline Complete!"
echo "End time: $(date)"
echo "Output directory: $WORK_DIR"
echo ""
echo "Key outputs:"
echo "  - QC reports: $WORK_DIR/QC"
echo "  - Assemblies: $WORK_DIR/Assembly"
echo "  - MAGs: $WORK_DIR/Bin_Refinement/drep/dRep_output/dereplicated_genomes"
echo "  - Taxonomy: $WORK_DIR/Novel_Mags/gtdbtk"
echo "  - Annotations: $WORK_DIR/Functional_Annotation"
echo "  - Abundances: $WORK_DIR/Kraken_Abundance"
echo "  - Visualizations: $WORK_DIR/Advanced_visualizations_final"
echo "=========================================="
```
Script Usage Tips
- Comment out steps you don't need (use # at the beginning of lines)
- Uncomment optional sections like rumen processing if needed
- Adjust SLURM parameters (#SBATCH lines) based on your cluster
- Modify resource allocations (cpus, memory, time) per step as needed
- Check that all database paths are correctly set before running
- Consider running phases separately for better monitoring
Recommended Workflow Execution
- Phase 1-2: QC, Preprocessing, Assembly (1-2 days)
- Phase 3-4: Binning, Quality Assessment (1 day)
- Phase 5-7: Taxonomy, Novel MAGs (1 day)
- Phase 8: Functional Annotation (1 day)
- Phase 9-10: Abundance, Phylogenetics (1 day)
Total estimated time: 5-7 days for a complete run with 50-100 samples
Troubleshooting
Common Troubleshooting Steps
- Check log files in $LOG_DIR
- Verify input data quality
- Ensure all dependencies are installed
- Adjust computational resources as needed
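When a step fails, the SLURM logs are usually the quickest lead. A simple way to surface log files that mention errors, assuming the log directory layout used in the scripts above:

```bash
# Illustrative only: find log files under $LOG_DIR that mention errors.
grep -ril "error" "$LOG_DIR" | tail
```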