A Comprehensive Metagenome-Assembled Genome Analysis Pipeline
MetaMAG Explorer is a comprehensive pipeline designed for the identification, processing, and analysis of Metagenome-Assembled Genomes (MAGs) from metagenomic data. This pipeline excels in detecting novel MAGs and integrating them into reference databases, with specialized handling for rumen microbiome data.
The pipeline incorporates state-of-the-art tools for quality control, assembly, binning, and functional annotation, providing a complete solution for metagenomic analysis from raw reads to functionally annotated genomes and phylogenetic placement.
MetaMAG Explorer is organized into several major modules that can be run sequentially or independently:
The pipeline consists of these main components:
A unique feature of this pipeline is its comprehensive novel MAG detection and database integration module:
This specialized workflow identifies novel MAGs, dereplicates them against existing repositories, and integrates them into reference databases for improved future analyses.
The pipeline begins with quality assessment of raw sequencing reads using FastQC, which generates comprehensive reports on sequence quality, adapter content, and other metrics.
Low-quality bases and adapter sequences are removed using fastp. This step improves assembly quality by removing error-prone regions of reads.
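As a rough illustration of the trimming step, the sketch below builds a fastp command for one paired-end sample. The `<sample>_1.fastq.gz` naming scheme, the output file names, and the thread count are assumptions made for the example; the options the pipeline actually passes to fastp are not documented here.

```python
from pathlib import Path


def fastp_command(r1, r2, out_dir, threads=8):
    """Build a fastp command for one paired-end sample (illustrative defaults)."""
    out = Path(out_dir)
    # Hypothetical naming convention: <sample>_1.fastq.gz / <sample>_2.fastq.gz
    sample = Path(r1).name.split("_")[0]
    return [
        "fastp",
        "-i", str(r1),
        "-I", str(r2),
        "-o", str(out / f"{sample}_trimmed_1.fastq.gz"),
        "-O", str(out / f"{sample}_trimmed_2.fastq.gz"),
        "--detect_adapter_for_pe",   # infer adapters from read-pair overlap
        "--thread", str(threads),
        "--json", str(out / f"{sample}.fastp.json"),
        "--html", str(out / f"{sample}.fastp.html"),
    ]
```

The returned list can be handed to `subprocess.run(cmd, check=True)` once fastp is on the `PATH`.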
For metagenomes from host-associated environments (e.g., rumen, human gut), the pipeline maps reads against host reference genomes using BWA-MEM and removes host-derived sequences. This step is crucial for accurate microbial genome assembly.
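One common way to implement this kind of host filtering is to align with BWA-MEM and keep only read pairs in which neither mate maps to the host. The sketch below assembles such a shell pipeline as a string; it is an assumption about how the step could be done, not the pipeline's own script, and the index path, file names, and thread count are placeholders.

```python
import shlex


def host_removal_pipeline(host_index, r1, r2, out_prefix, threads=8):
    """Return a shell pipeline that keeps pairs where neither mate maps to the host."""
    align = ["bwa", "mem", "-t", str(threads), host_index, r1, r2]
    # -f 12: read unmapped AND mate unmapped; -F 256: drop secondary alignments
    keep_unmapped = ["samtools", "view", "-b", "-f", "12", "-F", "256", "-"]
    # Write the surviving pairs back to FASTQ; discard singletons and other reads
    to_fastq = [
        "samtools", "fastq",
        "-1", f"{out_prefix}_1.fastq.gz",
        "-2", f"{out_prefix}_2.fastq.gz",
        "-0", "/dev/null", "-s", "/dev/null", "-n", "-",
    ]
    return " | ".join(shlex.join(step) for step in (align, keep_unmapped, to_fastq))
```

The host genome must be indexed beforehand with `bwa index`.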
Individual samples are assembled using IDBA-UD, an iterative De Bruijn graph assembler optimized for uneven-depth metagenomic data. This approach works well for capturing dominant genomes in individual samples.
Reads from multiple samples are combined and assembled using MEGAHIT, which is memory-efficient and can handle large datasets. Co-assembly improves recovery of low-abundance genomes that may be present across multiple samples.
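MEGAHIT accepts comma-separated lists of forward and reverse read files, which is one way to pool samples for co-assembly. The helper below is a minimal sketch under that assumption; the function name, file names, and thread count are illustrative only.

```python
def megahit_coassembly_command(samples, out_dir, threads=16):
    """Build a MEGAHIT command from a list of (forward, reverse) FASTQ pairs."""
    if not samples:
        raise ValueError("at least one sample is required")
    forward = ",".join(r1 for r1, _ in samples)
    reverse = ",".join(r2 for _, r2 in samples)
    return ["megahit", "-1", forward, "-2", reverse, "-o", out_dir, "-t", str(threads)]
```

Note that MEGAHIT refuses to write into an output directory that already exists, so `out_dir` should be a fresh path.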
MetaQUAST is used to evaluate assembly quality metrics such as N50, total assembly length, and contig length distribution.
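For readers unfamiliar with the metric, N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. A small self-contained sketch of that definition (not MetaQUAST's own code):

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (0 for an empty input)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # First contig at which the cumulative length reaches half the assembly
        if running >= half_total:
            return length
    return 0
```

For example, contigs of lengths 2 through 10 sum to 54; the three longest (10, 9, 8) reach 27, so the N50 is 8.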
The pipeline employs multiple binning algorithms to group contigs into putative genomes:
Results from different binning tools are integrated using DAS Tool, which selects the optimal bins from each binning approach to create a non-redundant set of high-quality MAGs.
dRep is used to identify and remove redundant MAGs based on genome similarity, ensuring a non-redundant set of genomes for downstream analysis.
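CheckM2 assesses MAG quality by estimating completeness and contamination with machine-learning models trained on genomic features, rather than the lineage-specific single-copy marker gene sets used by the original CheckM. This quality assessment is crucial for identifying high-quality MAGs for further analysis.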
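Completeness and contamination estimates are often turned into quality tiers. The sketch below uses the widely cited MIMAG-style cut-offs (high: >90% complete and <5% contaminated; medium: at least 50% complete and <10% contaminated) and ignores the rRNA/tRNA criteria of the full standard. Whether the pipeline applies exactly these thresholds is an assumption.

```python
def quality_tier(completeness, contamination):
    """Classify a MAG from completeness/contamination percentages (assumed cut-offs)."""
    if completeness > 90 and contamination < 5:
        return "high"
    if completeness >= 50 and contamination < 10:
        return "medium"
    return "low"
```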
GTDB-Tk assigns taxonomy to MAGs based on the Genome Taxonomy Database. This step is critical for identifying potentially novel genomes that lack species-level assignments.
MAGs with missing species-level assignments in GTDB-Tk results are identified as potentially novel. The pipeline extracts these candidates for further processing.
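The selection rule can be sketched as follows, assuming the standard GTDB-Tk summary table with `user_genome` and `classification` columns. The function names are hypothetical, and treating genomes with no species field at all (for example, unclassified ones) as candidates is a choice made for this example.

```python
import csv


def lacks_species(classification):
    """True when the GTDB taxonomy string has no species name (empty 's__')."""
    ranks = [rank.strip() for rank in classification.split(";")]
    species = [rank for rank in ranks if rank.startswith("s__")]
    # No species field at all is also treated as "no species assignment"
    return not species or species[-1] == "s__"


def novel_candidates(lines):
    """Yield-free helper: list genome IDs lacking a species assignment.

    `lines` is an open summary file or any iterable of tab-separated lines
    whose header contains 'user_genome' and 'classification'.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    return [row["user_genome"] for row in reader if lacks_species(row["classification"])]
```

Typical use would be `novel_candidates(open("gtdbtk.bac120.summary.tsv"))`, with the file name depending on the GTDB-Tk run.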
Novel MAG candidates are dereplicated against a repository of previously identified MAGs to ensure they represent truly unique genomes not already present in the database.
For rumen data, additional dereplication is performed against dedicated rumen reference MAGs. This ensures that MAGs unique to the rumen microbiome are properly identified and processed.
Verified novel MAGs are added to a central MAG repository for future reference and dereplication.
The pipeline integrates novel MAGs into a Kraken2 database for improved metagenomic classification in future analyses. This step includes:
MAGs are functionally annotated using eggNOG-mapper, which assigns orthology groups, KEGG orthology (KO) terms, and Gene Ontology (GO) terms to predicted genes.
The dbCAN tool is used to identify Carbohydrate-Active Enzymes (CAZymes) in MAGs, which are particularly important for understanding polysaccharide metabolism in the rumen microbiome.
KEGG annotations are analyzed to identify key metabolic pathways and functions present in the MAGs, providing insights into their ecological roles.
Phylogenetic trees are constructed based on marker genes to establish evolutionary relationships among the MAGs and reference genomes.
Trees are visualized with taxonomic and functional annotations to facilitate interpretation of evolutionary relationships and functional traits.
The updated Kraken2 database, which includes novel MAGs, is used to estimate the abundance of MAGs across samples, providing insights into community composition and dynamics.
The pipeline includes specialized handling for rumen microbiome data:
This specialized handling improves the characterization of the rumen microbiome, which is crucial for understanding feed efficiency, methane emissions, and overall cattle health.
The pipeline's Kraken database integration module enables continuous improvement of metagenomic classification:
This integration enables improved classification of metagenomic sequences in future analyses by incorporating project-specific novel genomes into the reference database.
The pipeline generates a structured output directory containing the following key files and directories:
output_dir/
├── QC/                       # Quality control reports
├── Trimming/                 # Trimmed read files
├── Host_Removal/             # Host-filtered read files
├── Assembly/
│   ├── IDBA/                 # Single-sample assemblies
│   └── MEGAHIT/              # Co-assemblies
├── MetaQUAST/                # Assembly quality evaluation
├── Binning/                  # Raw binning results
├── Bin_Refinement/           # Refined bins
│   └── drep/                 # Dereplicated MAGs
├── Evaluation/               # CheckM2 quality assessment
├── Novel_Mags/
│   ├── gtdbtk/               # GTDB-Tk classification results
│   ├── UniqueMags/           # Candidate novel MAGs
│   ├── filtered_NMAGs/       # MAGs after repository dereplication
│   └── true_novel_MAGs/      # Final set of novel MAGs
├── MAGs_Repository/          # Central repository of all MAGs
├── Kraken_Database/          # Kraken2 database with novel MAGs
├── Functional_Annotation/
│   ├── eggNOG/               # eggNOG annotation results
│   ├── dbCAN/                # CAZyme annotations
│   └── KEGG/                 # KEGG functional analysis
├── Phylogeny/                # Phylogenetic trees
└── Abundance/                # MAG abundance estimates
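A quick way to see which stages of a run have produced output is to check for the directories listed above. The helper below is a hypothetical convenience script, not part of the pipeline; the directory names are taken from the layout shown.

```python
from pathlib import Path

# Directory names copied from the documented output layout
EXPECTED_DIRS = [
    "QC", "Trimming", "Host_Removal",
    "Assembly/IDBA", "Assembly/MEGAHIT", "MetaQUAST",
    "Binning", "Bin_Refinement/drep", "Evaluation",
    "Novel_Mags/gtdbtk", "Novel_Mags/UniqueMags",
    "Novel_Mags/filtered_NMAGs", "Novel_Mags/true_novel_MAGs",
    "MAGs_Repository", "Kraken_Database",
    "Functional_Annotation/eggNOG", "Functional_Annotation/dbCAN",
    "Functional_Annotation/KEGG", "Phylogeny", "Abundance",
]


def missing_outputs(output_dir):
    """Return the expected sub-directories that do not exist under output_dir."""
    root = Path(output_dir)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
```

Because modules can be run independently, a non-empty result does not necessarily indicate a failure; it may simply reflect steps that were not requested.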
Follow the installation steps and run the pipeline to process your metagenomic data from raw reads to annotated genomes. You can run all steps or just the parts you need.
A: MetaMAG Explorer can process shotgun metagenomic data from any environment, with specialized features for host-associated microbiomes, particularly the rumen microbiome.
A: The current pipeline is designed for short-read Illumina data. Long reads would require different assemblers (like metaFlye or Canu) and modified quality control steps.
A: MAGs are considered potentially novel if they lack a species-level classification in GTDB-Tk results (indicated by an empty "s__" field at the end of the taxonomy string). They are confirmed as novel if they pass dereplication against existing MAG repositories.
A: The pipeline can incorporate MAGs from major rumen microbiome projects including RUG (Rumen Uncultured Genomes), RMGMC (Rumen Microbial Genomics Multi-Country), and MGnify rumen MAGs.
A: Rumen-specific processing includes additional dereplication against established rumen reference MAGs and maintains a separate collection of rumen-specific novel MAGs for future reference.
A: Yes, the pipeline can be adapted for other host-associated microbiomes by providing the appropriate host reference genome for host removal and reference MAG collections for dereplication.
A: Resource requirements vary by step and dataset size. As a general guideline:
A: Running steps in parallel, using more CPU cores, and allocating sufficient memory will reduce runtime.
A: Yes, resource requirements (CPU, memory, time) can be customized per step in the configuration file or SLURM job submission script.
A: Yes, the pipeline supports checkpointing and modular execution, so failed steps can be rerun independently.
A: Yes, the pipeline is designed for HPC environments and can distribute tasks across multiple nodes for efficiency.
Issue: Jobs fail with out-of-memory errors.
Solution: Increase the memory allocation in the --memory parameter or reduce the batch size.
Issue: GTDB-Tk fails with database-related errors.
Solution: Ensure the GTDB-Tk database is properly installed and the environment variable GTDBTK_DATA_PATH is set correctly.
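A pre-flight check along these lines can catch the most common cause before a job is submitted. It is a minimal sketch: the function name is hypothetical, and it only verifies that the variable is set and points to an existing directory, not that the database contents are complete.

```python
import os
from pathlib import Path


def check_gtdbtk_data_path(environ=None):
    """Return an error message if GTDBTK_DATA_PATH looks wrong, else None."""
    environ = os.environ if environ is None else environ
    value = environ.get("GTDBTK_DATA_PATH")
    if not value:
        return "GTDBTK_DATA_PATH is not set"
    if not Path(value).is_dir():
        return f"GTDBTK_DATA_PATH points to a missing directory: {value}"
    return None
```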
Issue: dRep fails during genome comparison.
Solution: Check if input MAGs meet minimum quality requirements (completeness > 50%, contamination < 10%). Increase memory allocation for dRep.
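To see which MAGs meet these thresholds before running dRep, the CheckM2 quality report can be filtered directly. The sketch assumes the report's `Name`, `Completeness`, and `Contamination` columns; the function name and defaults are illustrative.

```python
import csv


def passing_mags(report_lines, min_completeness=50.0, max_contamination=10.0):
    """List MAG names whose CheckM2 estimates satisfy both thresholds.

    `report_lines` is an open quality_report.tsv or any iterable of its lines.
    """
    reader = csv.DictReader(report_lines, delimiter="\t")
    return [
        row["Name"]
        for row in reader
        if float(row["Completeness"]) > min_completeness
        and float(row["Contamination"]) < max_contamination
    ]
```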
Issue: Kraken database building fails.
Solution: Ensure sufficient disk space for the database. Check if the taxonomy files (nodes.dmp, names.dmp) are properly formatted.
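NCBI-style taxonomy dump files separate fields with `<tab>|<tab>` and end each line with `<tab>|`. The sketch below checks a single `nodes.dmp` line against that layout; the function names are hypothetical, and only the first three fields (taxon ID, parent ID, rank) are validated.

```python
def parse_dmp_line(line):
    """Split one NCBI taxonomy dump line into its fields."""
    line = line.rstrip("\n")
    if not line.endswith("\t|"):
        raise ValueError("line does not end with '<tab>|'")
    return line[:-2].split("\t|\t")


def check_nodes_line(line):
    """Validate a nodes.dmp line and return (tax_id, parent_tax_id, rank)."""
    fields = parse_dmp_line(line)
    if len(fields) < 3:
        raise ValueError("expected at least tax_id, parent tax_id and rank")
    tax_id, parent_id, rank = fields[:3]
    if not (tax_id.isdigit() and parent_id.isdigit()):
        raise ValueError("tax_id and parent tax_id must be integers")
    return int(tax_id), int(parent_id), rank
```

Running `check_nodes_line` over every line of the file and reporting the first failure usually pinpoints a malformed custom entry.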
For additional troubleshooting and support, please open an issue in the GitHub repository or contact the developers directly.