Sanger DNA Damage Analysis Pipeline
A comprehensive, modular pipeline for processing Sanger sequencing AB1 files, including quality control, alignment, consensus building, ancient DNA damage analysis, and enhanced quality control for optimal haplogroup classification.
Important
IMPORTANT DISCLAIMER - Tool Purpose & Limitations
This pipeline is NOT a tool for authenticating ancient DNA samples. It is designed for:
Prioritizing haplogroups for follow-up analysis
Evaluating sample quality based on insert size and damage patterns
Providing surrogate bootstrapped damage indicators
Assisting in haplogroup origin assessment
Guiding selection of promising samples for NGS sequencing
⚠️ All ancient DNA authentication must be performed using NGS-based methods with appropriate controls, contamination assessment, and phylogenetic analysis.
This tool provides preliminary screening to help researchers prioritize samples and resources before proceeding to more comprehensive NGS-based ancient DNA authentication workflows.
Features
Core Pipeline
Modular Architecture: Well-organized codebase with clear separation of concerns
Quality Control: Convert AB1 files with Phred quality filtering and visualization
Sequence Processing: Align forward/reverse reads and build consensus sequences
HVS Region Processing: Independent processing of HVS1, HVS2, and HVS3 regions with intelligent merging
Ancient DNA Analysis: Comprehensive aDNA damage pattern detection and assessment
Statistical Validation: Bootstrap analysis for damage assessment
Beautiful QC Reports: Interactive HTML reports with charts, tables, and analysis summaries
Command Line Interface: Easy-to-use CLI for all pipeline operations
Extensible Design: Easy to add new analysis modules and features
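As a rough illustration of the Phred quality filtering mentioned under Quality Control, the sketch below masks low-quality base calls with `N`. The function names and the masking strategy are assumptions made for illustration, not the pipeline's own API. The only fixed fact is the Phred definition, where a score Q corresponds to an error probability of 10^(-Q/10).

```python
from typing import List


def phred_to_error_prob(q: int) -> float:
    """Convert a Phred score Q to the base-call error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


def mask_low_quality(seq: str, quals: List[int], min_quality: int = 20) -> str:
    """Replace every base whose Phred score is below min_quality with 'N'.

    Hypothetical helper: the real pipeline may trim or drop reads instead.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality list must have equal length")
    return "".join(
        base if q >= min_quality else "N" for base, q in zip(seq, quals)
    )
```

With Biopython, the per-base scores of an AB1 file are typically available as `record.letter_annotations["phred_quality"]` after `SeqIO.read(path, "abi")`, so a record's sequence and that list could be passed straight to `mask_low_quality`.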
Enhanced Quality Control (NEW!)
aDNA Sequence Cleaning: Advanced removal of ancient DNA artifacts and ambiguous nucleotides
Quality Filtering: Configurable quality thresholds with 70% default for optimal results
Diversity Analysis: Comprehensive genetic diversity assessment and sample comparison
Sample Prioritization: Automated identification of highest-quality samples for downstream analysis
Quality Metrics: Detailed reports on variant counts, sample similarity, and potential quality issues
Artifact Detection: Advanced detection and removal of alignment and sequencing artifacts
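One way to picture the 70% default quality threshold is as a minimum fraction of unambiguous bases per sequence. That interpretation is an assumption, since this page does not define how the quality score is computed. The helper names below are hypothetical.

```python
def valid_base_fraction(seq: str) -> float:
    """Fraction of A/C/G/T among all non-gap characters (0.0 if empty)."""
    cleaned = seq.upper().replace("-", "")
    if not cleaned:
        return 0.0
    return sum(base in "ACGT" for base in cleaned) / len(cleaned)


def passes_quality_filter(seq: str, threshold: float = 0.70) -> bool:
    """Keep a sequence only if enough of its bases are unambiguous.

    Assumption: 'quality' here means the unambiguous-base fraction;
    the pipeline's actual metric may differ.
    """
    return valid_base_fraction(seq) >= threshold
```

Under this reading, a sequence such as `ACGTACGTNN` (80% unambiguous) would be kept, while one that is mostly `N` would be dropped before haplogroup classification.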
HSD Conversion Methods
Regional Hybrid Method: Optimal approach with 52.4 average variants per sample (recommended)
Direct Method: Alternative approach with 66.0 average variants per sample
Enhanced Converter: Improved quality control with artifact detection and filtering
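This page does not describe how the Regional Hybrid and Direct methods work internally, so the following is only a generic sketch of what an HSD conversion step does: compare a sample to a reference and write the differences as a tab-separated HSD line. It assumes the two sequences are already aligned, equal in length, and gap-free, and it handles substitutions only. All function names are hypothetical.

```python
from typing import List


def call_substitutions(reference: str, sample: str, start: int) -> List[str]:
    """List substitutions as '<position><base>', e.g. '16223T'.

    `start` is the reference coordinate of the first aligned base.
    Ambiguous bases (anything other than A/C/G/T) are skipped.
    """
    if len(reference) != len(sample):
        raise ValueError("reference and sample must be aligned to equal length")
    variants = []
    for offset, (ref_base, base) in enumerate(zip(reference.upper(), sample.upper())):
        if ref_base in "ACGT" and base in "ACGT" and base != ref_base:
            variants.append(f"{start + offset}{base}")
    return variants


def hsd_line(sample_id: str, start: int, end: int,
             variants: List[str], haplogroup: str = "?") -> str:
    """Format one HSD record: sample ID, range, haplogroup, then variants."""
    return "\t".join([sample_id, f"{start}-{end}", haplogroup, *variants])
```

Skipping ambiguous bases means an `N` is never reported as a variant, which is one simple way a converter can avoid inflating per-sample variant counts.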
Documentation Contents
Getting Started
- Installation
- Quick Start Guide
- Step 4: Generate Reports
- What It Does
- When to Use
- Quality Metrics
- Prerequisites
- Step 1: Installation
- Step 2: Prepare Your Data
- Step 3: Run the Pipeline
- Step 4: View Results
- Output Directory Structure
- Key Result Files
- Overview Tab
- Damage Analysis Tab
- Quality Control Tab
- Sample Details Tab
- Scenario 1: Basic Analysis
- Scenario 2: Custom Quality Threshold
- Scenario 3: Ancient DNA Assessment
- Key Commands
- Common Options
- Quality Control
- Damage Analysis
- HVS Regions
- For Large Datasets
- For Ancient DNA
- Pipeline Fails to Start
- No AB1 Files Found
- Quality Issues
- Memory Errors
- Configuration
User Guide
- Tutorials
- Enhanced Quality Control
- Complete Pipeline Workflow Reference
- How-To Guides
- CLI Reference
- Overview
- Available Commands
- run-pipeline
- generate-report
- analyze-damage
- status
- validate
- convert
- Global Options
- Configuration via CLI
- Chaining Commands
- Exit Codes
- Environment Variables
- Debugging and Troubleshooting
- Scripting Examples
- Enhanced Quality Control Tools
Advanced Topics
- Understanding Damage Analysis
- Scientific Background
- Damage Detection Methods
- Statistical Validation
- Damage Pattern Recognition
- Interpreting Damage Analysis Results
- Visualization and Plots
- Configuration Parameters
- Quality Control and Validation
- Advanced Interpretation
- Best Practices
- Common Pitfalls
- Further Reading
- Troubleshooting
API Reference
- API Reference
- Overview
- Architecture Overview
- Quick Start
- Module Documentation
- Configuration API
- Data Structures
- Utility Functions
- Analysis Functions
- Visualization API
- Extension Points
- Error Handling
- Testing Utilities
- Debugging and Logging
- Performance Monitoring
- Core Module
Development
Quick Start
Installation
# Clone the repository
git clone https://github.com/allyssonallan/sanger_adna_damage.git
cd sanger_adna_damage
# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Install the package in development mode
pip install -e .
Basic Usage
# Run the complete pipeline
python -m src.sanger_pipeline.cli.main run-pipeline \
    --input-dir ./input \
    --output-dir ./output \
    --config ./config/default_config.yaml
# Generate comprehensive QC report
python -m src.sanger_pipeline.cli.main generate-report \
    --output-dir ./output \
    --open-browser
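For scripted runs, the two commands above can be assembled in Python and handed to `subprocess`. This is a minimal sketch: the helper names are hypothetical, only the subcommands and flags shown above are used, and treating `--config` as optional is an assumption.

```python
import sys
from typing import List, Optional

CLI_MODULE = "src.sanger_pipeline.cli.main"


def run_pipeline_command(input_dir: str, output_dir: str,
                         config: Optional[str] = None) -> List[str]:
    """Build the argument list for the run-pipeline subcommand."""
    cmd = [sys.executable, "-m", CLI_MODULE, "run-pipeline",
           "--input-dir", input_dir, "--output-dir", output_dir]
    if config is not None:  # assumption: --config may be omitted
        cmd += ["--config", config]
    return cmd


def generate_report_command(output_dir: str,
                            open_browser: bool = False) -> List[str]:
    """Build the argument list for the generate-report subcommand."""
    cmd = [sys.executable, "-m", CLI_MODULE, "generate-report",
           "--output-dir", output_dir]
    if open_browser:
        cmd.append("--open-browser")
    return cmd
```

Calling `subprocess.run(run_pipeline_command("./input", "./output"), check=True)` from the repository root would then raise `CalledProcessError` if the pipeline exits with a non-zero status. Using `sys.executable` keeps the call inside the active virtual environment.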
Pipeline Overview
The pipeline processes Sanger sequencing data through multiple quality-controlled stages with comprehensive branching for different quality control approaches and output formats:
graph TB
    subgraph "Input Stage"
        A[AB1 Files<br/>- Forward reads<br/>- Reverse reads<br/>- HVS1/2/3 regions]
    end

    subgraph "Core Processing"
        B[AB1 Conversion<br/>Quality filtering<br/>Phred scores ≥ Q20/Q30]
        C[Quality Control<br/>Length filtering<br/>Base quality assessment]
        D[Consensus Building<br/>Forward/reverse alignment<br/>Per HVS region]
        E[Region Merging<br/>Combine HVS regions<br/>Sample consolidation]
    end

    subgraph "Analysis & QC"
        F[Damage Analysis<br/>C→T, G→A transitions<br/>Bootstrap statistics<br/>P-value calculation]
        G[Interactive Reports<br/>HTML dashboard<br/>Quality visualizations<br/>Statistical summaries]
    end

    subgraph "Enhanced Quality Control (v2.0+)"
        H[Enhanced Pipeline Entry]
        I[aDNA Sequence Cleaner<br/>- Remove artifacts<br/>- Resolve ambiguous bases<br/>- Filter poly-N regions<br/>- Quality scoring]
        J[Improved HSD Converter<br/>- Reference alignment<br/>- Quality metrics<br/>- Variant filtering<br/>- Statistical validation]
        K[Diversity Analyzer<br/>- Haplogroup diversity<br/>- Sample comparison<br/>- Quality assessment<br/>- Priority ranking]
    end

    subgraph "Output Options"
        L1[Standard HSD<br/>Basic variant calling<br/>Regional/Direct methods]
        L2[Enhanced HSD<br/>Quality-filtered variants<br/>Statistical confidence<br/>Diversity metrics]
        L3[Quality Reports<br/>Interactive dashboards<br/>Damage plots<br/>Statistical summaries]
        L4[Processed Sequences<br/>FASTA files<br/>Consensus sequences<br/>Quality scores]
    end

    subgraph "Configuration Variables"
        V1[Quality Thresholds<br/>--min-quality: 15-30<br/>--min-length: 30-100bp<br/>--quality-filter: 0.6-0.8]
        V2[Pipeline Options<br/>--alignment-tool: mafft/muscle<br/>--damage-threshold: 0.02<br/>--bootstrap-iterations: 1000]
        V3[I/O Directories<br/>--input-dir: AB1 files<br/>--output-dir: Results<br/>--config: YAML settings]
    end

    %% Main workflow
    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    F --> G

    %% Enhanced workflow branch
    E -.-> H
    H --> I
    I --> J
    J --> K

    %% Output generation
    E --> L1
    F --> L3
    G --> L3
    J --> L2
    K --> L2
    D --> L4
    E --> L4

    %% Configuration influences
    V1 -.-> B
    V1 -.-> C
    V1 -.-> I
    V2 -.-> D
    V2 -.-> F
    V2 -.-> J
    V3 -.-> A
    V3 -.-> L1
    V3 -.-> L2
    V3 -.-> L3
    V3 -.-> L4

    %% Alternative paths
    G -.-> L1
    L3 -.-> H

    %% Styling
    style A fill:#e3f2fd
    style H fill:#fff3e0
    style L2 fill:#e8f5e8
    style F fill:#fce4ec
    style K fill:#f3e5f5
    classDef inputNode fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef coreNode fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef enhancedNode fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef outputNode fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef configNode fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    class A inputNode
    class B,C,D,E coreNode
    class H,I,J,K enhancedNode
    class L1,L2,L3,L4 outputNode
    class V1,V2,V3 configNode
Ancient DNA Analysis
The pipeline includes sophisticated ancient DNA damage analysis:
Damage Pattern Detection: Identifies characteristic C→T and G→A transitions
Statistical Validation: Bootstrap analysis with 10,000 iterations
Damage Assessment: Quantitative scoring of damage patterns
Visual Reports: Damage plots and interactive visualizations
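To make the bootstrap idea concrete, the sketch below estimates a C→T transition rate from an aligned reference/sample pair and attaches a percentile bootstrap interval. It is a simplified illustration under stated assumptions, not the pipeline's implementation: sequences are assumed to be aligned and equal in length, only reference-C sites read as C or T are counted, and the function names are hypothetical. The default of 10,000 resamples follows the figure quoted above.

```python
import random
from typing import List, Tuple


def ct_indicators(reference: str, sample: str) -> List[int]:
    """Return 1 for each reference C read as T and 0 for each read as C."""
    return [
        1 if s == "T" else 0
        for r, s in zip(reference.upper(), sample.upper())
        if r == "C" and s in "CT"
    ]


def bootstrap_rate(indicators: List[int], iterations: int = 10_000,
                   seed: int = 0) -> Tuple[float, float, float]:
    """Observed C->T rate with a 95% percentile bootstrap interval.

    Sites are resampled with replacement; the interval bounds are the
    2.5th and 97.5th percentiles of the resampled rates.
    """
    if not indicators:
        raise ValueError("no informative reference-C sites")
    rng = random.Random(seed)  # fixed seed for reproducible intervals
    n = len(indicators)
    rates = sorted(
        sum(rng.choices(indicators, k=n)) / n for _ in range(iterations)
    )
    lower = rates[int(0.025 * iterations)]
    upper = rates[min(int(0.975 * iterations), iterations - 1)]
    return sum(indicators) / n, lower, upper
```

A short Sanger amplicon contains few informative sites, so the interval is usually wide. That is consistent with treating these values as surrogate screening indicators rather than evidence of authenticity. The same approach applies to G→A by swapping the bases.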
Output Structure
The pipeline creates organized output directories:
output/
├── fasta/              # Raw FASTA files converted from AB1
├── filtered/           # Quality-filtered sequences
├── consensus/          # Consensus sequences for each HVS region
├── aligned/            # Intermediate alignment files
├── final/              # Merged HVS region sequences
├── damage_analysis/    # aDNA damage analysis results
├── plots/              # Quality score plots
└── reports/            # Interactive HTML QC reports
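After a run, a quick check that every expected subdirectory exists can catch a pipeline that stopped partway through. The sketch below uses only the directory names listed above. The helper name is hypothetical, and whether every directory is always created is an assumption (some stages may be optional).

```python
from pathlib import Path
from typing import List

# Subdirectories listed in the output layout above.
EXPECTED_SUBDIRS = [
    "fasta", "filtered", "consensus", "aligned",
    "final", "damage_analysis", "plots", "reports",
]


def missing_output_dirs(output_dir: str) -> List[str]:
    """Return the expected subdirectories that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]
```

For example, `missing_output_dirs("./output")` returning `["damage_analysis", "reports"]` would suggest the run finished merging but never reached the damage analysis or reporting stages.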
Support and Contributing
Issues: Report bugs and request features on GitHub
Discussions: Join community discussions for help and ideas
Contributing: See our contributing guide for development setup
Documentation: This documentation is built with Sphinx