Sanger DNA Damage Analysis Pipeline

Python Version MIT License Documentation

A comprehensive, modular pipeline for processing Sanger sequencing AB1 files, including quality control, alignment, consensus building, ancient DNA damage analysis, and enhanced quality control for optimal haplogroup classification.

Important

IMPORTANT DISCLAIMER - Tool Purpose & Limitations

This pipeline is NOT a tool for authenticating ancient DNA samples. It is designed for:

  • Prioritizing haplogroups for follow-up analysis

  • Evaluating sample quality based on insert size and damage patterns

  • Providing surrogate bootstrapped damage indicators

  • Assisting in haplogroup origin assessment

  • Guiding selection of promising samples for NGS sequencing

⚠️ All ancient DNA authentication must be performed using NGS-based methods with appropriate controls, contamination assessment, and phylogenetic analysis.

This tool provides preliminary screening to help researchers prioritize samples and resources before proceeding to more comprehensive NGS-based ancient DNA authentication workflows.

🚀 Features

Core Pipeline

  • Modular Architecture: Well-organized codebase with clear separation of concerns

  • Quality Control: Convert AB1 files with Phred quality filtering and visualization

  • Sequence Processing: Align forward/reverse reads and build consensus sequences

  • HVS Region Processing: Independent processing of HVS1, HVS2, and HVS3 regions with intelligent merging

  • Ancient DNA Analysis: Comprehensive aDNA damage pattern detection and assessment

  • Statistical Validation: Bootstrap analysis for damage assessment

  • Beautiful QC Reports: Interactive HTML reports with charts, tables, and analysis summaries

  • Command Line Interface: Easy-to-use CLI for all pipeline operations

  • Extensible Design: Easy to add new analysis modules and features
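
The Phred-based quality filtering listed above can be illustrated with a minimal sketch. The function names and the default cut-off of Q20 are assumptions made for this example and are not taken from the pipeline's code. The only fixed fact used is the standard Phred definition, where a score Q corresponds to an error probability of 10^(-Q/10).

```python
from typing import Sequence


def phred_to_error_prob(q: int) -> float:
    """Convert a Phred score Q to its error probability, 10^(-Q/10)."""
    return 10 ** (-q / 10)


def mask_low_quality(seq: str, quals: Sequence[int], min_quality: int = 20) -> str:
    """Replace every base whose Phred score is below min_quality with 'N'.

    Hypothetical helper: the real pipeline may trim or filter differently.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality list must have equal length")
    return "".join(
        base if q >= min_quality else "N" for base, q in zip(seq, quals)
    )


masked = mask_low_quality("ACGTAC", [35, 12, 40, 19, 20, 8])
print(masked)  # ANGNAN
```

Under this definition Q20 corresponds to a 1% chance that the base call is wrong and Q30 to 0.1%, which is why stricter cut-offs discard more of a degraded read.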

Enhanced Quality Control (NEW!)

  • aDNA Sequence Cleaning: Advanced removal of ancient DNA artifacts and ambiguous nucleotides

  • Quality Filtering: Configurable quality threshold (default: 70%)

  • Diversity Analysis: Comprehensive genetic diversity assessment and sample comparison

  • Sample Prioritization: Automated identification of highest-quality samples for downstream analysis

  • Quality Metrics: Detailed reports on variant counts, sample similarity, and potential quality issues

  • Artifact Detection: Advanced detection and removal of alignment and sequencing artifacts
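
As a rough illustration of the cleaning and filtering steps above, the sketch below normalises a sequence and applies a 70% threshold. It assumes, purely for illustration, that the threshold refers to the fraction of unambiguous A/C/G/T bases. The document does not define the metric, and the function names are hypothetical.

```python
VALID_BASES = frozenset("ACGT")


def clean_sequence(seq: str) -> str:
    """Uppercase the sequence, map any non-ACGT symbol to 'N',
    and trim leading/trailing runs of 'N'."""
    normalised = "".join(
        base if base in VALID_BASES else "N" for base in seq.upper()
    )
    return normalised.strip("N")


def unambiguous_fraction(seq: str) -> float:
    """Fraction of positions that are A, C, G or T (0.0 for an empty string)."""
    if not seq:
        return 0.0
    return sum(base in VALID_BASES for base in seq) / len(seq)


def passes_quality_filter(seq: str, threshold: float = 0.7) -> bool:
    """Assumed interpretation of the 70% default: keep a sequence when at
    least `threshold` of its cleaned bases are unambiguous."""
    return unambiguous_fraction(clean_sequence(seq)) >= threshold


cleaned = clean_sequence("nnACGTRYacgtNN")
print(cleaned, unambiguous_fraction(cleaned))  # ACGTNNACGT 0.8
```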

HSD Conversion Methods

  • Regional Hybrid Method: Recommended approach, averaging 52.4 variants per sample

  • Direct Method: Alternative approach, averaging 66.0 variants per sample

  • Enhanced Converter: Improved quality control with artifact detection and filtering
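
To make the idea of HSD conversion concrete, here is a minimal sketch that lists substitutions against a reference and writes one tab-separated HSD-style line (sample ID, range, haplogroup, polymorphisms). It is not the Regional Hybrid or Direct method. It assumes the sample is already aligned to the reference without indels, and the toy reference string stands in for the real mitochondrial reference. All function names are hypothetical.

```python
from typing import List


def call_substitutions(reference: str, sample: str, start: int) -> List[str]:
    """Return substitutions such as '16026A' for two pre-aligned,
    equal-length sequences. `start` is the reference coordinate of
    the first column. Ambiguous bases and gaps are skipped."""
    variants = []
    for offset, (ref_base, base) in enumerate(zip(reference, sample)):
        if ref_base in "ACGT" and base in "ACGT" and base != ref_base:
            variants.append(f"{start + offset}{base}")
    return variants


def to_hsd_line(sample_id: str, covered_range: str, variants: List[str],
                haplogroup: str = "?") -> str:
    """Join the fields with tabs; '?' marks an unassigned haplogroup."""
    return "\t".join([sample_id, covered_range, haplogroup, *variants])


# Toy data only: an 8-base stand-in for a reference segment.
toy_reference = "ACGTACGT"
toy_sample = "ACATACNT"
variants = call_substitutions(toy_reference, toy_sample, start=16024)
hsd_line = to_hsd_line("S1", "16024-16031", variants)
print(hsd_line)
```

Skipping `N` positions keeps low-quality calls from being reported as variants, which is one plausible reason different conversion methods yield different variant counts per sample.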

📚 Documentation Contents

🎯 Quick Start

Installation

# Clone the repository
git clone https://github.com/allyssonallan/sanger_adna_damage.git
cd sanger_adna_damage

# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install the package in development mode
pip install -e .

Basic Usage

# Run the complete pipeline
python -m src.sanger_pipeline.cli.main run-pipeline \
    --input-dir ./input \
    --output-dir ./output \
    --config ./config/default_config.yaml

# Generate comprehensive QC report
python -m src.sanger_pipeline.cli.main generate-report \
    --output-dir ./output \
    --open-browser

📊 Pipeline Overview

The pipeline processes Sanger sequencing data through multiple quality-controlled stages with comprehensive branching for different quality control approaches and output formats:

    graph TB
    subgraph "Input Stage"
        A[📁 AB1 Files<br/>- Forward reads<br/>- Reverse reads<br/>- HVS1/2/3 regions]
    end

    subgraph "Core Processing"
        B[🔄 AB1 Conversion<br/>Quality filtering<br/>Phred scores ≥ Q20/Q30]
        C[🧹 Quality Control<br/>Length filtering<br/>Base quality assessment]
        D[🔗 Consensus Building<br/>Forward/reverse alignment<br/>Per HVS region]
        E[🧩 Region Merging<br/>Combine HVS regions<br/>Sample consolidation]
    end

    subgraph "Analysis & QC"
        F[🧬 Damage Analysis<br/>C→T, G→A transitions<br/>Bootstrap statistics<br/>P-value calculation]
        G[📊 Interactive Reports<br/>HTML dashboard<br/>Quality visualizations<br/>Statistical summaries]
    end

    subgraph "Enhanced Quality Control (v2.0+)"
        H[✨ Enhanced Pipeline Entry]
        I[🧪 aDNA Sequence Cleaner<br/>- Remove artifacts<br/>- Resolve ambiguous bases<br/>- Filter poly-N regions<br/>- Quality scoring]
        J[📝 Improved HSD Converter<br/>- Reference alignment<br/>- Quality metrics<br/>- Variant filtering<br/>- Statistical validation]
        K[📈 Diversity Analyzer<br/>- Haplogroup diversity<br/>- Sample comparison<br/>- Quality assessment<br/>- Priority ranking]
    end

    subgraph "Output Options"
        L1[📋 Standard HSD<br/>Basic variant calling<br/>Regional/Direct methods]
        L2[🎯 Enhanced HSD<br/>Quality-filtered variants<br/>Statistical confidence<br/>Diversity metrics]
        L3[📊 Quality Reports<br/>Interactive dashboards<br/>Damage plots<br/>Statistical summaries]
        L4[📁 Processed Sequences<br/>FASTA files<br/>Consensus sequences<br/>Quality scores]
    end

    subgraph "Configuration Variables"
        V1[⚙️ Quality Thresholds<br/>--min-quality: 15-30<br/>--min-length: 30-100bp<br/>--quality-filter: 0.6-0.8]
        V2[🔧 Pipeline Options<br/>--alignment-tool: mafft/muscle<br/>--damage-threshold: 0.02<br/>--bootstrap-iterations: 1000]
        V3[📂 I/O Directories<br/>--input-dir: AB1 files<br/>--output-dir: Results<br/>--config: YAML settings]
    end

    %% Main workflow
    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    F --> G

    %% Enhanced workflow branch
    E -.-> H
    H --> I
    I --> J
    J --> K

    %% Output generation
    E --> L1
    F --> L3
    G --> L3
    J --> L2
    K --> L2
    D --> L4
    E --> L4

    %% Configuration influences
    V1 -.-> B
    V1 -.-> C
    V1 -.-> I
    V2 -.-> D
    V2 -.-> F
    V2 -.-> J
    V3 -.-> A
    V3 -.-> L1
    V3 -.-> L2
    V3 -.-> L3
    V3 -.-> L4

    %% Alternative paths
    G -.-> L1
    L3 -.-> H

    %% Styling
    style A fill:#e3f2fd
    style H fill:#fff3e0
    style L2 fill:#e8f5e8
    style F fill:#fce4ec
    style K fill:#f3e5f5

    classDef inputNode fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef coreNode fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef enhancedNode fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef outputNode fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef configNode fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px

    class A inputNode
    class B,C,D,E coreNode
    class H,I,J,K enhancedNode
    class L1,L2,L3,L4 outputNode
    class V1,V2,V3 configNode
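
The consensus-building stage in the diagram can be sketched in a few lines. This is a simplified illustration rather than the pipeline's implementation: it assumes the reverse read has already been reverse-complemented and aligned column by column to the forward read (for example by an external aligner), and the disagreement rule (emit `N`) is an assumption.

```python
COMPLEMENT = str.maketrans("ACGTN-", "TGCAN-")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def consensus(forward: str, reverse_rc: str) -> str:
    """Column-wise consensus of two pre-aligned, equal-length reads.

    Agreeing columns keep the base, a column where one read has 'N'
    or a gap takes the other read's symbol, and conflicting calls
    become 'N'.
    """
    if len(forward) != len(reverse_rc):
        raise ValueError("reads must be aligned to the same length")
    merged = []
    for f, r in zip(forward, reverse_rc):
        if f == r:
            merged.append(f)
        elif f in "N-":
            merged.append(r)
        elif r in "N-":
            merged.append(f)
        else:
            merged.append("N")
    return "".join(merged)


reverse_read = "TCAGGT"
merged = consensus("ACGTNA", reverse_complement(reverse_read))
print(merged)  # ACNTGA
```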
    

🔬 Ancient DNA Analysis

The pipeline includes sophisticated ancient DNA damage analysis:

  • Damage Pattern Detection: Identifies characteristic C→T and G→A transitions

  • Statistical Validation: Bootstrap analysis with 10,000 iterations

  • Damage Assessment: Quantitative scoring of damage patterns

  • Visual Reports: Damage plots and interactive visualizations
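
The combination of transition counting and bootstrap validation described above can be sketched as follows. This is an illustrative approximation under stated assumptions, not the pipeline's algorithm: it treats the reference and sample as pre-aligned strings, resamples alignment columns with replacement, and reports a percentile interval. The function names, the fixed seed, and the default iteration count are hypothetical choices.

```python
import random
from typing import Iterable, Tuple

DAMAGE_PAIRS = {("C", "T"), ("G", "A")}


def damage_rate(reference: Iterable[str], sample: Iterable[str]) -> float:
    """Fraction of reference C/G positions observed as T/A respectively."""
    eligible = damaged = 0
    for ref, obs in zip(reference, sample):
        if ref in "CG" and obs in "ACGT":
            eligible += 1
            if (ref, obs) in DAMAGE_PAIRS:
                damaged += 1
    return damaged / eligible if eligible else 0.0


def bootstrap_ci(reference: str, sample: str, iterations: int = 1000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the damage rate."""
    columns = list(zip(reference, sample))
    if not columns:
        raise ValueError("empty alignment")
    rng = random.Random(seed)
    rates = []
    for _ in range(iterations):
        # Resample alignment columns with replacement.
        resampled = rng.choices(columns, k=len(columns))
        refs, obs = zip(*resampled)
        rates.append(damage_rate(refs, obs))
    rates.sort()
    lower = rates[int(alpha / 2 * iterations)]
    upper = rates[int((1 - alpha / 2) * iterations) - 1]
    return lower, upper


rate = damage_rate("CCGGAATT", "TCAGAATT")
low, high = bootstrap_ci("CCGGAATT", "TCAGAATT", iterations=200)
print(rate, low, high)
```

A wide interval on a short consensus sequence is expected, which is consistent with treating these values as screening indicators rather than authentication evidence.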

📈 Output Structure

The pipeline creates organized output directories:

output/
├── fasta/              # Raw FASTA files converted from AB1
├── filtered/           # Quality-filtered sequences
├── consensus/          # Consensus sequences for each HVS region
├── aligned/            # Intermediate alignment files
├── final/              # Merged HVS region sequences
├── damage_analysis/    # aDNA damage analysis results
├── plots/              # Quality score plots
└── reports/            # Interactive HTML QC reports
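
A small helper can check a finished run against the layout above. The directory names come from the tree. The function itself is a hypothetical convenience and is not part of the pipeline.

```python
import tempfile
from pathlib import Path
from typing import List

# Subdirectory names as listed in the output tree.
OUTPUT_SUBDIRS = (
    "fasta", "filtered", "consensus", "aligned",
    "final", "damage_analysis", "plots", "reports",
)


def missing_output_dirs(output_dir: str) -> List[str]:
    """Return the expected subdirectories that do not exist yet."""
    root = Path(output_dir)
    return [name for name in OUTPUT_SUBDIRS if not (root / name).is_dir()]


# Demonstration on a temporary directory holding only the first stages.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("fasta", "filtered", "consensus"):
        (Path(tmp) / name).mkdir()
    missing = missing_output_dirs(tmp)

print(missing)
```

An empty result suggests every stage created its directory. It does not confirm that the files inside are complete.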

📞 Support and Contributing

  • Issues: Report bugs and request features on GitHub

  • Discussions: Join community discussions for help and ideas

  • Contributing: See our contributing guide for development setup

  • Documentation: This documentation is built with Sphinx

Indices and Tables