Quick Start Guide

Step 3: Run the Pipeline

Standard Pipeline:

# Run the complete pipeline
python scripts/run_pipeline.py run-pipeline \
    --input-dir ./input \
    --output-dir ./output_q30 \
    --quality 30 \
    --verbose

Enhanced Quality Control Pipeline (Recommended for aDNA):

# Step 1: Run standard pipeline
python scripts/run_pipeline.py run-pipeline \
    --input-dir ./input \
    --output-dir ./output_q30 \
    --quality 30 \
    --verbose

# Step 2: Apply enhanced quality control
python enhanced_hsd_converter.py

# Results will include:
# - output_q30_final_cleaned.fasta: Cleaned sequences
# - output_q30_final_high_quality.hsd: High-quality HSD file
# - Diversity analysis report

Step 4: Generate Reports

# Generate comprehensive HTML report
python generate_report.py ./output_q30

🆕 Enhanced Quality Control Features

The enhanced pipeline provides advanced quality control specifically designed for ancient DNA:

What It Does

  • Artifact Removal: Eliminates common aDNA sequencing artifacts

  • Quality Filtering: Applies a 70% quality threshold by default

  • Diversity Analysis: Comprehensive genetic diversity assessment

  • Sample Prioritization: Identifies highest-quality samples automatically

When to Use

Use the enhanced pipeline when:

  • Working with ancient DNA samples

  • Need optimal HSD files for haplogroup analysis

  • Want comprehensive quality assessment

  • Require sample prioritization for downstream analysis

Quality Metrics

The enhanced pipeline provides detailed metrics:

  • Variant Statistics: Range and distribution of variants per sample

  • Sample Similarity: Genetic similarity analysis between samples

  • Quality Flags: Automatic detection of potential quality issues

  • Retention Rates: Percentage of samples passing quality filters

This guide will get you up and running with the Sanger DNA Damage Analysis Pipeline in just a few minutes.

Warning

Important: Tool Purpose & Scope

This pipeline is designed for preliminary screening and haplogroup prioritization, not for definitive ancient DNA authentication. Use this tool to:

  • Prioritize promising samples for NGS analysis

  • Assess sequence quality and damage indicators

  • Guide resource allocation decisions

Definitive aDNA authentication requires NGS-based methods with proper controls and contamination assessment.

🚀 5-Minute Quick Start

Prerequisites

  • Python 3.8+ installed

  • MAFFT installed (see Installation for details)

  • AB1 sequencing files ready to analyze

Step 1: Installation

# Clone and install
git clone https://github.com/allyssonallan/sanger_adna_damage.git
cd sanger_adna_damage

# Set up environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
pip install -e .

Step 2: Prepare Your Data

# Create project directory
mkdir my_analysis
cd my_analysis

# Create input directory and add your AB1 files
mkdir input
# Copy your .ab1 files to the input/ directory

Step 3: Run the Pipeline

# Run complete analysis
python -m src.sanger_pipeline.cli.main run-pipeline \
    --input-dir ./input \
    --output-dir ./output \
    --config ../config/default_config.yaml

Step 4: View Results

# Generate interactive QC report
python -m src.sanger_pipeline.cli.main generate-report \
    --output-dir ./output \
    --open-browser

That’s it! Your browser will open with a beautiful interactive report showing all your results.

📊 Understanding Your Results

Output Directory Structure

After running the pipeline, your output directory will contain:

output/
├── fasta/              # Raw FASTA conversions from AB1
│   ├── sample1_F.fasta
│   ├── sample1_R.fasta
│   └── ...
├── filtered/           # Quality-filtered sequences
│   ├── sample1_F_filtered.fasta
│   ├── sample1_R_filtered.fasta
│   └── ...
├── consensus/          # Consensus sequences by HVS region
│   ├── sample1_HVS1_consensus.fasta
│   ├── sample1_HVS2_consensus.fasta
│   ├── sample1_HVS3_consensus.fasta
│   └── ...
├── final/              # Final merged sequences
│   ├── sample1_final.fasta
│   └── ...
├── damage_analysis/    # Ancient DNA damage analysis
│   ├── sample1_damage_analysis.json
│   └── ...
├── plots/              # Quality score visualizations
│   ├── sample1_F_quality.png
│   └── ...
└── reports/            # Interactive HTML reports
    └── qc_report_TIMESTAMP.html
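To confirm that every stage produced output, a short script can count the files in each subdirectory. This is a minimal sketch that assumes the layout shown above; `summarize_output` is a hypothetical helper, not part of the pipeline.

```python
from pathlib import Path

# Stage subdirectories, taken from the directory tree in this guide.
STAGES = ["fasta", "filtered", "consensus", "final",
          "damage_analysis", "plots", "reports"]


def summarize_output(output_dir):
    """Return {stage: number of files} for each expected stage directory."""
    root = Path(output_dir)
    summary = {}
    for stage in STAGES:
        stage_dir = root / stage
        if stage_dir.is_dir():
            summary[stage] = sum(1 for p in stage_dir.iterdir() if p.is_file())
        else:
            # A missing directory is reported as 0 files instead of raising.
            summary[stage] = 0
    return summary


if __name__ == "__main__":
    for stage, count in summarize_output("./output").items():
        print(f"{stage:16s} {count}")
```

A stage that reports 0 files while earlier stages are populated is a reasonable place to start looking for processing errors.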

Key Result Files

Final Sequences (final/ directory)

Your processed, consensus sequences ready for downstream analysis

Damage Analysis (damage_analysis/ directory)

JSON files containing ancient DNA damage assessments and statistics

QC Reports (reports/ directory)

Interactive HTML reports with comprehensive analysis summaries
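The damage analysis files are plain JSON, so they can be loaded for custom downstream checks. The sketch below assumes only the `*_damage_analysis.json` naming shown in the directory tree. The JSON schema is not described in this guide, so the example lists top-level keys instead of reading specific fields. `load_damage_results` is a hypothetical helper.

```python
import json
from pathlib import Path

SUFFIX = "_damage_analysis.json"


def load_damage_results(damage_dir):
    """Map sample name -> parsed JSON for every *_damage_analysis.json file."""
    results = {}
    for path in sorted(Path(damage_dir).glob("*" + SUFFIX)):
        sample = path.name[: -len(SUFFIX)]
        with path.open(encoding="utf-8") as handle:
            results[sample] = json.load(handle)
    return results


if __name__ == "__main__":
    for sample, data in load_damage_results("./output/damage_analysis").items():
        # Schema is an unknown here: only report what is present at the top level.
        keys = sorted(data) if isinstance(data, dict) else type(data).__name__
        print(sample, keys)
```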

📈 Interpreting the QC Report

The interactive QC report includes several key sections:

Overview Tab

  • Processing Summary: Files processed, success rates, errors

  • Quality Metrics: Average quality scores, sequence lengths

  • HVS Region Coverage: Which hypervariable regions were successfully processed

Damage Analysis Tab

  • Damage Assessment: Overall damage score and interpretation

  • Statistical Significance: Bootstrap analysis results (p-values)

  • Damage Patterns: Visual representation of C→T and G→A transitions

  • Quality Indicators: Confidence metrics for damage assessment
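To make the C→T and G→A damage patterns concrete, the sketch below counts both transition types between a reference and an aligned read. It illustrates the general idea only and is not the pipeline's own implementation. `count_damage_mismatches` is a hypothetical helper, and it assumes both strings are already aligned to the same length.

```python
def count_damage_mismatches(reference, read):
    """Count C->T and G->A mismatches between two aligned sequences."""
    if len(reference) != len(read):
        raise ValueError("sequences must be aligned to the same length")
    counts = {"C>T": 0, "G>A": 0, "C_sites": 0, "G_sites": 0}
    for ref_base, read_base in zip(reference.upper(), read.upper()):
        # Skip gaps and ambiguous calls in the read.
        if read_base in "-N":
            continue
        if ref_base == "C":
            counts["C_sites"] += 1
            if read_base == "T":
                counts["C>T"] += 1
        elif ref_base == "G":
            counts["G_sites"] += 1
            if read_base == "A":
                counts["G>A"] += 1
    return counts


if __name__ == "__main__":
    result = count_damage_mismatches("ACGTCCGG", "ATGTTCAG")
    print(result)
    if result["C_sites"]:
        print("C>T rate:", result["C>T"] / result["C_sites"])
```

Dividing each count by the number of reference C or G sites gives a per-site rate, which is easier to compare across samples than raw counts.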

Quality Control Tab

  • Sequence Quality: Distribution of Phred quality scores

  • Length Distribution: Sequence length statistics

  • Processing Efficiency: Success rates by processing stage

Sample Details Tab

  • Individual Results: Per-sample breakdown of all metrics

  • HVS Region Analysis: Detailed results for each hypervariable region

  • File Processing: Status and results for each input file

🔍 Common Scenarios

Scenario 1: Basic Analysis

You have AB1 files and want a standard analysis:

# Simple run with default settings
python -m src.sanger_pipeline.cli.main run-pipeline \
    --input-dir ./my_ab1_files \
    --output-dir ./results

Scenario 2: Custom Quality Threshold

You want stricter quality filtering:

# Copy and edit config
cp ../config/default_config.yaml my_config.yaml

# Edit quality_threshold in my_config.yaml (e.g., change to 25)

# Run with custom config
python -m src.sanger_pipeline.cli.main run-pipeline \
    --input-dir ./my_ab1_files \
    --output-dir ./results \
    --config ./my_config.yaml

Scenario 3: Ancient DNA Assessment

You specifically want to assess ancient DNA damage:

# Run pipeline with focus on damage analysis
python -m src.sanger_pipeline.cli.main run-pipeline \
    --input-dir ./ancient_samples \
    --output-dir ./ancient_results

# Generate detailed damage report
python -m src.sanger_pipeline.cli.main analyze-damage \
    --input-dir ./ancient_results/final \
    --output-dir ./ancient_results/damage_analysis

🛠️ Command Line Interface

Key Commands

run-pipeline: Complete analysis pipeline

python -m src.sanger_pipeline.cli.main run-pipeline [OPTIONS]

generate-report: Create QC reports

python -m src.sanger_pipeline.cli.main generate-report [OPTIONS]

analyze-damage: Damage analysis only

python -m src.sanger_pipeline.cli.main analyze-damage [OPTIONS]

status: Check pipeline status

python -m src.sanger_pipeline.cli.main status [OPTIONS]

Common Options

  • --input-dir: Directory containing AB1 files

  • --output-dir: Directory for results

  • --config: Configuration file path

  • --quality-threshold: Override quality threshold

  • --open-browser: Open report in browser automatically

  • --help: Show help for any command
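When the pipeline is driven from another script, the command line can be assembled in Python and passed to `subprocess`. The sketch below uses only the module path and options listed in this guide. `build_run_pipeline_command` and `run` are hypothetical helper names, and the sketch assumes the command is launched from a directory where `src.sanger_pipeline` is importable.

```python
import subprocess
import sys


def build_run_pipeline_command(input_dir, output_dir,
                               config=None, quality_threshold=None):
    """Build the argument list for the run-pipeline command."""
    cmd = [
        sys.executable, "-m", "src.sanger_pipeline.cli.main", "run-pipeline",
        "--input-dir", str(input_dir),
        "--output-dir", str(output_dir),
    ]
    # Optional flags are appended only when a value is supplied.
    if config is not None:
        cmd += ["--config", str(config)]
    if quality_threshold is not None:
        cmd += ["--quality-threshold", str(quality_threshold)]
    return cmd


def run(cmd):
    """Execute the command and raise CalledProcessError on a non-zero exit."""
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    command = build_run_pipeline_command("./input", "./output",
                                         quality_threshold=15)
    # Printed rather than executed, so the sketch is safe to run anywhere.
    print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems with directory names that contain spaces.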

🔧 Configuration Basics

The configuration file controls pipeline behavior. Key settings:

Quality Control

quality_threshold: 20        # Minimum Phred quality score
min_sequence_length: 50      # Minimum sequence length
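A Phred score Q corresponds to a base-calling error probability of 10^(-Q/10), so the default threshold of 20 means at most a 1% chance that a base call is wrong. The sketch below prints that conversion for a few thresholds. `phred_to_error_probability` is a hypothetical helper name, and how the pipeline applies the threshold to a whole read is not covered here.

```python
def phred_to_error_probability(quality):
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-quality / 10)


if __name__ == "__main__":
    for q in (15, 20, 25, 30):
        p = phred_to_error_probability(q)
        print(f"Q{q}: error probability {p:.4f}, accuracy {100 * (1 - p):.2f}%")
```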

Damage Analysis

damage_threshold: 0.05       # Significance threshold for damage
bootstrap_iterations: 10000  # Bootstrap analysis iterations
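For intuition about what the iteration count controls, the sketch below runs a generic percentile bootstrap on per-site damage observations (1 for a damaged site, 0 for an undamaged one). It is an illustration of the bootstrap technique under stated assumptions, not the pipeline's actual statistical procedure. `bootstrap_rate_ci` is a hypothetical helper, and using 0.05 as the interval's `alpha` is an assumption that simply mirrors the `damage_threshold` value above.

```python
import random


def bootstrap_rate_ci(observations, iterations=10000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of 0/1 observations."""
    if not observations:
        raise ValueError("need at least one observation")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(observations)
    rates = []
    for _ in range(iterations):
        # Resample n observations with replacement and record the damage rate.
        resample = rng.choices(observations, k=n)
        rates.append(sum(resample) / n)
    rates.sort()
    lower = rates[int((alpha / 2) * iterations)]
    upper = rates[min(iterations - 1, int((1 - alpha / 2) * iterations))]
    return lower, upper


if __name__ == "__main__":
    # Hypothetical data: 10 damaged sites out of 100 observed.
    sites = [1] * 10 + [0] * 90
    low, high = bootstrap_rate_ci(sites, iterations=2000)
    print(f"observed rate 0.10, 95% interval [{low:.2f}, {high:.2f}]")
```

More iterations make the interval endpoints more stable at the cost of run time, which is the trade-off behind a setting like `bootstrap_iterations`.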

HVS Regions

hvs_regions:
  HVS1:
    start: 16024
    end: 16365
  HVS2:
    start: 57
    end: 372
  HVS3:
    start: 438
    end: 574
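The coordinates above can be used to check which region a variant position falls in. The sketch below copies the values from the configuration and assumes the ranges are 1-based and inclusive, which this guide does not state explicitly. The helper names are hypothetical.

```python
# Start/end coordinates copied from the hvs_regions configuration above.
HVS_REGIONS = {
    "HVS1": (16024, 16365),
    "HVS2": (57, 372),
    "HVS3": (438, 574),
}


def region_length(name):
    """Length of a region, assuming inclusive start and end coordinates."""
    start, end = HVS_REGIONS[name]
    return end - start + 1


def region_for_position(position):
    """Return the name of the region containing the position, or None."""
    for name, (start, end) in HVS_REGIONS.items():
        if start <= position <= end:
            return name
    return None


if __name__ == "__main__":
    for name in HVS_REGIONS:
        print(name, region_length(name), "bp")
    print(region_for_position(16223))
    print(region_for_position(400))
```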

⚡ Performance Tips

For Large Datasets

  1. Use Quality Pre-filtering: Set appropriate quality thresholds to reduce processing time

  2. Monitor Memory Usage: Large datasets may require more RAM

  3. Batch Processing: Process samples in batches if memory is limited

For Ancient DNA

  1. Use Conservative Settings: Lower quality thresholds may be appropriate

  2. Focus on Damage Analysis: Use the damage analysis tools extensively

  3. Multiple Replicates: Analyze multiple extractions when possible

🆘 Quick Troubleshooting

Pipeline Fails to Start

# Check installation
python -c "from src.sanger_pipeline.core.pipeline import SangerPipeline"

# Check external dependencies
mafft --version

No AB1 Files Found

# Check file extensions and directory
ls -la input/

# Ensure files have .ab1 extension
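A quick script can flag files that look like trace files but do not carry the exact `.ab1` extension. This is a minimal sketch: `check_ab1_extensions` is a hypothetical helper, and treating `.AB1` or `.abi` as near misses is an assumption about common naming variants, not documented pipeline behavior.

```python
from pathlib import Path


def check_ab1_extensions(input_dir):
    """Split file names into exact .ab1 matches and likely near misses."""
    ok, suspicious = [], []
    for path in sorted(Path(input_dir).iterdir()):
        if not path.is_file():
            continue
        if path.suffix == ".ab1":
            ok.append(path.name)
        elif path.suffix.lower() in {".ab1", ".abi"}:
            # Probably a trace file, but the extension may need renaming.
            suspicious.append(path.name)
    return ok, suspicious


if __name__ == "__main__":
    input_dir = Path("./input")
    if input_dir.is_dir():
        found, near_misses = check_ab1_extensions(input_dir)
        print(f"{len(found)} .ab1 files found")
        for name in near_misses:
            print("check extension:", name)
    else:
        print("input directory not found:", input_dir)
```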

Quality Issues

# Lower quality threshold temporarily
python -m src.sanger_pipeline.cli.main run-pipeline \
    --input-dir ./input \
    --output-dir ./output \
    --quality-threshold 15

Memory Errors

# Process smaller batches
# Split AB1 files into smaller groups

# Monitor memory usage
top  # or htop on Linux
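Splitting the input into smaller groups can be scripted. The sketch below only computes the batches; copying each batch into its own input directory and running the pipeline on it is left out. `batch_ab1_files` is a hypothetical helper. Files are sorted by name, but nothing here guarantees that a sample's forward and reverse reads land in the same batch, so check the grouping before running.

```python
from pathlib import Path


def batch_ab1_files(input_dir, batch_size):
    """Return a list of batches, each a list of .ab1 paths sorted by name."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    files = sorted(
        p for p in Path(input_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".ab1"
    )
    return [files[i:i + batch_size] for i in range(0, len(files), batch_size)]


if __name__ == "__main__":
    input_dir = Path("./input")
    if input_dir.is_dir():
        for number, batch in enumerate(batch_ab1_files(input_dir, 20), start=1):
            print(f"batch {number}: {len(batch)} files")
    else:
        print("input directory not found:", input_dir)
```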

🎯 Next Steps

Now that you’ve run your first analysis:

  1. Explore Configuration: Configuration - Customize pipeline behavior

  2. Learn Advanced Features: Tutorials - Detailed tutorials

  3. Understand Damage Analysis: Understanding Damage Analysis - Deep dive into aDNA analysis

  4. API Reference: API Reference - For programmatic usage

  5. Troubleshooting: Troubleshooting - Solve common issues

🤝 Getting Help

  • Documentation: Browse these docs for detailed information

  • GitHub Issues: Report bugs or request features

  • Community: Join discussions and get help from other users

Congratulations! You’ve successfully run the Sanger DNA Damage Analysis Pipeline. The interactive QC report provides a comprehensive overview of your results, and you’re ready to dive deeper into ancient DNA analysis.