Quick Start Guide
This guide will get you up and running with the Sanger DNA Damage Analysis Pipeline in just a few minutes.
Step 3: Run the Pipeline
Standard Pipeline:
# Run the complete pipeline
python scripts/run_pipeline.py run-pipeline \
--input-dir ./input \
--output-dir ./output_q30 \
--quality 30 \
--verbose
Enhanced Quality Control Pipeline (Recommended for aDNA):
# Step 1: Run standard pipeline
python scripts/run_pipeline.py run-pipeline \
--input-dir ./input \
--output-dir ./output_q30 \
--quality 30 \
--verbose
# Step 2: Apply enhanced quality control
python enhanced_hsd_converter.py
# Results will include:
# - output_q30_final_cleaned.fasta: Cleaned sequences
# - output_q30_final_high_quality.hsd: High-quality HSD file
# - Diversity analysis report
Step 4: Generate Reports
# Generate comprehensive HTML report
python generate_report.py ./output_q30
Enhanced Quality Control Features
The enhanced pipeline provides advanced quality control designed specifically for ancient DNA:
What It Does
Artifact Removal: Eliminates common aDNA sequencing artifacts
Quality Filtering: Applies a 70% quality threshold by default
Diversity Analysis: Comprehensive genetic diversity assessment
Sample Prioritization: Identifies the highest-quality samples automatically
When to Use
Use the enhanced pipeline when:
You are working with ancient DNA samples
You need optimal HSD files for haplogroup analysis
You want a comprehensive quality assessment
You require sample prioritization for downstream analysis
Quality Metrics
The enhanced pipeline provides detailed metrics:
Variant Statistics: Range and distribution of variants per sample
Sample Similarity: Genetic similarity analysis between samples
Quality Flags: Automatic detection of potential quality issues
Retention Rates: Percentage of samples passing quality filters
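The variant statistics listed above can be approximated by hand from an HSD file. The sketch below is not the pipeline's own implementation. It assumes the common tab-separated HSD layout (sample ID, range, haplogroup, then one variant per column), and the function names `variant_counts` and `summarize` are hypothetical.

```python
from statistics import mean


def variant_counts(hsd_text):
    """Return {sample_id: number_of_variants} for tab-separated HSD text.

    Assumed layout (not confirmed by this guide): sample ID, range,
    haplogroup, then one variant per remaining column.
    """
    counts = {}
    for line in hsd_text.splitlines():
        fields = [field.strip() for field in line.split("\t")]
        # Skip blank or short lines and an optional header row.
        if len(fields) < 3 or fields[0].lower() in ("sampleid", "sample_id", "id"):
            continue
        counts[fields[0]] = sum(1 for field in fields[3:] if field)
    return counts


def summarize(counts):
    """Summarize the range and mean of variants per sample."""
    values = list(counts.values())
    if not values:
        return {"samples": 0}
    return {
        "samples": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
    }


# Made-up example data, for illustration only.
example = (
    "SampleId\tRange\tHaplogroup\tPolymorphisms\n"
    "sample1\t16024-16365;57-372\t?\t16223T\t73G\t263G\n"
    "sample2\t16024-16365;57-372\t?\t16519C\n"
)
print(summarize(variant_counts(example)))
```

To apply it to real output, read the file first, for example `Path("output_q30_final_high_quality.hsd").read_text()`, and pass the text to `variant_counts`.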
Warning
Important: Tool Purpose & Scope
This pipeline is designed for preliminary screening and haplogroup prioritization, not for definitive ancient DNA authentication. Use this tool to:
Prioritize promising samples for NGS analysis
Assess sequence quality and damage indicators
Guide resource allocation decisions
Definitive aDNA authentication requires NGS-based methods with proper controls and contamination assessment.
5-Minute Quick Start
Prerequisites
Python 3.8+ installed
MAFFT installed (see Installation for details)
AB1 sequencing files ready to analyze
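The three prerequisites can be checked before installing anything else. This is a minimal sketch rather than part of the pipeline: the function name `check_prerequisites` is hypothetical, and it assumes MAFFT is exposed as a `mafft` executable on `PATH` and that AB1 files sit directly inside the input directory.

```python
import shutil
import sys
from pathlib import Path


def check_prerequisites(input_dir="input", min_python=(3, 8)):
    """Return a list of problems; an empty list means every check passed."""
    problems = []

    # Python 3.8+ is required by the guide.
    if sys.version_info < min_python:
        found = ".".join(str(part) for part in sys.version_info[:3])
        problems.append(f"Python {min_python[0]}.{min_python[1]}+ required, found {found}")

    # Assumption: MAFFT is installed as an executable named 'mafft'.
    if shutil.which("mafft") is None:
        problems.append("MAFFT executable 'mafft' not found on PATH")

    # Match the .ab1 extension case-insensitively (.AB1 is common on Windows).
    directory = Path(input_dir)
    ab1_files = []
    if directory.is_dir():
        ab1_files = [p for p in directory.iterdir() if p.suffix.lower() == ".ab1"]
    if not ab1_files:
        problems.append(f"No AB1 files found in {directory}")

    return problems


for problem in check_prerequisites():
    print("WARNING:", problem)
```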
Step 1: Installation
# Clone and install
git clone https://github.com/allyssonallan/sanger_adna_damage.git
cd sanger_adna_damage
# Set up environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
pip install -e .
Step 2: Prepare Your Data
# Create project directory
mkdir my_analysis
cd my_analysis
# Create input directory and add your AB1 files
mkdir input
# Copy your .ab1 files to the input/ directory
Step 3: Run the Pipeline
# Run complete analysis
python -m src.sanger_pipeline.cli.main run-pipeline \
--input-dir ./input \
--output-dir ./output \
--config ../config/default_config.yaml
Step 4: View Results
# Generate interactive QC report
python -m src.sanger_pipeline.cli.main generate-report \
--output-dir ./output \
--open-browser
That's it! Your browser will open an interactive report showing all your results.
Understanding Your Results
Output Directory Structure
After running the pipeline, your output directory will contain:
output/
├── fasta/               # Raw FASTA conversions from AB1
│   ├── sample1_F.fasta
│   ├── sample1_R.fasta
│   └── ...
├── filtered/            # Quality-filtered sequences
│   ├── sample1_F_filtered.fasta
│   ├── sample1_R_filtered.fasta
│   └── ...
├── consensus/           # Consensus sequences by HVS region
│   ├── sample1_HVS1_consensus.fasta
│   ├── sample1_HVS2_consensus.fasta
│   ├── sample1_HVS3_consensus.fasta
│   └── ...
├── final/               # Final merged sequences
│   ├── sample1_final.fasta
│   └── ...
├── damage_analysis/     # Ancient DNA damage analysis
│   ├── sample1_damage_analysis.json
│   └── ...
├── plots/               # Quality score visualizations
│   ├── sample1_F_quality.png
│   └── ...
└── reports/             # Interactive HTML reports
    └── qc_report_TIMESTAMP.html
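A quick way to confirm that a run produced every stage is to count the files in each subdirectory shown above. The helper below is a sketch: `summarize_output` is a hypothetical name and is not part of the pipeline's API. It relies only on the directory names listed in the tree.

```python
from pathlib import Path

# Subdirectory names taken from the output tree in this guide.
EXPECTED_SUBDIRS = [
    "fasta",
    "filtered",
    "consensus",
    "final",
    "damage_analysis",
    "plots",
    "reports",
]


def summarize_output(output_dir):
    """Map each expected subdirectory to its file count, or None if it is missing."""
    root = Path(output_dir)
    summary = {}
    for name in EXPECTED_SUBDIRS:
        subdir = root / name
        if subdir.is_dir():
            summary[name] = sum(1 for path in subdir.iterdir() if path.is_file())
        else:
            summary[name] = None
    return summary


for name, count in summarize_output("./output").items():
    status = "missing" if count is None else f"{count} file(s)"
    print(f"{name:16s} {status}")
```

A `missing` entry or a count of zero for a stage usually points at the step where processing stopped.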
Key Result Files
Final Sequences (final/ directory): Your processed consensus sequences, ready for downstream analysis
Damage Analysis (damage_analysis/ directory): JSON files containing ancient DNA damage assessments and statistics
QC Reports (reports/ directory): Interactive HTML reports with comprehensive analysis summaries
Interpreting the QC Report
The interactive QC report includes several key sections:
Overview Tab
Processing Summary: Files processed, success rates, errors
Quality Metrics: Average quality scores, sequence lengths
HVS Region Coverage: Which hypervariable regions were successfully processed
Damage Analysis Tab
Damage Assessment: Overall damage score and interpretation
Statistical Significance: Bootstrap analysis results (p-values)
Damage Patterns: Visual representation of C→T and G→A transitions
Quality Indicators: Confidence metrics for damage assessment
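To make the C→T and G→A patterns in this tab concrete, the sketch below counts both substitution types between a reference and a sample sequence that have already been aligned to the same length. It illustrates the idea only and is not how the pipeline computes its damage score. The function name `count_transitions` and the example sequences are made up.

```python
def count_transitions(reference, sample):
    """Count C>T and G>A mismatches between two aligned, equal-length sequences."""
    if len(reference) != len(sample):
        raise ValueError("Sequences must be aligned to the same length")

    counts = {"C>T": 0, "G>A": 0, "other": 0, "compared": 0}
    for ref_base, obs_base in zip(reference.upper(), sample.upper()):
        # Ignore gaps and ambiguous bases such as '-' or 'N'.
        if ref_base not in "ACGT" or obs_base not in "ACGT":
            continue
        counts["compared"] += 1
        if ref_base == obs_base:
            continue
        if ref_base == "C" and obs_base == "T":
            counts["C>T"] += 1
        elif ref_base == "G" and obs_base == "A":
            counts["G>A"] += 1
        else:
            counts["other"] += 1
    return counts


# Toy sequences, for illustration only.
result = count_transitions("ACGTCCGGA-T", "ATGTTCGAANT")
print(result)
```

In authentic ancient DNA, C→T changes tend to accumulate near the 5' ends of molecules and G→A changes near the 3' ends, so position along the read matters as well as the raw counts.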
Quality Control Tab
Sequence Quality: Distribution of Phred quality scores
Length Distribution: Sequence length statistics
Processing Efficiency: Success rates by processing stage
Sample Details Tab
Individual Results: Per-sample breakdown of all metrics
HVS Region Analysis: Detailed results for each hypervariable region
File Processing: Status and results for each input file
Common Scenarios
Scenario 1: Basic Analysis
You have AB1 files and want a standard analysis:
# Simple run with default settings
python -m src.sanger_pipeline.cli.main run-pipeline \
--input-dir ./my_ab1_files \
--output-dir ./results
Scenario 2: Custom Quality Threshold
You want stricter quality filtering:
# Copy and edit config
cp ../config/default_config.yaml my_config.yaml
# Edit quality_threshold in my_config.yaml (e.g., change to 25)
# Run with custom config
python -m src.sanger_pipeline.cli.main run-pipeline \
--input-dir ./my_ab1_files \
--output-dir ./results \
--config ./my_config.yaml
Scenario 3: Ancient DNA Assessment
You specifically want to assess ancient DNA damage:
# Run pipeline with focus on damage analysis
python -m src.sanger_pipeline.cli.main run-pipeline \
--input-dir ./ancient_samples \
--output-dir ./ancient_results
# Generate detailed damage report
python -m src.sanger_pipeline.cli.main analyze-damage \
--input-dir ./ancient_results/final \
--output-dir ./ancient_results/damage_analysis
Command Line Interface
Key Commands
run-pipeline: Complete analysis pipeline
python -m src.sanger_pipeline.cli.main run-pipeline [OPTIONS]
generate-report: Create QC reports
python -m src.sanger_pipeline.cli.main generate-report [OPTIONS]
analyze-damage: Damage analysis only
python -m src.sanger_pipeline.cli.main analyze-damage [OPTIONS]
status: Check pipeline status
python -m src.sanger_pipeline.cli.main status [OPTIONS]
Common Options
--input-dir: Directory containing AB1 files
--output-dir: Directory for results
--config: Configuration file path
--quality-threshold: Override quality threshold
--open-browser: Open report in browser automatically
--help: Show help for any command
Configuration Basics
The configuration file controls pipeline behavior. Key settings:
Quality Control
quality_threshold: 20 # Minimum Phred quality score
min_sequence_length: 50 # Minimum sequence length
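The `quality_threshold` value is a Phred score, which maps to a base-calling error probability through the standard relation P = 10^(-Q/10). The sketch below shows what a given threshold means in practice. The helper names are hypothetical and the example scores are made up.

```python
def phred_to_error_probability(quality):
    """Convert a Phred quality score Q to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-quality / 10)


def fraction_passing(qualities, threshold=20):
    """Return the fraction of base calls with a quality at or above the threshold."""
    if not qualities:
        return 0.0
    passing = sum(1 for quality in qualities if quality >= threshold)
    return passing / len(qualities)


for q in (15, 20, 30):
    print(f"Q{q}: error probability {phred_to_error_probability(q):.4f}")

# Made-up per-base scores for a single read.
print(fraction_passing([10, 20, 30, 40], threshold=20))
```

Moving the threshold from 20 to 30 lowers the tolerated error rate from 1 in 100 to 1 in 1,000, which is why stricter settings discard more of a degraded read.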
Damage Analysis
damage_threshold: 0.05 # Significance threshold for damage
bootstrap_iterations: 10000 # Bootstrap analysis iterations
HVS Regions
hvs_regions:
HVS1:
start: 16024
end: 16365
HVS2:
start: 57
end: 372
HVS3:
start: 438
end: 574
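The region boundaries above can be used to decide which hypervariable region a mitochondrial position belongs to. The sketch below copies the coordinates from the configuration. It assumes the boundaries are inclusive, 1-based reference positions, which this guide does not state explicitly, and `region_for_position` is a hypothetical helper.

```python
# Coordinates copied from the hvs_regions configuration block.
HVS_REGIONS = {
    "HVS1": (16024, 16365),
    "HVS2": (57, 372),
    "HVS3": (438, 574),
}


def region_for_position(position):
    """Return the HVS region containing a position, or None if it is outside all of them."""
    for name, (start, end) in HVS_REGIONS.items():
        # Assumption: both boundaries are inclusive.
        if start <= position <= end:
            return name
    return None


for position in (16223, 73, 500, 400):
    print(position, region_for_position(position))
```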
Performance Tips
For Large Datasets
Use Quality Pre-filtering: Set appropriate quality thresholds to reduce processing time
Monitor Memory Usage: Large datasets may require more RAM
Batch Processing: Process samples in batches if memory is limited
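Batch processing works best when the forward and reverse reads of a sample stay in the same batch. The sketch below groups file names by sample before splitting them. It assumes names follow the `sample1_F` / `sample1_R` pattern seen in the output tree, so that everything before the last underscore identifies the sample. Both function names are hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def group_by_sample(filenames):
    """Group file names by sample ID (assumed: the text before the last underscore)."""
    groups = defaultdict(list)
    for name in sorted(filenames):
        stem = Path(name).stem
        sample_id = stem.rsplit("_", 1)[0] if "_" in stem else stem
        groups[sample_id].append(name)
    return dict(groups)


def make_batches(filenames, samples_per_batch):
    """Split files into batches without separating reads from the same sample."""
    groups = group_by_sample(filenames)
    sample_ids = sorted(groups)
    batches = []
    for start in range(0, len(sample_ids), samples_per_batch):
        batch = []
        for sample_id in sample_ids[start:start + samples_per_batch]:
            batch.extend(groups[sample_id])
        batches.append(batch)
    return batches


# Made-up file names, for illustration only.
files = ["s1_F.ab1", "s1_R.ab1", "s2_F.ab1", "s2_R.ab1", "s3_F.ab1"]
for index, batch in enumerate(make_batches(files, samples_per_batch=2), start=1):
    print(f"batch {index}: {batch}")
```

Each batch can then be copied into its own input directory and run with a separate `--output-dir`.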
For Ancient DNA
Use Conservative Settings: Lower quality thresholds may be appropriate
Focus on Damage Analysis: Use the damage analysis tools extensively
Multiple Replicates: Analyze multiple extractions when possible
Quick Troubleshooting
Pipeline Fails to Start
# Check installation
python -c "from src.sanger_pipeline.core.pipeline import SangerPipeline"
# Check external dependencies
mafft --version
No AB1 Files Found
# Check file extensions and directory
ls -la input/
# Ensure files have .ab1 extension
Quality Issues
# Lower quality threshold temporarily
python -m src.sanger_pipeline.cli.main run-pipeline \
--input-dir ./input \
--output-dir ./output \
--quality-threshold 15
Memory Errors
# Process smaller batches
# Split AB1 files into smaller groups
# Monitor memory usage
top # or htop on Linux
Next Steps
Now that you've run your first analysis:
Explore Configuration: Configuration - Customize pipeline behavior
Learn Advanced Features: Tutorials - Detailed tutorials
Understand Damage Analysis: Understanding Damage Analysis - Deep dive into aDNA analysis
API Reference: API Reference - For programmatic usage
Troubleshooting: Troubleshooting - Solve common issues
Getting Help
Documentation: Browse these docs for detailed information
GitHub Issues: Report bugs or request features
Community: Join discussions and get help from other users
Congratulations! You've successfully run the Sanger DNA Damage Analysis Pipeline. The interactive QC report provides a comprehensive overview of your results, and you're ready to dive deeper into ancient DNA analysis.