Your First Analysis
This tutorial walks you through running your first complete analysis with the Sanger DNA Damage Analysis Pipeline. By the end, you'll have processed AB1 files and generated a comprehensive QC report.
Attention
Understanding Tool Purpose
This analysis provides preliminary screening results to help you:
Identify samples with promising damage patterns for NGS follow-up
Assess sequence quality and insert sizes
Prioritize haplogroups for further investigation
Make informed decisions about resource allocation
This is NOT a substitute for proper NGS-based aDNA authentication. Use these results to guide your NGS sample selection and experimental design.
Tutorial Goals
By completing this tutorial, you will:
✓ Run a complete pipeline analysis from start to finish
✓ Understand the output directory structure
✓ Generate and interpret an interactive QC report
✓ Assess ancient DNA damage patterns
✓ Know how to troubleshoot common issues
Estimated Time: 15-20 minutes
Prerequisites
Before starting this tutorial:
✓ Pipeline installed and tested (see Installation)
✓ 2-4 AB1 files (forward and reverse reads for 1-2 samples)
✓ Basic familiarity with the command line
✓ MAFFT installed and available on your PATH
Sample Data Setup
Create a tutorial workspace and copy your AB1 files into its input/ directory:
# Create tutorial workspace
mkdir sanger_tutorial
cd sanger_tutorial
# Create input directory
mkdir input
# Copy your AB1 files to input/ directory
# Files should be named like: sample1_F.ab1, sample1_R.ab1
Expected File Structure
Your input directory should look like:
input/
├── sample1_F.ab1    # Forward read for sample 1
├── sample1_R.ab1    # Reverse read for sample 1
├── sample2_F.ab1    # Forward read for sample 2 (optional)
└── sample2_R.ab1    # Reverse read for sample 2 (optional)
Note
The pipeline automatically detects forward (F) and reverse (R) reads based on filename patterns.
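The exact filename patterns the pipeline matches are not listed here, so the following is only a minimal sketch of how `_F`/`_R` suffix pairing could work for the naming scheme above. The helper `pair_reads` and its regular expression are illustrative assumptions, not part of the pipeline's API.

```python
import re
from collections import defaultdict

# Assumed naming convention from this tutorial: <sample>_F.ab1 / <sample>_R.ab1
READ_PATTERN = re.compile(r"^(?P<sample>.+)_(?P<direction>[FR])\.ab1$")


def pair_reads(filenames):
    """Group AB1 filenames by sample and read direction (hypothetical helper)."""
    pairs = defaultdict(dict)
    unmatched = []
    for name in filenames:
        match = READ_PATTERN.match(name)
        if match:
            pairs[match["sample"]][match["direction"]] = name
        else:
            unmatched.append(name)
    return dict(pairs), unmatched


pairs, unmatched = pair_reads(
    ["sample1_F.ab1", "sample1_R.ab1", "sample2_F.ab1", "notes.txt"]
)
print(pairs)      # sample1 has both reads, sample2 only a forward read
print(unmatched)  # files that do not follow the naming scheme
```

Running a check like this before the pipeline makes it easy to spot a sample that is missing one of its two reads.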
Step 1: Configuration Setup
Copy the default configuration to your working directory:
# Copy default configuration
cp /path/to/sanger_adna_damage/config/default_config.yaml ./tutorial_config.yaml
View and understand the configuration:
# View configuration
cat tutorial_config.yaml
You should see something like:
quality_threshold: 20
min_sequence_length: 50
damage_threshold: 0.05
bootstrap_iterations: 10000
hvs_regions:
  HVS1:
    start: 16024
    end: 16365
  HVS2:
    start: 57
    end: 372
  HVS3:
    start: 438
    end: 574
Configuration Explanation:
quality_threshold: 20 - Keep bases with a Phred quality of Q20 or higher (99% base-call accuracy)
min_sequence_length: 50 - Sequences must be at least 50bp after filtering
damage_threshold: 0.05 - P-value threshold for damage significance
bootstrap_iterations: 10000 - Number of bootstrap iterations used for the damage statistics
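The link between a Phred score and accuracy follows the standard definition Q = -10 * log10(P_error). The short sketch below works through that arithmetic for a few thresholds; the function names are illustrative only.

```python
import math


def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score for a given base-call error probability."""
    return -10 * math.log10(p)


for q in (15, 20, 30):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:.4f}, accuracy {100 * (1 - p):.2f}%")
```

A threshold of 20 therefore accepts bases with at most a 1 in 100 chance of being miscalled, while lowering it to 15 accepts roughly 3 in 100.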
Step 2: Run the Complete Pipeline
Now let's run the complete analysis:
# Run the complete pipeline
python -m src.sanger_pipeline.cli.main run-pipeline \
--input-dir ./input \
--output-dir ./output \
--config ./tutorial_config.yaml
What happens during processing:
AB1 Conversion: Binary AB1 files converted to FASTA format
Quality Filtering: Low-quality bases and sequences removed
Sequence Alignment: Forward and reverse reads aligned using MAFFT
Consensus Building: Consensus sequences created for each HVS region
HVS Merging: Available HVS regions combined into final sequences
Damage Analysis: Ancient DNA damage patterns analyzed
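To make the alignment and consensus steps more concrete, here is a deliberately simplified sketch. It assumes the two reads are already aligned to equal length (the pipeline uses MAFFT for that), and the merge rule shown (keep agreeing bases, fill gaps from the other read, write N on conflicts) is an assumption for illustration, not necessarily the rule the pipeline applies.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement so a reverse read matches forward orientation."""
    return seq.translate(COMPLEMENT)[::-1]


def simple_consensus(forward, reverse_rc):
    """Column-wise consensus of two already-aligned, equal-length sequences.

    Agreeing bases are kept, a gap or N in one read is filled from the other,
    and conflicting bases are written as N. Illustrative rule only.
    """
    if len(forward) != len(reverse_rc):
        raise ValueError("aligned sequences must have equal length")
    consensus = []
    for a, b in zip(forward.upper(), reverse_rc.upper()):
        if a == b:
            consensus.append(a)
        elif a in "-N" and b in "-N":
            consensus.append("N")
        elif a in "-N":
            consensus.append(b)
        elif b in "-N":
            consensus.append(a)
        else:
            consensus.append("N")
    # Columns where both reads have a gap carry no sequence and are dropped
    return "".join(consensus).replace("-", "")


forward_read = "ACGTTGCA-T"   # hypothetical aligned forward read
reverse_read = "ACTGCTACGT"   # hypothetical reverse read, as sequenced
print(simple_consensus(forward_read, reverse_complement(reverse_read)))
```

The reverse read is sequenced from the opposite strand, which is why it has to be reverse-complemented before the two reads can be compared position by position.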
Expected Output:
Starting Sanger pipeline...
Processing AB1 files...
✓ Converted sample1_F.ab1 to FASTA
✓ Converted sample1_R.ab1 to FASTA
✓ Quality filtering completed
✓ Sequence alignment completed
✓ Consensus sequences generated
✓ HVS regions merged
✓ Damage analysis completed
Pipeline completed successfully!
Step 3: Explore the Output Structure
After successful completion, examine your output directory:
# Explore the output structure
tree output/
You should see:
output/
├── fasta/                  # Raw FASTA conversions
│   ├── sample1_F.fasta
│   ├── sample1_R.fasta
│   └── ...
├── filtered/               # Quality-filtered sequences
│   ├── sample1_F_filtered.fasta
│   ├── sample1_R_filtered.fasta
│   └── ...
├── consensus/              # Consensus by HVS region
│   ├── sample1_HVS1_consensus.fasta
│   ├── sample1_HVS2_consensus.fasta
│   ├── sample1_HVS3_consensus.fasta
│   └── ...
├── aligned/                # Intermediate alignments
│   ├── sample1_aligned.fasta
│   └── ...
├── final/                  # Final merged sequences
│   ├── sample1_final.fasta
│   └── ...
├── damage_analysis/        # Ancient DNA damage analysis
│   ├── sample1_damage_analysis.json
│   └── ...
└── plots/                  # Quality visualizations
    ├── sample1_F_quality.png
    └── ...
Directory Explanations:
fasta/: Raw conversions from AB1 format
filtered/: Quality-filtered sequences ready for analysis
consensus/: Consensus sequences for each HVS region independently
final/: Your final processed sequences (main results)
damage_analysis/: Ancient DNA damage assessment results
plots/: Quality score visualizations
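If the `tree` command is not available, a few lines of Python give a comparable overview. This sketch simply counts the files in each stage directory named above; `summarize_output` is a hypothetical helper written for this tutorial, not a pipeline command.

```python
from pathlib import Path

# Stage directories as listed in this tutorial
STAGE_DIRS = ["fasta", "filtered", "consensus", "aligned", "final",
              "damage_analysis", "plots"]


def summarize_output(output_dir):
    """Count regular files in each expected stage directory (None if missing)."""
    root = Path(output_dir)
    summary = {}
    for stage in STAGE_DIRS:
        stage_path = root / stage
        if stage_path.is_dir():
            summary[stage] = sum(1 for p in stage_path.iterdir() if p.is_file())
        else:
            summary[stage] = None
    return summary


if __name__ == "__main__":
    for stage, count in summarize_output("./output").items():
        status = "missing" if count is None else f"{count} file(s)"
        print(f"{stage:<16} {status}")
```

A stage that reports "missing" or zero files points to the step where processing stopped.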
Step 4: Generate Interactive QC Report
Create a comprehensive QC report with visualizations:
# Generate interactive QC report
python -m src.sanger_pipeline.cli.main generate-report \
--output-dir ./output \
--open-browser
This command will:
Analyze all pipeline outputs
Generate statistical summaries
Create interactive visualizations
Open the report in your default browser
Expected Browser Output:
The report opens with several tabs:
Overview: Processing summary and key metrics
Damage Analysis: Ancient DNA damage assessment
Quality Control: Sequence quality distributions
Sample Details: Per-sample detailed results
Step 5: Interpret Your Results
Overview Tab Analysis
Look for these key metrics:
✓ Samples Processed: 2/2 (100%)
✓ Average Quality Score: 28.5
✓ HVS Regions Detected: HVS1, HVS2, HVS3
✓ Total Sequences: 4 (2 samples × 2 reads)
Good indicators:
High success rate (>90%)
Quality scores >20
Multiple HVS regions detected
Damage Analysis Tab
Key damage metrics to examine:
Damage Score: 0.23 (Low-Moderate)
P-value: 0.045 (Significant)
C→T Transitions: 12%
G→A Transitions: 8%
Interpretation:
Damage Score 0-0.3: Low damage (modern DNA or well-preserved)
Damage Score 0.3-0.7: Moderate damage (possible ancient DNA)
Damage Score >0.7: High damage (likely ancient DNA)
P-value <0.05: Statistically significant damage pattern
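This tutorial does not describe how the pipeline derives its p-value. As a rough illustration of what a resampling-based significance test can look like, the sketch below asks how often a background error rate alone would produce at least the observed number of C→T mismatches. The function name, the 5% background rate, and the counts are all hypothetical and are not taken from the pipeline.

```python
import random


def monte_carlo_p_value(observed, n_sites, background_rate, iterations=10000, seed=0):
    """One-sided p-value: P(simulated count >= observed) under the background rate."""
    rng = random.Random(seed)
    at_least_as_extreme = 0
    for _ in range(iterations):
        # Simulate one dataset in which every site mismatches independently
        simulated = sum(rng.random() < background_rate for _ in range(n_sites))
        if simulated >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0
    return (at_least_as_extreme + 1) / (iterations + 1)


# Hypothetical counts: 12 C->T mismatches at 100 cytosine positions,
# tested against an assumed 5% background rate
p = monte_carlo_p_value(observed=12, n_sites=100, background_rate=0.05, iterations=2000)
print(f"p = {p:.4f}")
```

More iterations give a more stable estimate at the cost of runtime, which is the usual trade-off behind a setting such as `bootstrap_iterations`.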
Quality Control Tab
Examine quality distributions:
Mean Quality: Should be >20 for reliable analysis
Length Distribution: Check if sequences meet minimum length
Processing Efficiency: High success rates indicate good data
Sample Details Tab
Per-sample breakdown shows:
Individual quality metrics
HVS region coverage
Damage assessment for each sample
File processing status
Step 6: Examine Individual Results
Look at specific output files:
Final Sequences
# View your final processed sequences
cat output/final/sample1_final.fasta
Example output:
>sample1_HVS1_HVS2_final
GATTTCACGGAGGATGGTGGTCAAGGGACCCCCCCTCCCCCATGCTTACAAGCAAGTACA...
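Beyond viewing the file with `cat`, it is often useful to check how long each final sequence is and how many ambiguous bases it contains. The following is a minimal standard-library sketch; `read_fasta` and `describe` are illustrative helpers, not pipeline functions.

```python
from pathlib import Path


def read_fasta(path):
    """Parse a FASTA file into a {header: sequence} dict (minimal stdlib parser)."""
    records = {}
    header = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                header = line[1:]
                records[header] = []
            elif header is not None:
                records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def describe(records):
    """Return (header, length, N count) for each sequence."""
    return [(name, len(seq), seq.upper().count("N")) for name, seq in records.items()]


if __name__ == "__main__":
    target = Path("output/final/sample1_final.fasta")
    if target.exists():
        for name, length, n_count in describe(read_fasta(target)):
            print(f"{name}: {length} bp, {n_count} ambiguous bases")
```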
Damage Analysis Results
# View damage analysis (formatted JSON)
python -c "
import json
with open('output/damage_analysis/sample1_damage_analysis.json') as f:
    data = json.load(f)
print(json.dumps(data, indent=2))
"
Example output:
{
"sample_id": "sample1",
"damage_score": 0.23,
"p_value": 0.045,
"c_to_t_rate": 0.12,
"g_to_a_rate": 0.08,
"assessment": "Low-Moderate damage detected",
"significance": "statistically_significant"
}
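With more than one sample, a side-by-side table is easier to read than individual JSON files. This sketch assumes every result file follows the `<sample>_damage_analysis.json` naming and carries the fields shown in the example output above; the two helper functions are hypothetical.

```python
import json
from pathlib import Path


def load_damage_results(damage_dir):
    """Load every *_damage_analysis.json file in damage_dir, sorted by filename."""
    results = []
    for path in sorted(Path(damage_dir).glob("*_damage_analysis.json")):
        with open(path) as handle:
            results.append(json.load(handle))
    return results


def format_table(results):
    """Render selected fields as aligned text rows; missing keys show as 'NA'."""
    fields = ["sample_id", "damage_score", "p_value", "c_to_t_rate", "g_to_a_rate"]
    rows = [fields] + [[str(r.get(f, "NA")) for f in fields] for r in results]
    widths = [max(len(row[i]) for row in rows) for i in range(len(fields))]
    return "\n".join(
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in rows
    )


if __name__ == "__main__":
    print(format_table(load_damage_results("output/damage_analysis")))
```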
Quality Plots
# View quality plots (requires an image viewer)
open output/plots/sample1_F_quality.png # macOS
# or
xdg-open output/plots/sample1_F_quality.png # Linux
Step 7: Understanding Your Results
Success Indicators
Your analysis was successful if:
✓ All AB1 files were converted to FASTA
✓ Quality filtering produced sequences of at least 50bp
✓ At least one HVS region was successfully processed
✓ Final sequences were generated
✓ The QC report was generated without errors
Ancient DNA Assessment
Based on damage analysis results:
Modern DNA Indicators:
Damage score <0.2
P-value >0.05 (not significant)
Low C→T and G→A rates (<5%)
Ancient DNA Indicators:
Damage score >0.3
P-value <0.05 (significant)
Elevated C→T and G→A rates (>10%)
Borderline Cases:
Damage score 0.2-0.3
May need additional validation
Consider sample preservation conditions
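The indicator lists above can be turned into a small triage function. The sketch below makes one assumption that the tutorial does not state: a sample is labelled ancient-like or modern-like only when all three indicators agree, and everything else is treated as borderline. The function name and labels are illustrative.

```python
def assess_damage(damage_score, p_value, c_to_t_rate, g_to_a_rate):
    """Apply the tutorial's indicator thresholds; mixed signals are 'borderline'."""
    significant = p_value < 0.05
    if damage_score > 0.3 and significant and min(c_to_t_rate, g_to_a_rate) > 0.10:
        return "ancient-like"
    if damage_score < 0.2 and not significant and max(c_to_t_rate, g_to_a_rate) < 0.05:
        return "modern-like"
    return "borderline"


# Values from the example damage report in this tutorial
print(assess_damage(0.23, 0.045, 0.12, 0.08))
```

Whatever label comes out, it remains a screening result for prioritizing samples and does not replace NGS-based authentication.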
Step 8: Check Pipeline Status
Get a summary of pipeline results:
# Check overall pipeline status
python -m src.sanger_pipeline.cli.main status \
--output-dir ./output
Expected output:
Pipeline Status Report
=====================
Input Files: 4 AB1 files detected
✓ FASTA Conversion: 4/4 successful
✓ Quality Filtering: 4/4 passed
✓ Consensus Building: 6/6 HVS regions processed
✓ Final Sequences: 2/2 samples completed
✓ Damage Analysis: 2/2 samples analyzed
HVS Region Coverage:
- HVS1: 100% (2/2 samples)
- HVS2: 100% (2/2 samples)
- HVS3: 50% (1/2 samples)
Quality Summary:
- Average Quality Score: 28.5
- Average Sequence Length: 245bp
- Overall Success Rate: 100%
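The HVS region coverage can also be cross-checked directly from the files in `consensus/`. This sketch relies only on the `<sample>_<region>_consensus.fasta` naming shown in the output listing; `hvs_coverage` is a hypothetical helper, not the code behind the `status` command.

```python
import re
from collections import defaultdict
from pathlib import Path

# Filename pattern taken from the consensus/ listing in this tutorial
CONSENSUS_PATTERN = re.compile(r"^(?P<sample>.+)_(?P<region>HVS\d)_consensus\.fasta$")


def hvs_coverage(consensus_dir):
    """Map each HVS region to the sorted list of samples that have a consensus file."""
    coverage = defaultdict(set)
    for path in Path(consensus_dir).glob("*_consensus.fasta"):
        match = CONSENSUS_PATTERN.match(path.name)
        if match:
            coverage[match["region"]].add(match["sample"])
    return {region: sorted(samples) for region, samples in sorted(coverage.items())}


if __name__ == "__main__":
    for region, samples in hvs_coverage("output/consensus").items():
        print(f"{region}: {len(samples)} sample(s): {', '.join(samples)}")
```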
Troubleshooting Common Issues
Issue 1: No AB1 Files Found
Error: No AB1 files found in input directory
Solution:
# Check file extensions and naming
ls -la input/
# Ensure files have .ab1 extension
# Rename if necessary:
mv sample1_forward.AB1 sample1_F.ab1
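When a directory holds many files, a quick scan for extension-case problems saves manual checking. The sketch below only reports candidates and renames nothing; it assumes the pipeline expects a lowercase `.ab1` extension, as the rename example above suggests, and `find_misnamed_ab1` is a hypothetical helper.

```python
from pathlib import Path


def find_misnamed_ab1(input_dir):
    """List files whose extension is .ab1 in the wrong letter case (e.g. .AB1)."""
    suspects = []
    for path in sorted(Path(input_dir).iterdir()):
        if not path.is_file():
            continue
        suffix = path.suffix
        if suffix.lower() == ".ab1" and suffix != ".ab1":
            suspects.append(path.name)
    return suspects


if __name__ == "__main__":
    input_dir = Path("./input")
    if input_dir.is_dir():
        for name in find_misnamed_ab1(input_dir):
            print(f"Check extension case: {name}")
```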
Issue 2: Quality Filtering Removes All Sequences
Error: No sequences passed quality filtering
Solution:
# Lower quality threshold temporarily
python -m src.sanger_pipeline.cli.main run-pipeline \
--input-dir ./input \
--output-dir ./output_lowq \
--quality-threshold 15
Issue 3: MAFFT Not Found
Error: MAFFT executable not found
Solution:
# Check MAFFT installation
mafft --version
# Install if missing
# macOS: brew install mafft
# Ubuntu: sudo apt install mafft
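The same check can be made from Python, which is convenient because it uses the PATH that the Python process actually sees. This is a minimal sketch using the standard library's `shutil.which`; the wrapper name `find_executable` is illustrative.

```python
import shutil


def find_executable(name="mafft"):
    """Return the full path of an executable on PATH, or None if it is not found."""
    return shutil.which(name)


if __name__ == "__main__":
    mafft_path = find_executable("mafft")
    if mafft_path is None:
        print("mafft is not on PATH; install it or adjust PATH before running the pipeline")
    else:
        print(f"mafft found at {mafft_path}")
```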
Issue 4: Empty Final Sequences
Problem: Final sequences are very short or missing
Diagnosis:
# Check intermediate results
ls -la output/filtered/
cat output/filtered/sample1_F_filtered.fasta
Solution: Lower quality threshold or check input data quality
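Before lowering the threshold, it helps to see how many bases would survive at each setting. The sketch below uses made-up per-base Phred scores, and `fraction_passing` is a hypothetical helper that only illustrates the comparison.

```python
def fraction_passing(quality_scores, threshold):
    """Fraction of bases whose Phred score is at least `threshold`."""
    if not quality_scores:
        return 0.0
    return sum(q >= threshold for q in quality_scores) / len(quality_scores)


# Hypothetical per-base Phred scores for one read
scores = [8, 12, 18, 22, 25, 31, 35, 40, 38, 27, 19, 14]
for threshold in (15, 20, 25, 30):
    print(f"Q{threshold}: {fraction_passing(scores, threshold):.0%} of bases pass")
```

If Biopython is installed, real per-base scores for an AB1 file can be read with `SeqIO.read(path, "abi").letter_annotations["phred_quality"]` and passed to the same function.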
Congratulations!
You've successfully completed your first Sanger DNA analysis! You now have:
✓ Processed AB1 files into high-quality sequences
✓ Generated consensus sequences for HVS regions
✓ Assessed ancient DNA damage patterns
✓ Created a comprehensive QC report
✓ Learned to interpret key results
Next Steps
Now that you've completed your first analysis:
Explore Advanced Features: Try ancient_dna_workflow for specialized ancient DNA analysis
Customize Configuration: Learn about Configuration options
Batch Processing: Process multiple samples with batch_processing
Understand Damage Analysis: Deep dive into damage_assessment
Practice Exercises
To reinforce your learning:
Try Different Quality Thresholds: Re-run with quality thresholds of 15, 25, and 30
Analyze Different Sample Types: Process both modern and potentially ancient samples
Compare Results: Run the same samples with different configurations
Explore Command Options: Try different CLI commands and options
Key Takeaways
The pipeline processes AB1 files through multiple quality-controlled stages
Interactive QC reports provide comprehensive analysis summaries
Damage analysis helps identify ancient DNA patterns
Configuration files allow customization for different sample types
Each processing stage has specific outputs that can be examined individually
You're now ready to tackle more complex analyses and explore the advanced features of the Sanger DNA Damage Analysis Pipeline!