Understanding Damage Analysis¶

This guide explains ancient DNA damage analysis in the Sanger DNA Damage Analysis Pipeline: the scientific background, the detection methods, and how to interpret the results.

Danger

🚨 Critical Limitation: Authentication vs. Screening

This pipeline provides SURROGATE damage indicators for SCREENING purposes only.

The damage analysis in this tool:

✅ Good for: Sample prioritization, quality assessment, resource allocation decisions
✅ Good for: Identifying promising samples for NGS follow-up
✅ Good for: Preliminary haplogroup assessment based on sequence content
❌ NOT sufficient for: Definitive ancient DNA authentication
❌ NOT a replacement for: Proper NGS-based aDNA analysis with controls
❌ NOT suitable for: Publication-quality authentication without NGS validation

All serious aDNA authentication requires NGS with appropriate blank controls, phylogenetic analysis, and contamination assessment.

🧬 Scientific Background¶

Ancient DNA and Degradation¶

Ancient DNA (aDNA) undergoes characteristic chemical degradation over time, primarily through:

Depurination: Loss of purine bases (A and G), leaving apurinic sites where the DNA backbone breaks readily. This fragments the molecule, and intact apurinic sites block polymerase extension during PCR amplification.
Cytosine Deamination: Spontaneous hydrolytic deamination of cytosine to uracil, which is read as thymine during PCR, resulting in C→T transitions.
5-methylcytosine Deamination: Deamination of 5-methylcytosine to thymine, also causing C→T transitions, particularly common in CpG dinucleotides.

Characteristic Damage Patterns¶

Ancient DNA shows distinctive damage patterns:

5’ C→T Transitions: High frequency of C→T misincorporations at the 5’ ends of sequences
3’ G→A Transitions: High frequency of G→A misincorporations at the 3’ ends of sequences
Position-Dependent Damage: Damage rates decrease with distance from sequence ends
Strand Asymmetry: Different damage patterns on plus and minus strands

Note

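These patterns arise mainly from cytosine deamination, which proceeds much faster in the single-stranded overhangs at the ends of ancient fragments. The resulting uracils are read as thymine, giving C→T misincorporations near 5’ ends; G→A misincorporations near 3’ ends are the complementary-strand signature of the same process. Depurination contributes chiefly by fragmenting the DNA, which creates those ends.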

📊 Damage Detection Methods¶

Position-Based Analysis¶

The pipeline analyzes damage by examining base transitions at different positions:

5’ End Analysis (positions 1-20):

Count C→T transitions in each position
Calculate transition frequencies
Compare against background rates

3’ End Analysis (positions -20 to -1):

Count G→A transitions in each position
Calculate transition frequencies
Compare against background rates

Middle Positions (control):

Calculate baseline transition rates
Used for comparison with end positions
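The position-based counting above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's implementation: the function name `positional_rates`, its parameters, and the assumption that each read is available as a gap-free (reference, read) pair of equal length are all hypothetical.

```python
from collections import Counter

def positional_rates(pairs, ref_base, read_base, n_positions=20, from_end=False):
    """Per-position frequency of ref_base -> read_base substitutions.

    pairs: iterable of (reference, read) strings, equal length, no gaps
    (an assumption made for this sketch).
    Returns a list of length n_positions ordered from the sequence end
    inward; an entry is None when no reference base of the requested
    type was seen at that position.
    """
    mismatches = Counter()
    opportunities = Counter()
    for ref, read in pairs:
        if len(ref) != len(read):
            raise ValueError("reference and read must be the same length")
        if from_end:
            # Reverse so that index 0 corresponds to position -1
            ref, read = ref[::-1], read[::-1]
        for i, (r, q) in enumerate(zip(ref[:n_positions], read[:n_positions])):
            if r == ref_base:
                opportunities[i] += 1
                if q == read_base:
                    mismatches[i] += 1
    return [
        mismatches[i] / opportunities[i] if opportunities[i] else None
        for i in range(n_positions)
    ]

# Toy alignments, invented for illustration
pairs = [
    ("CACGT", "TACGT"),
    ("CCAGT", "CCAGT"),
    ("CGTAG", "TGTAA"),
]
five_prime_ct = positional_rates(pairs, "C", "T", n_positions=3)
three_prime_ga = positional_rates(pairs, "G", "A", n_positions=3, from_end=True)
```

Dividing by the number of reference C (or G) bases at each position, rather than by the number of reads, keeps the frequency from being diluted by positions where the substitution could not have occurred.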

Damage Score Calculation¶

The damage score is computed as:

\[\text{Damage Score} = \frac{(\text{5' C→T rate} + \text{3' G→A rate})}{2} - \text{Background rate}\]

Where:

5’ C→T rate: Average C→T frequency in positions 1-5
3’ G→A rate: Average G→A frequency in positions -5 to -1
Background rate: Average transition rate in middle positions

Score Interpretation:

0.0-0.2: Minimal damage (consistent with modern DNA)
0.2-0.4: Low-moderate damage (recent or well-preserved)
0.4-0.7: High damage (ancient DNA likely)
>0.7: Very high damage (strongly suggestive of ancient DNA, still subject to NGS confirmation)
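As a worked example of the formula, the sketch below computes the score from per-position frequencies. The function name, the `window` parameter, and the input values are hypothetical; both lists are assumed to be ordered from the sequence end inward, so index 0 is position 1 for the 5’ list and position -1 for the 3’ list.

```python
from statistics import mean

def damage_score(five_prime_ct, three_prime_ga, background_rate, window=5):
    """(mean 5' C->T rate + mean 3' G->A rate) / 2 - background rate.

    Rates are averaged over the `window` positions nearest each end.
    """
    if len(five_prime_ct) < window or len(three_prime_ga) < window:
        raise ValueError("need at least `window` positions at each end")
    ct_rate = mean(five_prime_ct[:window])
    ga_rate = mean(three_prime_ga[:window])
    return (ct_rate + ga_rate) / 2 - background_rate

# Hypothetical per-position frequencies, not pipeline output
score = damage_score(
    five_prime_ct=[0.30, 0.25, 0.20, 0.15, 0.10],
    three_prime_ga=[0.25, 0.20, 0.15, 0.10, 0.05],
    background_rate=0.02,
)
print(round(score, 3))
```

With these inputs the terminal averages are 0.20 and 0.15, so the score is (0.20 + 0.15) / 2 - 0.02 = 0.155.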

🎲 Statistical Validation¶

Bootstrap Analysis¶

The pipeline uses bootstrap resampling to assess statistical significance:

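Resampling: Create 10,000 bootstrap samples by drawing sequences with replacement from the original data
Recalculation: Calculate damage score for each bootstrap sample
Distribution: Build distribution of bootstrap damage scores
P-value: Estimate how often a resampled score shows no excess damage over background

Bootstrap Process:

import random

bootstrap_scores = []
for _ in range(10000):
    # Draw sequences with replacement, keeping the sample size fixed
    bootstrap_sample = random.choices(original_sequences, k=len(original_sequences))
    bootstrap_scores.append(calculate_damage_score(bootstrap_sample))

# Assumption: the null is "no excess damage" (score <= 0). Counting resampled
# scores >= observed_score would return roughly 0.5 for any sample.
p_value = sum(score <= 0 for score in bootstrap_scores) / len(bootstrap_scores)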
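A self-contained version of the resampling loop may make the mechanics easier to follow. It is a sketch under stated assumptions, not the pipeline's code: each sequence is reduced to a hypothetical tuple of counts `(end_hits, end_sites, mid_hits, mid_sites)`, the score is simplified to the pooled end rate minus the pooled middle rate, and the p-value is taken as the fraction of resampled scores at or below zero, which is one common bootstrap convention.

```python
import random

def pooled_score(records):
    """Pooled terminal mismatch rate minus pooled middle mismatch rate."""
    end_hits = sum(r[0] for r in records)
    end_sites = sum(r[1] for r in records)
    mid_hits = sum(r[2] for r in records)
    mid_sites = sum(r[3] for r in records)
    end_rate = end_hits / end_sites if end_sites else 0.0
    mid_rate = mid_hits / mid_sites if mid_sites else 0.0
    return end_rate - mid_rate

def bootstrap_scores(records, iterations=10000, seed=0):
    """Scores of `iterations` resamples drawn with replacement."""
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    n = len(records)
    return [pooled_score(rng.choices(records, k=n)) for _ in range(iterations)]

# Hypothetical per-sequence counts: (end_hits, end_sites, mid_hits, mid_sites)
records = [
    (2, 5, 1, 40),
    (1, 4, 0, 35),
    (3, 6, 1, 50),
    (0, 5, 0, 45),
    (2, 4, 1, 38),
]
observed = pooled_score(records)
scores = bootstrap_scores(records, iterations=2000)
# Fraction of resamples showing no excess damage over background
p_value = sum(s <= 0 for s in scores) / len(scores)
```

Resampling whole sequences, rather than individual bases, preserves the correlation between positions within a read. With only a handful of sequences, as in this toy input, the resulting distribution is coarse and the p-value should not be over-read.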

Significance Testing¶

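Null Hypothesis: Terminal transition rates do not exceed the background rate, as expected if mismatches arise only from random sequencing errors

Alternative Hypothesis: Terminal transition rates exceed the background rate, consistent with ancient DNA damage

P-value Interpretation:

p < 0.01: Highly significant terminal damage excess
p < 0.05: Significant terminal damage excess
p < 0.10: Marginally significant
p ≥ 0.10: Not significant (no detectable damage excess; consistent with modern DNA)

A significant result indicates a damage-like pattern. It does not by itself authenticate a sample, as the limitation at the top of this guide explains.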

🔍 Damage Pattern Recognition¶

Authentic Ancient DNA Patterns¶

Characteristic Features:

High 5’ C→T rates (>15% in first few positions)
High 3’ G→A rates (>10% in last few positions)
Exponential decay from sequence ends toward middle
Strand asymmetry (different patterns on forward/reverse strands)
Statistical significance (p < 0.05)

Visual Indicators in damage plots:

Sharp peaks at sequence ends
Gradual decline toward sequence middle
Clear asymmetry between 5’ and 3’ ends
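The characteristic features listed above can be turned into a simple checklist. The sketch below is illustrative only: the function name and the `terminal` window are hypothetical, "first few positions" is read as the first three, decay is tested crudely as "terminal mean above interior mean", and strand asymmetry is not checked at all.

```python
from statistics import mean

def looks_like_ancient_pattern(five_prime_ct, three_prime_ga, p_value,
                               terminal=3, alpha=0.05):
    """Return (all_passed, per_check_results).

    Both rate lists are ordered from the sequence end inward.
    """
    if min(len(five_prime_ct), len(three_prime_ga)) <= terminal:
        raise ValueError("need more positions than the terminal window")

    def terminal_and_interior(rates):
        return mean(rates[:terminal]), mean(rates[terminal:])

    ct_end, ct_inner = terminal_and_interior(five_prime_ct)
    ga_end, ga_inner = terminal_and_interior(three_prime_ga)
    checks = {
        "high_5_prime_ct": ct_end > 0.15,   # >15% near the 5' end
        "high_3_prime_ga": ga_end > 0.10,   # >10% near the 3' end
        "decays_from_ends": ct_end > ct_inner and ga_end > ga_inner,
        "significant": p_value < alpha,
    }
    return all(checks.values()), checks

passed, checks = looks_like_ancient_pattern(
    [0.21, 0.18, 0.15, 0.12, 0.09],
    [0.17, 0.14, 0.11, 0.08, 0.06],
    p_value=0.0234,
)
```

Returning the individual checks alongside the overall verdict makes it easier to see which criterion a borderline sample fails.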

Modern DNA Patterns¶

Characteristic Features:

Low transition rates (<5% across all positions)
Uniform distribution (no position-dependent effects)
Random error patterns (not systematically at ends)
No strand asymmetry
No statistical significance (p ≥ 0.05)

Contamination Patterns¶

Mixed Ancient/Modern:

Intermediate damage scores (0.2-0.4)
Irregular position-dependent patterns
Variable significance levels

Modern Contamination:

Lower damage scores than expected
Reduced statistical significance
Inconsistent patterns across samples

📈 Interpreting Damage Analysis Results¶

JSON Output Structure¶

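The damage analysis produces detailed JSON output. The values below are illustrative: the listed rates do not reproduce the listed damage_score under the formula in Damage Score Calculation, and each `...` stands for the remaining positions.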

{
  "sample_id": "sample001",
  "total_sequences": 245,
  "total_bases": 62847,
  "damage_score": 0.34,
  "p_value": 0.0234,
  "assessment": "Moderate damage detected",
  "significance": "statistically_significant",
  "c_to_t_rate": 0.187,
  "g_to_a_rate": 0.156,
  "background_rate": 0.034,
  "position_data": {
    "5_prime": [0.21, 0.18, 0.15, 0.12, 0.09, ...],
    "3_prime": [0.17, 0.14, 0.11, 0.08, 0.06, ...]
  },
  "bootstrap_stats": {
    "mean": 0.032,
    "std": 0.018,
    "confidence_95": [0.028, 0.036]
  }
}

Key Fields Explained:

damage_score: Overall damage assessment (0-1 scale)
p_value: Statistical significance of damage pattern
assessment: Human-readable damage interpretation
c_to_t_rate: C→T transition rate at 5’ ends
g_to_a_rate: G→A transition rate at 3’ ends
position_data: Damage rates by sequence position
bootstrap_stats: Statistical validation results
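Reading these fields programmatically might look like the sketch below. The field names come from the example output above; the `summarize` helper, its `alpha` parameter, and the trimmed report text (with the elided positions dropped so that it parses as valid JSON) are assumptions made for illustration.

```python
import json

# Trimmed copy of the example report, with the elided positions removed
report_text = """
{
  "sample_id": "sample001",
  "damage_score": 0.34,
  "p_value": 0.0234,
  "assessment": "Moderate damage detected",
  "position_data": {
    "5_prime": [0.21, 0.18, 0.15, 0.12, 0.09],
    "3_prime": [0.17, 0.14, 0.11, 0.08, 0.06]
  }
}
"""

def summarize(report, alpha=0.05):
    """Pull out the fields most useful for a first screening pass."""
    positions = report["position_data"]
    return {
        "sample_id": report["sample_id"],
        "damage_score": report["damage_score"],
        "significant": report["p_value"] < alpha,
        "peak_5_prime": max(positions["5_prime"]),
        "peak_3_prime": max(positions["3_prime"]),
    }

summary = summarize(json.loads(report_text))
```

A summary of this kind is meant for triage across many samples. The full position_data arrays remain the better basis for judging whether the decay from the sequence ends looks plausible.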

Result Classification¶

High Confidence Ancient DNA:

{
  "damage_score": 0.67,
  "p_value": 0.0001,
  "assessment": "High damage detected - likely ancient DNA"
}

Moderate Confidence:

{
  "damage_score": 0.28,
  "p_value": 0.0423,
  "assessment": "Moderate damage detected"
}

Low Confidence/Modern:

{
  "damage_score": 0.12,
  "p_value": 0.2567,
  "assessment": "Low damage detected - consistent with modern DNA"
}
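One way to express these three classes as a rule is sketched below. The cut-offs reuse the score bands and the 0.05 significance level given elsewhere in this guide, but how the pipeline actually combines score and p-value is not stated here, so the function and its labels are hypothetical.

```python
def classify_result(damage_score, p_value, alpha=0.05):
    """Map a (score, p-value) pair onto one of three screening classes."""
    # Without significance or with a score below 0.2, treat as low confidence
    if p_value >= alpha or damage_score < 0.2:
        return "low_confidence_or_modern"
    if damage_score >= 0.4:
        return "high_confidence"
    return "moderate_confidence"

# The three example results shown above
examples = [(0.67, 0.0001), (0.28, 0.0423), (0.12, 0.2567)]
labels = [classify_result(score, p) for score, p in examples]
```

Checking significance first means a high score with a weak p-value is never promoted, which matches the conservative approach recommended for borderline results.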

🎨 Visualization and Plots¶

Damage Profile Plots¶

The QC report includes interactive damage profile plots showing:

5’ End Damage (C→T transitions):

X-axis: Position from 5’ end (1-20)
Y-axis: C→T transition frequency (%)
Expected: High values at position 1, exponential decay

3’ End Damage (G→A transitions):

X-axis: Position from 3’ end (-20 to -1)
Y-axis: G→A transition frequency (%)
Expected: High values at position -1, exponential decay

Combined Damage Plot:

Shows both 5’ and 3’ patterns together
Highlights asymmetric damage patterns
Includes confidence intervals from bootstrap analysis

Statistical Validation Plots¶

Bootstrap Distribution:

Histogram of bootstrap damage scores
Observed score marked as vertical line
P-value calculation visualization

Confidence Intervals:

95% confidence intervals for damage estimates
Error bars on position-specific damage rates
Statistical significance indicators
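The confidence intervals shown in these plots can be derived from bootstrap values with a percentile interval. The sketch below uses a simple nearest-rank rule; the function name is hypothetical, and the pipeline may use a different interval method.

```python
import math

def percentile_interval(bootstrap_values, confidence_level=0.95):
    """Nearest-rank percentile interval of a list of bootstrap values."""
    if not bootstrap_values:
        raise ValueError("no bootstrap values supplied")
    if not 0 < confidence_level < 1:
        raise ValueError("confidence_level must be between 0 and 1")
    ordered = sorted(bootstrap_values)
    tail = (1 - confidence_level) / 2
    last = len(ordered) - 1
    # Round outward so the interval is never narrower than requested
    lower = ordered[math.floor(tail * last)]
    upper = ordered[math.ceil((1 - tail) * last)]
    return lower, upper

interval = percentile_interval(list(range(101)))
```

For the values 0 to 100 this returns (2, 98), cutting roughly 2.5% from each tail. Applied per position, the same function gives the error bars on position-specific damage rates.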

⚙️ Configuration Parameters¶

Damage Analysis Settings¶

Key configuration parameters that affect damage analysis:

# Damage analysis configuration
damage_threshold: 0.05        # P-value significance threshold
bootstrap_iterations: 10000   # Number of bootstrap samples
min_sequence_length: 50       # Minimum length for damage analysis
position_range: 20            # Positions to analyze from each end

# Advanced settings
background_region: [20, -20]  # Positions for background calculation
transition_types: ["C>T", "G>A"]  # Transition types to analyze
confidence_level: 0.95        # Confidence interval level

Parameter Effects:

Higher bootstrap_iterations: More precise p-values (slower)
Lower damage_threshold: More conservative significance testing
Larger position_range: Analyzes more positions from ends
Higher min_sequence_length: Excludes short, potentially unreliable sequences
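Before a run, it can help to sanity-check these settings. The sketch below validates a subset of the parameters listed above against their documented defaults. The `validate_config` helper is hypothetical, and the rule that `min_sequence_length` must exceed twice `position_range` is an assumption made here so that the two end windows leave room for a background region.

```python
DEFAULTS = {
    "damage_threshold": 0.05,
    "bootstrap_iterations": 10000,
    "min_sequence_length": 50,
    "position_range": 20,
    "confidence_level": 0.95,
}

def validate_config(overrides=None):
    """Merge overrides into the defaults and reject inconsistent values."""
    overrides = overrides or {}
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    config = {**DEFAULTS, **overrides}
    if not 0 < config["damage_threshold"] < 1:
        raise ValueError("damage_threshold must be between 0 and 1")
    if not 0 < config["confidence_level"] < 1:
        raise ValueError("confidence_level must be between 0 and 1")
    if config["bootstrap_iterations"] < 1 or config["position_range"] < 1:
        raise ValueError("iteration and position counts must be positive")
    # Assumed rule: both end windows must fit with bases left over
    if config["min_sequence_length"] <= 2 * config["position_range"]:
        raise ValueError("min_sequence_length must exceed 2 * position_range")
    return config

config = validate_config({"bootstrap_iterations": 5000})
```

Failing early on an inconsistent configuration is cheaper than discovering an empty background region after 10,000 bootstrap iterations.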

🧪 Quality Control and Validation¶

Internal Quality Checks¶

The pipeline performs several quality control checks:

Sequence Length Validation: Ensures sequences are long enough for reliable analysis
Coverage Assessment: Checks that sufficient positions have adequate coverage
Bootstrap Convergence: Verifies bootstrap analysis has converged
Outlier Detection: Identifies and flags unusual damage patterns

Quality Flags:

insufficient_coverage: Too few sequences or positions
short_sequences: Average sequence length below threshold
bootstrap_warning: Bootstrap analysis may be unreliable
outlier_pattern: Unusual damage distribution
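Two of these flags depend only on the input sequences and are easy to illustrate. In the sketch below, the flag names follow the list above, while the function itself and the `min_sequences` threshold of 10 are hypothetical; `bootstrap_warning` and `outlier_pattern` are left out because their criteria are not described here.

```python
from statistics import mean

def quality_flags(sequences, min_sequence_length=50, min_sequences=10):
    """Return the input-based quality flags raised for a set of sequences."""
    flags = []
    if len(sequences) < min_sequences:
        flags.append("insufficient_coverage")
    # The average length is only defined when at least one sequence exists
    if sequences and mean([len(s) for s in sequences]) < min_sequence_length:
        flags.append("short_sequences")
    return flags

good = quality_flags(["ACGT" * 20] * 12)
poor = quality_flags(["ACGT" * 5] * 3)
```

Here `good` is an empty list, while `poor` carries both flags because it has three sequences of 20 bases each.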

External Validation¶

Cross-Sample Consistency:

Compare damage patterns across samples from same context
Look for consistent archaeological/temporal patterns
Check for batch effects or processing artifacts

Positive/Negative Controls:

Include known ancient samples (positive controls)
Include modern DNA samples (negative controls)
Compare results with established methods

🔬 Advanced Interpretation¶

Age Estimation¶

While damage patterns can indicate ancient DNA, they cannot determine age. The bands below are rough heuristics, not a dating method:

General Guidelines:

Very high damage (>0.6): Likely >1000 years old
High damage (0.4-0.6): Potentially 500-1000 years
Moderate damage (0.2-0.4): Recent or well-preserved
Low damage (<0.2): Modern or extremely well-preserved

Warning

Age estimation from damage is approximate and depends on preservation conditions, temperature, pH, and other environmental factors.

Preservation Assessment¶

Damage patterns can indicate preservation quality:

Excellent Preservation:

Low damage despite age
Even coverage across regions
High sequence quality

Poor Preservation:

High damage relative to age
Uneven damage patterns
Low sequence quality

Variable Preservation:

Inconsistent damage across samples
Position-dependent quality variations
May indicate heterogeneous conditions

🎯 Best Practices¶

Sample Processing¶

Include Controls: Always process positive and negative controls
Replicate Extractions: Process multiple extractions when possible
Document Context: Record archaeological/environmental context
Blind Analysis: Analyze samples without knowing expected results

Data Interpretation¶

Consider Context: Interpret results in light of archaeological context
Multiple Lines of Evidence: Use damage analysis alongside other authenticity criteria
Conservative Approach: Be cautious with borderline results
Expert Review: Have results reviewed by experienced researchers

Reporting Standards¶

Full Disclosure: Report all samples, including failures
Method Details: Describe analysis parameters and settings
Statistical Results: Include p-values and confidence intervals
Visual Evidence: Provide damage profile plots
Raw Data: Make underlying data available for review

🚨 Common Pitfalls¶

Interpretation Errors¶

Over-interpretation:

Calling modern DNA “ancient” based on borderline damage
Ignoring statistical significance
Not considering preservation context

Under-interpretation:

Dismissing significant damage patterns
Requiring unrealistically high damage levels
Ignoring consistent patterns across samples

Technical Issues¶

Sample Preparation:

Contamination with modern DNA
PCR artifacts mimicking damage
Library preparation effects

Analysis Parameters:

Inappropriate quality thresholds
Insufficient bootstrap iterations
Wrong reference sequences

📚 Further Reading¶

Key Publications:

Briggs et al. (2007) - Patterns of damage in genomic DNA sequences from a Neandertal
Skoglund et al. (2014) - Separating endogenous ancient DNA from modern day contamination in a Siberian Neandertal
Jónsson et al. (2013) - mapDamage2.0: fast approximate Bayesian estimates of ancient DNA damage parameters

Related Methods:

mapDamage: Alternative damage analysis software
PMDtools: Authentication based on damage patterns
EAGER: Ancient DNA analysis pipeline

This comprehensive guide provides the theoretical background and practical knowledge needed to understand and interpret ancient DNA damage analysis results from the Sanger DNA Damage Analysis Pipeline.