Process Single Sample
Goal: Process a single AB1 sample pair (forward and reverse reads) through the complete pipeline to generate final consensus sequences and damage assessment.
Level: 🟢 Beginner
Time: 10-15 minutes
Prerequisites:
* Pipeline installed and configured
* One sample with forward and reverse AB1 files
* Basic command line familiarity
Sample Preparation
Expected File Structure
For this guide, we'll process a single sample called "sample001":
project/
├── input/
│   ├── sample001_F.ab1    # Forward read
│   └── sample001_R.ab1    # Reverse read
└── output/                # Will be created
File Naming Requirements
The pipeline automatically detects paired reads based on filename patterns:
Forward reads: Must contain _F, _forward, or _1
Reverse reads: Must contain _R, _reverse, or _2
Valid naming examples:
sample001_F.ab1 / sample001_R.ab1
sample001_forward.ab1 / sample001_reverse.ab1
sample001_1.ab1 / sample001_2.ab1
MySample_F.ab1 / MySample_R.ab1
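The naming rules above can be sketched as a small pairing check to run on your own filenames before starting. This is an illustration only: `pair_reads` and the two regular expressions are hypothetical and are not taken from the pipeline's source. The sketch also assumes the direction marker is the last part of the filename before the extension, which is stricter than "must contain".

```python
import re
from collections import defaultdict
from pathlib import Path

# Suffixes come from the naming rules above. Anchoring them at the end of the
# stem is an assumption; the pipeline's own detection may be more permissive.
FORWARD = re.compile(r"^(?P<sample>.+)_(?:F|forward|1)$")
REVERSE = re.compile(r"^(?P<sample>.+)_(?:R|reverse|2)$")


def pair_reads(filenames):
    """Group AB1 filenames into {sample: {"forward": name, "reverse": name}}."""
    pairs = defaultdict(dict)
    for name in filenames:
        stem = Path(name).stem  # drop the .ab1 extension
        for direction, pattern in (("forward", FORWARD), ("reverse", REVERSE)):
            match = pattern.match(stem)
            if match:
                pairs[match.group("sample")][direction] = name
                break
    return dict(pairs)


example = ["sample001_F.ab1", "sample001_R.ab1", "MySample_forward.ab1"]
for sample, reads in pair_reads(example).items():
    status = "paired" if len(reads) == 2 else "incomplete"
    print(sample, status, reads)
```

A sample reported as "incomplete" is missing one direction, which is worth fixing before running the validation step.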
Setup and Configuration
Create Working Directory
# Create project directory
mkdir single_sample_analysis
cd single_sample_analysis
# Create input directory
mkdir input
# Copy your AB1 files
cp /path/to/your/sample001_F.ab1 input/
cp /path/to/your/sample001_R.ab1 input/
Copy Configuration File
# Copy default configuration
cp /path/to/sanger_adna_damage/config/default_config.yaml ./sample_config.yaml
Verify Input Files
# Check your input files
ls -la input/
# Should show:
# sample001_F.ab1
# sample001_R.ab1
Step-by-Step Processing
Step 1: Validate Setup
Before running the analysis, validate your setup:
# Check if pipeline can find your files
python -m src.sanger_pipeline.cli.main validate \
--check-input ./input \
--config ./sample_config.yaml
Expected output:
✓ Configuration file is valid
✓ Input directory exists
✓ Found 2 AB1 files (1 sample pair)
✓ External dependencies available
Validation passed!
Step 2: Run the Complete Pipeline
Process your sample through all pipeline stages:
# Run complete analysis
python -m src.sanger_pipeline.cli.main run-pipeline \
--input-dir ./input \
--output-dir ./output \
--config ./sample_config.yaml \
--verbose
Processing stages (watch for these in the output):
Starting Sanger pipeline...
[1/6] Converting AB1 files to FASTA...
✓ Converted sample001_F.ab1 (285 bases, avg quality: 32.4)
✓ Converted sample001_R.ab1 (298 bases, avg quality: 29.8)
[2/6] Applying quality filtering...
✓ sample001_F: 267 bases retained (93.7%)
✓ sample001_R: 275 bases retained (92.3%)
[3/6] Aligning forward and reverse reads...
✓ sample001: Successfully aligned using MAFFT
[4/6] Building consensus sequences...
✓ sample001_HVS1: 142 bases
✓ sample001_HVS2: 198 bases
✓ sample001_HVS3: 89 bases
[5/6] Merging HVS regions...
✓ sample001: Combined HVS1+HVS2+HVS3 (429 bases total)
[6/6] Analyzing damage patterns...
✓ sample001: Damage score = 0.15 (p-value = 0.23)
Pipeline completed successfully!
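The retention percentages and the merged length in the example log follow from simple arithmetic, which is handy for sanity-checking your own run. The sketch below reproduces the figures shown above; `retention_percent` is an illustrative helper, not a pipeline function.

```python
def retention_percent(original_bases, retained_bases):
    """Share of bases kept by quality filtering, as a percentage."""
    if original_bases <= 0:
        raise ValueError("original_bases must be positive")
    return 100.0 * retained_bases / original_bases


# Figures copied from the example log above.
print(f"sample001_F: {retention_percent(285, 267):.1f}%")  # 93.7%
print(f"sample001_R: {retention_percent(298, 275):.1f}%")  # 92.3%

# The merged length is the sum of the three HVS consensus lengths.
hvs_lengths = {"HVS1": 142, "HVS2": 198, "HVS3": 89}
print("merged length:", sum(hvs_lengths.values()))  # 429
```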
Step 3: Examine Output Structure
Explore what the pipeline created:
# View the complete output structure
tree output/
Output explanation:
output/
├── fasta/                      # Raw FASTA conversions
│   ├── sample001_F.fasta       # Forward read as FASTA
│   └── sample001_R.fasta       # Reverse read as FASTA
├── filtered/                   # Quality-filtered sequences
│   ├── sample001_F_filtered.fasta
│   └── sample001_R_filtered.fasta
├── aligned/                    # Aligned forward+reverse reads
│   └── sample001_aligned.fasta
├── consensus/                  # Consensus by HVS region
│   ├── sample001_HVS1_consensus.fasta
│   ├── sample001_HVS2_consensus.fasta
│   └── sample001_HVS3_consensus.fasta
├── final/                      # Your main result
│   └── sample001_final.fasta
├── damage_analysis/            # Ancient DNA assessment
│   └── sample001_damage_analysis.json
└── plots/                      # Quality visualizations
    ├── sample001_F_quality.png
    └── sample001_R_quality.png
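If `tree` is not installed, a short script can confirm that the key files from the layout above exist. This is a minimal sketch: `missing_outputs` and the `EXPECTED` list are hypothetical helpers built from the paths shown in this guide, and the per-region consensus files are left out on the assumption that not every sample covers all three HVS regions.

```python
from pathlib import Path

# Relative paths copied from the output tree above; "{s}" is the sample name.
EXPECTED = [
    "fasta/{s}_F.fasta",
    "fasta/{s}_R.fasta",
    "filtered/{s}_F_filtered.fasta",
    "filtered/{s}_R_filtered.fasta",
    "aligned/{s}_aligned.fasta",
    "final/{s}_final.fasta",
    "damage_analysis/{s}_damage_analysis.json",
]


def missing_outputs(output_dir, sample):
    """Return the expected relative paths that are not present on disk."""
    root = Path(output_dir)
    paths = [template.format(s=sample) for template in EXPECTED]
    return [p for p in paths if not (root / p).is_file()]


missing = missing_outputs("output", "sample001")
print("All expected files present" if not missing else f"Missing: {missing}")
```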
Examine Your Results
Step 4: View Final Sequence
Your main result is the final consensus sequence:
# View your final processed sequence
cat output/final/sample001_final.fasta
Example output:
>sample001_HVS1_HVS2_HVS3_final
GATTTCACGGAGGATGGTGGTCAAGGGACCCCCCCTCCCCCATGCTTACAAGCAAGTACA
TGTTTGTTTGAGATGCTTTGCTCACCCCCTCTCTTTGTTTGCTTTGGAGCACTTGGAACC
GATGGTGCTGGTTCCGGAGCCCTGTTTATCCACCTTGTTTCCCCTGTATTCCATCTCTAC
CTTCCAACCCATTCCCACCCCACTCGTTGGTGAATCTTATTTTTCGGTTAGAGTCCCACC
CTGTGTGACCCTGCTTGTGATGCCGTTAGAGATGGTAACAGAGGTTATCATGCTTCCCTA
GGCTACTACTGTGCAAGGCCCCCATTTGTTCAATGGAAAGATTTCGTTGATCCGTGTGAC
CTGGAAACAGGCAAAGATGGGGATGATGGCGCCTCTAGGATAATAGGGCGTGTTTCACGG
AGGATGGTGGTCAAGGGACCCCCCCTCCCCCATGCTTACAAGCAAGTACATG
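To check the length of a final sequence and spot ambiguous bases without extra dependencies, a few lines of standard-library Python are enough. The sketch below parses FASTA text held in a string; `read_fasta` and `summarize` are illustrative names, and the demo record is made up rather than real pipeline output.

```python
def read_fasta(text):
    """Parse FASTA text into {header: sequence}, joining wrapped lines."""
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def summarize(seq):
    """Report sequence length and the number of non-ACGT characters."""
    seq = seq.upper()
    ambiguous = sum(1 for base in seq if base not in "ACGT")
    return {"length": len(seq), "ambiguous": ambiguous}


# Made-up demo record; replace the text with the contents of your own file.
demo = ">demo_record\nACGTNA\nCGT\n"
for name, seq in read_fasta(demo).items():
    print(name, summarize(seq))
```

For a real run, pass `Path("output/final/sample001_final.fasta").read_text()` (with `from pathlib import Path`) instead of the demo string.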
Step 5: Check Damage Analysis
Examine the ancient DNA damage assessment:
# View damage analysis results (formatted)
python -c "
import json
with open('output/damage_analysis/sample001_damage_analysis.json') as f:
    data = json.load(f)
print('Sample:', data['sample_id'])
print('Damage Score:', data['damage_score'])
print('P-value:', data['p_value'])
print('Assessment:', data['assessment'])
print('C→T rate:', f\"{data['c_to_t_rate']:.2%}\")
print('G→A rate:', f\"{data['g_to_a_rate']:.2%}\")
"
Example output:
Sample: sample001
Damage Score: 0.15
P-value: 0.23
Assessment: Low damage detected
C→T rate: 8.50%
G→A rate: 6.20%
Interpretation:
* Damage Score 0.15: Low damage (consistent with modern DNA)
* P-value 0.23: Not statistically significant (p > 0.05)
* Assessment: Low damage suggests modern DNA or a well-preserved sample
Step 6: Generate Interactive Report
Create a comprehensive QC report:
# Generate interactive HTML report
python -m src.sanger_pipeline.cli.main generate-report \
--output-dir ./output \
--title "Sample001 Analysis Report" \
--open-browser
This opens a detailed report in your browser with:
Overview: Processing summary and quality metrics
Damage Analysis: Detailed damage assessment with plots
Quality Control: Sequence quality distributions
Sample Details: Per-file processing results
Understanding Your Results
Quality Metrics
Check these key indicators in your report:
# Get quick status summary
python -m src.sanger_pipeline.cli.main status \
--output-dir ./output \
--detailed
Good quality indicators:
* ✓ Both forward and reverse reads processed successfully
* ✓ Average quality scores > 20
* ✓ Sequence lengths > 50 bp after filtering
* ✓ Multiple HVS regions detected
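The numeric indicators above can be expressed as a quick check to apply to the values you read from the log or report. This is only a sketch: `quality_flags` is a hypothetical helper, the thresholds are the ones quoted in this guide, and "multiple HVS regions" is assumed to mean two or more.

```python
def quality_flags(avg_quality, filtered_length, hvs_regions):
    """Evaluate the guide's quality indicators for one read or sample."""
    return {
        "average quality > 20": avg_quality > 20,
        "filtered length > 50 bp": filtered_length > 50,
        "multiple HVS regions": hvs_regions >= 2,  # assumption: two or more
    }


# Values taken from the example log for sample001_F.
flags = quality_flags(avg_quality=32.4, filtered_length=267, hvs_regions=3)
for name, ok in flags.items():
    print(("PASS" if ok else "FAIL"), name)
```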
HVS Region Coverage
Your sample should ideally cover multiple HVS regions:
HVS1 only: Partial coverage, adequate for basic analysis
HVS1 + HVS2: Good coverage for most applications
HVS1 + HVS2 + HVS3: Excellent coverage for comprehensive analysis
Damage Assessment Interpretation
Based on your damage analysis results:
Modern DNA Pattern (damage score < 0.3, p > 0.05):
* Low C→T and G→A transition rates
* Even damage distribution
* High confidence in sequence authenticity
Potential Ancient DNA (damage score > 0.3, p < 0.05):
* Elevated C→T transitions at 5' ends
* Elevated G→A transitions at 3' ends
* Characteristic ancient DNA damage pattern
Borderline Cases (damage score 0.2-0.4):
* May need additional validation
* Consider sample age and preservation conditions
* Additional quality controls recommended
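The three categories can be turned into a small decision function for triaging results. The thresholds come from this guide, but the borderline band (0.2-0.4) overlaps the other two, so the sketch assumes the borderline band takes precedence. It also labels combinations the guide does not cover (for example a high score with a non-significant p-value) as inconclusive. `classify_damage` is a hypothetical helper and may not match the pipeline's own `assessment` field.

```python
def classify_damage(damage_score, p_value):
    """Map a damage score and p-value onto the guide's categories.

    Assumption: scores in the 0.2-0.4 band are treated as borderline first,
    because that band overlaps the modern and ancient thresholds.
    """
    if 0.2 <= damage_score <= 0.4:
        return "borderline"
    if damage_score < 0.3 and p_value > 0.05:
        return "modern pattern"
    if damage_score > 0.3 and p_value < 0.05:
        return "potential ancient"
    return "inconclusive"  # combination not described in the guide


# The example sample from Step 5: score 0.15, p-value 0.23.
print(classify_damage(0.15, 0.23))
```

Treat the result as a prompt for closer inspection of the damage plots, not as a verdict on authenticity.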
View Quality Plots
Examine quality score distributions:
# View quality plots (requires an image viewer)
open output/plots/sample001_F_quality.png # macOS
# or
xdg-open output/plots/sample001_F_quality.png # Linux
These plots show:
* Quality score distribution along sequence length
* Areas of high/low quality
* Regions that were filtered out
Troubleshooting Common Issues
Issue 1: Low Quality Sequences
Problem: Sequences are too short after quality filtering
Solution: Temporarily lower the quality threshold
# Re-run with lower quality threshold
python -m src.sanger_pipeline.cli.main run-pipeline \
--input-dir ./input \
--output-dir ./output_lowq \
--quality-threshold 15
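Before lowering the threshold, it helps to see what the numbers mean in terms of base-calling error. Assuming the pipeline's quality values are standard Phred scores (the usual convention for Sanger base callers, though this guide does not state it), the expected error probability is 10^(-Q/10). The sketch below compares a few thresholds.

```python
def phred_to_error(q):
    """Convert a Phred quality score to its expected error probability."""
    return 10 ** (-q / 10)


for q in (15, 20, 30):
    p = phred_to_error(q)
    print(f"Q{q}: error probability {p:.4f} (about 1 in {round(1 / p)})")
```

Under that assumption, dropping from 20 to 15 roughly triples the tolerated per-base error rate (about 3.2% instead of 1%), so treat sequences recovered this way with extra caution.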
Issue 2: No HVS Regions Detected
Problem: No consensus sequences in HVS regions
Check sequence content:
# Examine filtered sequences
cat output/filtered/sample001_F_filtered.fasta
cat output/filtered/sample001_R_filtered.fasta
Solutions:
* Check whether the sequences are mitochondrial DNA
* Verify the HVS region coordinates in the configuration
* Consider whether the sample covers different regions
Issue 3: Alignment Failures
Problem: MAFFT alignment fails
Check:
# Verify MAFFT installation
mafft --version
# Check sequence compatibility
head -n 20 output/filtered/sample001_*_filtered.fasta
Solutions:
* Ensure MAFFT is properly installed
* Check that the sequences are from the same organism
* Verify that the sequences have sufficient length
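The installation check can also be done from Python, which is useful when the pipeline runs in an environment whose PATH differs from your shell. The sketch uses only the standard library; `find_tool` is an illustrative wrapper, and how the pipeline itself locates MAFFT is not documented here.

```python
import shutil


def find_tool(name="mafft"):
    """Return the full path of an executable on PATH, or None if absent."""
    return shutil.which(name)


path = find_tool()
if path:
    print(f"MAFFT found at {path}")
else:
    print("MAFFT not found on PATH; install it or adjust PATH before re-running")
```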
Next Steps
After successfully processing your single sample:
Analyze More Samples: Use batch_processing for multiple samples
Customize Analysis: Try create_custom_config for different settings
Ancient DNA Focus: Explore assess_damage_patterns for ancient samples
Publication Reports: Create publication-ready outputs with generate_publication_reports
Success Checklist
Your single sample analysis is successful if:
✓ Both AB1 files were converted to FASTA
✓ Quality filtering retained reasonable sequence lengths
✓ Forward and reverse reads were successfully aligned
✓ At least one HVS region consensus was generated
✓ Final merged sequence was created
✓ Damage analysis completed without errors
✓ Interactive QC report generated successfully
Typical Results Summary
For a successful single sample analysis, expect:
Input: 2 AB1 files (forward + reverse)
Output: 1 final consensus sequence
Processing time: 2-5 minutes
Quality retention: 80-95% of original bases
HVS coverage: 1-3 regions depending on sample
Final sequence length: 100-500bp typically
File sizes (approximate):
* AB1 files: 50-200 KB each
* Final FASTA: 1-2 KB
* Damage analysis JSON: 2-5 KB
* QC report HTML: 500 KB-2 MB
This guide provides a complete workflow for processing individual samples, giving you the foundation to understand the pipeline before scaling up to larger analyses.