Configuration
The Sanger DNA Damage Analysis Pipeline uses YAML configuration files to control its behavior. This guide explains all configuration options and how to customize them for your specific needs.
Tip
Configuration for Screening Workflows
Remember that this pipeline is designed for sample screening and prioritization. Configure parameters to optimize for identifying promising samples for NGS follow-up, not for definitive authentication.
📁 Configuration Files
Default Configuration
The pipeline comes with a default configuration file at config/default_config.yaml. This contains sensible defaults for most use cases:
# config/default_config.yaml
quality_threshold: 20
min_sequence_length: 30
damage_threshold: 0.05
bootstrap_iterations: 10000
hvs_regions:
  HVS1:
    start: 16024
    end: 16365
  HVS2:
    start: 57
    end: 372
  HVS3:
    start: 438
    end: 574
Custom Configuration
Create your own configuration file for specific projects:
# Copy default configuration
cp config/default_config.yaml my_project_config.yaml
# Edit with your preferred editor
nano my_project_config.yaml
# Use in pipeline
python -m src.sanger_pipeline.cli.main run-pipeline \
    --config my_project_config.yaml \
    --input-dir ./input \
    --output-dir ./output
🔧 Configuration Parameters
Quality Control Settings
quality_threshold
Type: Integer
Default: 20
Range: 0-40+
Description: Minimum Phred quality score for sequence filtering
quality_threshold: 20 # Keep bases with Q20+ (99% accuracy)
Usage Guidelines:
Modern DNA (Q30+): Use 25-30 for high-quality modern samples
Ancient DNA (Q15-20): Use 15-20 for degraded ancient samples
Exploratory (Q10-15): Use 10-15 for initial assessment of poor samples
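The thresholds above follow from the standard Phred definition, in which a score Q corresponds to an error probability of 10^(-Q/10). The short sketch below only illustrates that relationship; the function names are made up for this example and are not part of the pipeline.

```python
def phred_to_error_probability(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def phred_to_accuracy(q: float) -> float:
    """Probability that a base call with Phred score q is correct."""
    return 1.0 - phred_to_error_probability(q)


# Compare the thresholds mentioned in the usage guidelines.
for q in (10, 15, 20, 25, 30):
    error = phred_to_error_probability(q)
    print(f"Q{q}: error probability {error:.4f}, accuracy {phred_to_accuracy(q):.2%}")
```

Each 10-point drop in the threshold allows a tenfold higher per-base error rate, which is why the exploratory range should be reserved for initial assessment.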
min_sequence_length
Type: Integer
Default: 50
Range: 10-1000+
Description: Minimum sequence length after quality filtering
min_sequence_length: 50 # Sequences must be at least 50bp
Usage Guidelines:
Standard Analysis: 50-100bp minimum for reliable analysis
Ancient DNA: 30-50bp for highly degraded samples
High Quality: 100-200bp for modern, high-quality samples
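To see how the two quality control settings interact, here is a minimal sketch. It assumes that quality filtering trims low-quality bases from both ends of a read before the length check is applied. That trimming strategy, and the helper names `trim_low_quality_ends` and `passes_length_filter`, are assumptions for illustration; the pipeline's actual filtering may work differently.

```python
from typing import List, Tuple


def trim_low_quality_ends(
    seq: str, quals: List[int], quality_threshold: int
) -> Tuple[str, List[int]]:
    """Trim bases below the threshold from both ends (assumed strategy)."""
    start, end = 0, len(seq)
    while start < end and quals[start] < quality_threshold:
        start += 1
    while end > start and quals[end - 1] < quality_threshold:
        end -= 1
    return seq[start:end], quals[start:end]


def passes_length_filter(seq: str, min_sequence_length: int) -> bool:
    """Keep a read only if enough bases remain after trimming."""
    return len(seq) >= min_sequence_length


# Toy read: poor quality at both ends, good quality in the middle.
seq = "ACGTACGTAC"
quals = [5, 12, 25, 30, 32, 31, 28, 22, 9, 4]
trimmed, _ = trim_low_quality_ends(seq, quals, quality_threshold=20)
print(trimmed, passes_length_filter(trimmed, min_sequence_length=5))
```

Under this assumption, raising quality_threshold shortens reads, so a strict threshold combined with a high min_sequence_length can discard most reads from a degraded sample.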
Ancient DNA Analysis Settings
damage_threshold
Type: Float
Default: 0.05
Range: 0.001-0.1
Description: P-value threshold for damage assessment significance
damage_threshold: 0.05 # 5% significance level
Usage Guidelines:
Conservative: 0.01 (1%) for strict damage assessment
Standard: 0.05 (5%) for typical analysis
Liberal: 0.1 (10%) for exploratory analysis
bootstrap_iterations
Type: Integer
Default: 10000
Range: 1000-100000
Description: Number of bootstrap iterations for damage analysis
bootstrap_iterations: 10000 # 10,000 iterations
Usage Guidelines:
Quick Testing: 1000-5000 iterations
Standard Analysis: 10000 iterations
High Precision: 50000-100000 iterations (slower)
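As a rough guide to choosing an iteration count, the sketch below assumes the bootstrap p-value is estimated as a simple proportion of resamples, so its Monte Carlo standard error is sqrt(p(1-p)/n). How the pipeline actually computes its statistic is not documented here, so treat this as a back-of-the-envelope estimate; `p_value_standard_error` is a hypothetical helper.

```python
import math


def p_value_standard_error(p: float, iterations: int) -> float:
    """Monte Carlo standard error of a p-value estimated as a proportion."""
    return math.sqrt(p * (1.0 - p) / iterations)


# Uncertainty of an estimated p-value close to the default damage_threshold.
for n in (1000, 10000, 100000):
    se = p_value_standard_error(0.05, n)
    print(f"{n:>6} iterations: standard error ~ {se:.5f}")
```

Under this assumption, precision improves only with the square root of the iteration count: going from 10000 to 100000 iterations costs ten times the work for roughly a threefold reduction in uncertainty.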
HVS Region Definitions
hvs_regions
Type: Dictionary
Description: Coordinates for hypervariable regions of mitochondrial DNA
hvs_regions:
  HVS1:
    start: 16024
    end: 16365
  HVS2:
    start: 57
    end: 372
  HVS3:
    start: 438
    end: 574
Customization:
You can modify these coordinates or add new regions:
hvs_regions:
  HVS1:
    start: 16000  # Extended HVS1 region
    end: 16400
  HVS2:
    start: 50  # Extended HVS2 region
    end: 400
  CUSTOM_REGION:  # Add custom region
    start: 1000
    end: 1500
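Once parsed, a region definition like the one above is a nested dictionary. The sketch below shows one way such a mapping might be used to compute region lengths and to find the region containing a given position. It assumes that both start and end are inclusive and that no region wraps around the origin of the circular mitochondrial genome; neither assumption is confirmed by the pipeline, and the helper names are hypothetical.

```python
from typing import Dict, Optional

Regions = Dict[str, Dict[str, int]]


def region_length(bounds: Dict[str, int]) -> int:
    """Length in bp, assuming start and end are both inclusive."""
    return bounds["end"] - bounds["start"] + 1


def region_for_position(regions: Regions, position: int) -> Optional[str]:
    """Return the name of the first region containing position, if any."""
    for name, bounds in regions.items():
        if bounds["start"] <= position <= bounds["end"]:
            return name
    return None


# The customized regions from the example above, as a parsed dictionary.
regions: Regions = {
    "HVS1": {"start": 16000, "end": 16400},
    "HVS2": {"start": 50, "end": 400},
    "CUSTOM_REGION": {"start": 1000, "end": 1500},
}

for name, bounds in regions.items():
    print(f"{name}: {region_length(bounds)} bp")
print(region_for_position(regions, 16100))
print(region_for_position(regions, 5000))
```

A quick check like this helps confirm that custom regions do not overlap unintentionally and that they cover the positions you expect.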
🎯 Configuration Templates
Ancient DNA Configuration
Optimized for degraded ancient DNA samples:
# ancient_dna_config.yaml
# Relaxed quality filtering for degraded samples
quality_threshold: 15
min_sequence_length: 30
# Sensitive damage detection
damage_threshold: 0.1
bootstrap_iterations: 50000
# Standard HVS regions
hvs_regions:
  HVS1:
    start: 16024
    end: 16365
  HVS2:
    start: 57
    end: 372
Modern DNA Configuration
Optimized for high-quality modern samples:
# modern_dna_config.yaml
# Strict quality filtering
quality_threshold: 30
min_sequence_length: 100
# Conservative damage detection (expecting no damage)
damage_threshold: 0.01
bootstrap_iterations: 10000
# Standard HVS regions
hvs_regions:
  HVS1:
    start: 16024
    end: 16365
  HVS2:
    start: 57
    end: 372
  HVS3:
    start: 438
    end: 574
Exploratory Configuration
For initial assessment of unknown samples:
# exploratory_config.yaml
# Permissive quality filtering
quality_threshold: 10
min_sequence_length: 25
# Liberal damage detection
damage_threshold: 0.1
bootstrap_iterations: 5000
# All HVS regions
hvs_regions:
  HVS1:
    start: 16024
    end: 16365
  HVS2:
    start: 57
    end: 372
  HVS3:
    start: 438
    end: 574
High-Throughput Configuration
For processing large numbers of samples quickly:
# high_throughput_config.yaml
# Balanced quality filtering
quality_threshold: 20
min_sequence_length: 50
# Fast damage analysis
damage_threshold: 0.05
bootstrap_iterations: 5000 # Reduced for speed
# Focus on most informative regions
hvs_regions:
  HVS1:
    start: 16024
    end: 16365
  HVS2:
    start: 57
    end: 372
🧪 Validation and Testing
Configuration Validation
Test your configuration before running large analyses:
# Validate configuration syntax
python -c "import yaml; yaml.safe_load(open('my_config.yaml'))"
# Test with pipeline status command
python -m src.sanger_pipeline.cli.main status --config my_config.yaml
# Run on small test dataset
python -m src.sanger_pipeline.cli.main run-pipeline \
    --input-dir ./test_data \
    --output-dir ./test_output \
    --config my_config.yaml
Parameter Testing
Test different parameter values systematically:
# Test different quality thresholds
for threshold in 15 20 25 30; do
    echo "Testing quality threshold: $threshold"
    python -m src.sanger_pipeline.cli.main run-pipeline \
        --input-dir ./test_data \
        --output-dir "./output_q${threshold}" \
        --quality-threshold "$threshold"
done
🔍 Advanced Configuration
Environment Variables
Some settings can be controlled via environment variables:
# Override configuration file location
export SANGER_CONFIG=/path/to/my/config.yaml
# Set temporary directory
export TMPDIR=/path/to/large/temp/space
# Control memory usage
export MAX_MEMORY_GB=8
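How a tool picks up a variable such as SANGER_CONFIG can be sketched as follows. The precedence shown (an explicit path first, then the environment variable, then the bundled default) is an assumption made for illustration, and `resolve_config_path` is a hypothetical helper rather than a function from the pipeline.

```python
import os
from typing import Mapping, Optional

DEFAULT_CONFIG_PATH = "config/default_config.yaml"


def resolve_config_path(
    cli_path: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Pick a config path: explicit argument, then SANGER_CONFIG, then default."""
    if cli_path:
        return cli_path
    # An unset or empty SANGER_CONFIG falls through to the default path.
    return env.get("SANGER_CONFIG") or DEFAULT_CONFIG_PATH


# Passing a plain dict instead of os.environ makes the behavior easy to check.
print(resolve_config_path(env={}))
print(resolve_config_path(env={"SANGER_CONFIG": "/path/to/my/config.yaml"}))
print(resolve_config_path("my_config.yaml", env={"SANGER_CONFIG": "/path/to/my/config.yaml"}))
```

If your runs do not pick up the configuration you expect, checking the variable with `echo "$SANGER_CONFIG"` before starting the pipeline is a quick first step.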
Command Line Overrides
You can override configuration values from the command line:
# Override quality threshold
python -m src.sanger_pipeline.cli.main run-pipeline \
    --config my_config.yaml \
    --quality-threshold 25 \
    --input-dir ./input \
    --output-dir ./output
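Conceptually, an override like this layers command-line values on top of the configuration file, which in turn sits on top of the defaults. The following is a minimal sketch of that layering with plain dictionaries. The merge order and the `merge_settings` helper are assumptions for illustration, not the pipeline's actual implementation.

```python
from typing import Any, Dict


def merge_settings(
    defaults: Dict[str, Any],
    file_config: Dict[str, Any],
    cli_overrides: Dict[str, Any],
) -> Dict[str, Any]:
    """Shallow merge: defaults < config file < command-line options."""
    merged = dict(defaults)
    merged.update(file_config)
    # Options not given on the command line are represented as None and skipped.
    merged.update({k: v for k, v in cli_overrides.items() if v is not None})
    return merged


defaults = {"quality_threshold": 20, "damage_threshold": 0.05}
file_config = {"quality_threshold": 15, "bootstrap_iterations": 50000}
cli_overrides = {"quality_threshold": 25, "damage_threshold": None}

print(merge_settings(defaults, file_config, cli_overrides))
```

Because the merge is shallow, a nested key such as hvs_regions would be replaced as a whole rather than combined region by region.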
Configuration Validation Schema
The pipeline validates configuration files against a schema. Required fields:
# Minimum required configuration
quality_threshold: 20
damage_threshold: 0.05
hvs_regions:
  HVS1:
    start: 16024
    end: 16365
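A pre-flight check for the required fields might look like the sketch below. It operates on an already parsed configuration dictionary, and the specific rules (a non-negative integer quality threshold, a damage threshold strictly between 0 and 1, and start/end keys for every region) are illustrative assumptions, not the pipeline's actual schema.

```python
from typing import Any, Dict, List

REQUIRED_FIELDS = ("quality_threshold", "damage_threshold", "hvs_regions")


def find_config_problems(config: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = [f"missing required field: {f}" for f in REQUIRED_FIELDS if f not in config]

    q = config.get("quality_threshold")
    if q is not None and (isinstance(q, bool) or not isinstance(q, int) or q < 0):
        problems.append("quality_threshold must be a non-negative integer")

    d = config.get("damage_threshold")
    if d is not None and not (isinstance(d, float) and 0.0 < d < 1.0):
        problems.append("damage_threshold must be a float between 0 and 1")

    regions = config.get("hvs_regions")
    if regions is not None:
        if not isinstance(regions, dict) or not regions:
            problems.append("hvs_regions must be a non-empty mapping")
        else:
            for name, bounds in regions.items():
                if not isinstance(bounds, dict) or not {"start", "end"} <= bounds.keys():
                    problems.append(f"hvs_regions.{name} needs both start and end")
    return problems


minimal = {
    "quality_threshold": 20,
    "damage_threshold": 0.05,
    "hvs_regions": {"HVS1": {"start": 16024, "end": 16365}},
}
print(find_config_problems(minimal))
print(find_config_problems({"quality_threshold": 20}))
```

Collecting all problems in a list, rather than stopping at the first one, lets you fix a configuration file in a single pass.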
🔄 Configuration Management
Version Control
Track your configuration files in version control:
# Add configuration to git
git add my_project_config.yaml
git commit -m "Add project-specific configuration"
Multiple Configurations
Organize configurations by project or sample type:
configs/
├── default_config.yaml
├── ancient_dna/
│   ├── permafrost_samples.yaml
│   └── cave_samples.yaml
├── modern_dna/
│   ├── reference_samples.yaml
│   └── population_study.yaml
└── exploratory/
    └── unknown_samples.yaml
Configuration Documentation
Document your custom configurations:
# ancient_permafrost_config.yaml
# Configuration for ancient DNA from permafrost samples
# Created: 2024-01-15
# Author: Research Team
# Purpose: Optimized for highly degraded permafrost samples
quality_threshold: 12 # Very permissive due to degradation
min_sequence_length: 25 # Short fragments expected
damage_threshold: 0.1 # Liberal due to expected damage
bootstrap_iterations: 50000 # High precision for publication
⚠️ Common Configuration Issues
YAML Syntax Errors
# ❌ Incorrect indentation
hvs_regions:
HVS1:
start: 16024
# ✅ Correct indentation
hvs_regions:
  HVS1:
    start: 16024
Invalid Parameter Values
# ❌ Invalid quality threshold
quality_threshold: 45 # Too high
# ✅ Valid quality threshold
quality_threshold: 25 # Reasonable for high-quality samples
Missing Required Fields
# ❌ Missing required fields
quality_threshold: 20
# Missing damage_threshold and hvs_regions!
# ✅ All required fields
quality_threshold: 20
damage_threshold: 0.05
hvs_regions:
  HVS1:
    start: 16024
    end: 16365
🎯 Best Practices
Start with Defaults: Use the default configuration as a starting point
Document Changes: Comment your modifications and reasoning
Test Thoroughly: Validate configurations on small datasets first
Version Control: Track configuration changes alongside code
Project-Specific: Create separate configurations for different projects
Parameter Testing: Systematically test different parameter values
Backup Configs: Keep copies of working configurations
📊 Performance Tuning
Quality vs. Speed Trade-offs
# Fast processing (lower quality)
quality_threshold: 15
bootstrap_iterations: 5000
# High quality (slower processing)
quality_threshold: 25
bootstrap_iterations: 50000
Memory Optimization
# For large datasets, reduce memory usage
min_sequence_length: 100 # Filter short sequences early
quality_threshold: 25 # Strict filtering reduces data volume
🔗 Integration with Other Tools
Export Configuration
# Convert to other formats for external tools
python -c "
import yaml, json
with open('my_config.yaml') as f:
    config = yaml.safe_load(f)
with open('my_config.json', 'w') as f:
    json.dump(config, f, indent=2)
"
Pipeline Integration
# Use in automated pipelines
CONFIG_FILE="configs/production_config.yaml"
python -m src.sanger_pipeline.cli.main run-pipeline \
    --config "$CONFIG_FILE" \
    --input-dir "$INPUT_DIR" \
    --output-dir "$OUTPUT_DIR"
Together, these options let you tune the pipeline for your specific research needs, from quick exploratory screening to more thorough ancient DNA damage assessment.