Optimizing Performance
Performance optimization in probabilistic programming requires understanding both computational complexity and numerical analysis. This guide explores Fugue's systematic approach to memory optimization, numerical stability, and algorithmic efficiency for production-scale probabilistic workloads.
Probabilistic programs exhibit multi-dimensional complexity:
- Time complexity: $O(n \cdot d \cdot k)$ for $n$ samples, $d$ parameters, and $k$ iterations
- Space complexity: $O(d)$ with memory pooling, versus $O(n \cdot d)$ for naive allocation
- Numerical complexity: the condition number $\kappa$ affects convergence and precision
Fugue's optimization framework addresses each dimension systematically.
Memory-Optimized Inference
Memory allocation becomes the computational bottleneck in high-throughput scenarios due to garbage-collection overhead. The allocation rate for naive inference scales as

$$R_{\text{alloc}} = O(n \cdot s \cdot f_{\text{GC}})$$

where $n$ is the sample count, $s$ is the trace size, and $f_{\text{GC}}$ is the GC frequency. Fugue's object pooling reduces this to $O(1)$ allocations per sample after warmup:
```mermaid
graph TD
    subgraph "Traditional Allocation"
        A1["Sample 1"] --> B1["Allocate Trace"]
        B1 --> C1["GC Pressure"]
        A2["Sample 2"] --> B2["Allocate Trace"]
        B2 --> C2["GC Pressure"]
        A3["Sample n"] --> B3["Allocate Trace"]
        B3 --> C3["GC Pressure"]
    end
    subgraph "Pooled Allocation"
        D1["Sample 1"] --> E1["Reuse from Pool"]
        D2["Sample 2"] --> E1
        D3["Sample n"] --> E1
        E1 --> F["Zero GC Pressure"]
    end
```
```rust
// Create a trace pool for zero-allocation inference.
let mut pool = TracePool::new(50); // Pool up to 50 traces
let mut rng = thread_rng();

// Define a model that would normally cause many allocations.
let make_model = || {
    prob!(
        let x <- sample(addr!("x"), Normal::new(0.0, 1.0).unwrap());
        let y <- sample(addr!("y"), Normal::new(x, 0.5).unwrap());
        observe(addr!("obs"), Normal::new(y, 0.1).unwrap(), 1.5);
        pure(x)
    )
};

// Time pooled execution.
let start = Instant::now();
for _iteration in 0..1000 {
    // Use the pooled handler for efficient memory reuse.
    let (_result, trace) =
        runtime::handler::run(PooledPriorHandler::new(&mut rng, &mut pool), make_model());
    // Return the trace to the pool for reuse.
    pool.return_trace(trace);
}
let pooled_time = start.elapsed();

let stats = pool.stats();
println!("✅ Completed 1000 iterations with memory pooling");
println!(" - Execution time: {:?}", pooled_time);
println!(" - Hit ratio: {:.1}%", stats.hit_ratio());
println!(
    " - Pool stats - hits: {}, misses: {}",
    stats.hits, stats.misses
);
```
Key Benefits:
- Zero-allocation execution after warm-up
- Configurable pool size for memory control
- Automatic trace recycling and cleanup
- Built-in performance monitoring with hit ratios
Numerical Stability
Numerical stability in probabilistic computing requires careful analysis of condition numbers and floating-point precision. The log-sum-exp (LSE) operation is fundamental:

$$\operatorname{LSE}(x_1, \ldots, x_n) = \log \sum_{i=1}^{n} e^{x_i} = m + \log \sum_{i=1}^{n} e^{x_i - m}, \qquad m = \max_i x_i$$

Stability Analysis: Direct evaluation of $\sum_i e^{x_i}$ overflows 64-bit floats once any $x_i$ exceeds roughly $709$, and when the $x_i$ are large and similar the naive form suffers from catastrophic cancellation. The shifted LSE formulation maintains relative precision regardless of scale.
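The max-shift trick is small enough to sketch directly. The following is an illustrative stand-alone version, not Fugue's `log_sum_exp` itself:

```rust
/// Minimal log-sum-exp sketch: shift by the maximum so every exp()
/// argument is <= 0 and cannot overflow.
fn lse(xs: &[f64]) -> f64 {
    let m = xs.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    if !m.is_finite() {
        return m; // empty input, or every term is -inf
    }
    m + xs.iter().map(|&x| (x - m).exp()).sum::<f64>().ln()
}

// lse(&[700.0, 701.0, 699.0]) ≈ 701.41, even though exp(700.0)
// alone would overflow f64.
```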
```rust
// Demonstrate stable log-probability computations.
let extreme_log_probs = vec![700.0, 701.0, 699.0, 698.0]; // Would overflow in linear space

// Safe log-sum-exp prevents overflow.
let log_normalizer = log_sum_exp(&extreme_log_probs);
let normalized_probs = normalize_log_probs(&extreme_log_probs);

println!("✅ Stable computation with extreme log-probabilities");
println!(" - Log normalizer: {:.2}", log_normalizer);
println!(
    " - Probabilities sum to: {:.10}",
    normalized_probs.iter().sum::<f64>()
);

// Weighted log-sum-exp for importance sampling.
let log_values = vec![-1.0, -2.0, -3.0, -4.0];
let weights = vec![0.4, 0.3, 0.2, 0.1];
let weighted_result = weighted_log_sum_exp(&log_values, &weights);
println!(" - Weighted log-sum-exp: {:.4}", weighted_result);

// Safe logarithm handling.
let safe_results: Vec<f64> = [1.0, 0.0, -1.0].iter().map(|&x| safe_ln(x)).collect();
println!(" - Safe ln results: {:?}", safe_results);
```
Stability Features:
- `log_sum_exp` prevents overflow in mixture computations
- `weighted_log_sum_exp` supports stable importance sampling
- `safe_ln` handles edge cases (such as zero and negative inputs) gracefully
- All operations maintain numerical precision across scales
Efficient Trace Construction
When building traces programmatically, use `TraceBuilder` for optimal performance:
```rust
// Use TraceBuilder for efficient trace creation.
let mut builder = TraceBuilder::new();
let start = Instant::now();

for i in 0..100 {
    // Add choices efficiently, without reallocations.
    builder.add_sample(
        addr!("param", i),
        i as f64,
        0.0, // log_prob
    );
}

// Build the final trace.
let constructed_trace = builder.build();
let construction_time = start.elapsed();

println!("✅ Efficient trace construction");
println!(
    " - Built trace with {} choices in {:?}",
    constructed_trace.choices.len(),
    construction_time
);
println!(
    " - Total log weight: {:.2}",
    constructed_trace.total_log_weight()
);
```
Construction Benefits:
- Pre-allocated data structures minimize reallocations
- Type-specific insertion methods avoid boxing overhead
- Batch operations for multiple choices
- Efficient conversion to immutable traces
Copy-on-Write for MCMC
MCMC algorithms exhibit temporal locality in parameter updates, modifying only $k \ll d$ parameters per iteration, where $d$ is the total dimensionality. Copy-on-write (COW) data structures exploit this pattern:
```mermaid
graph TD
    subgraph "MCMC Iteration Structure"
        A["Base Trace T₀"] --> B{"Proposal Step"}
        B --> C["Modified Parameters δ"]
        C --> D{"Small Changes?"}
        D -->|Yes| E["COW: Share + Δ"]
        D -->|No| F["Full Copy"]
        E --> G["O(1) Memory"]
        F --> H["O(d) Memory"]
    end
```
Complexity Analysis: Traditional MCMC requires $O(d)$ space per sample. COW reduces this to $O(\delta)$, where $\delta$ is the edit distance between successive traces.
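The underlying mechanism is easy to picture with `std::sync::Arc`: clones share storage, and `Arc::make_mut` copies only when a shared clone is first written. This is an illustrative sketch of the idea, not `CowTrace`'s actual internals:

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

/// Illustrative copy-on-write map. Cloning is O(1): it bumps a
/// reference count instead of copying the choices.
#[derive(Clone)]
struct CowChoices {
    inner: Arc<BTreeMap<String, f64>>,
}

impl CowChoices {
    fn insert(&mut self, key: &str, value: f64) {
        // Arc::make_mut clones the map only if another handle still
        // shares it; after that first write, mutation is in place.
        Arc::make_mut(&mut self.inner).insert(key.to_string(), value);
    }
}
```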
```rust
// Create a base trace manually for MCMC.
let mut builder = TraceBuilder::new();
builder.add_sample(addr!("mu"), 0.5, -0.5);
builder.add_sample(addr!("sigma"), 1.0, -1.0);
builder.add_sample_bool(addr!("component"), true, -0.69);
let base_trace = builder.build();

// Create a COW trace for efficient copying.
let cow_base = CowTrace::from_trace(base_trace);

let start = Instant::now();
let mut mcmc_traces = Vec::new();
for _proposal in 0..1000 {
    // Cloning is O(1) until modification.
    let mut proposal_trace = cow_base.clone();

    // Modify only one parameter (triggers the copy-on-write).
    proposal_trace.insert_choice(
        addr!("mu"),
        Choice {
            addr: addr!("mu"),
            value: ChoiceValue::F64(0.6),
            logp: -0.4,
        },
    );
    mcmc_traces.push(proposal_trace);
}
let cow_time = start.elapsed();

println!("✅ Copy-on-write MCMC proposals");
println!(" - Created 1000 proposal traces in {:?}", cow_time);
println!(" - Memory sharing until modification");
```
MCMC Optimizations:
- O(1) trace cloning until modification
- Shared memory for unchanged parameters
- Lazy copying only when traces diverge
- Perfect for Metropolis-Hastings and Gibbs sampling
Vectorized Model Patterns
Structure models for efficient batch processing:
```rust
// Pre-allocate data structures for repeated use.
let observations: Vec<f64> = (0..100).map(|i| i as f64 * 0.1).collect();
let n = observations.len();

// An efficient vectorized model.
let vectorized_model = || {
    prob!(
        let mu <- sample(addr!("global_mu"), Normal::new(0.0, 10.0).unwrap());
        let precision <- sample(addr!("precision"), Gamma::new(2.0, 1.0).unwrap());
        let sigma = (1.0 / precision).sqrt();
        // Use plate! for efficient vectorized operations.
        let _likelihoods <- plate!(i in 0..n => {
            observe(addr!("obs", i), Normal::new(mu, sigma).unwrap(), observations[i])
        });
        pure((mu, sigma))
    )
};

let start = Instant::now();
let (_result, _trace) = runtime::handler::run(
    PriorHandler {
        rng: &mut rng,
        trace: Trace::default(),
    },
    vectorized_model(),
);
let vectorized_time = start.elapsed();

println!("✅ Optimized vectorized model");
println!(" - Processed {} observations in {:?}", n, vectorized_time);
```
Vectorization Strategy:
- Pre-allocate data collections
- Use `plate!` for independent, parallel operations
- Minimize dynamic allocations in hot paths
- Leverage compiler optimizations with static sizing
Performance Monitoring
Systematic performance monitoring tracks each metric against its theoretical bound. A useful summary is the efficiency ratio

$$\eta = \frac{T_{\text{min}}}{T_{\text{actual}}}$$

where $T_{\text{min}}$ is the theoretical minimum execution time per sample. Even with perfect parallelization, MCMC exhibits sequential dependencies that limit speedup (Amdahl's law):

$$S(p) = \frac{1}{s + (1 - s)/p}$$

where $s$ is the fraction of sequential computation and $p$ is the number of processors.
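As a quick numeric check of that bound (an illustrative helper, not part of Fugue's API):

```rust
/// Amdahl's law: achievable speedup with sequential fraction `s`
/// on `p` processors.
fn amdahl_speedup(s: f64, p: f64) -> f64 {
    1.0 / (s + (1.0 - s) / p)
}

// amdahl_speedup(0.10, 8.0) ≈ 4.71: with 10% sequential work, eight
// cores deliver under 5x, and the limit as p → ∞ is only 1/s = 10x.
```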
```rust
// Monitor trace characteristics for optimization insights.
#[derive(Debug)]
struct TraceMetrics {
    num_choices: usize,
    log_weight: f64,
    is_valid: bool,
    memory_size_estimate: usize,
}

impl TraceMetrics {
    fn from_trace(trace: &Trace) -> Self {
        let num_choices = trace.choices.len();
        let log_weight = trace.total_log_weight();
        let is_valid = log_weight.is_finite();
        // Rough memory estimate (a real implementation would be more precise).
        let memory_size_estimate = num_choices * 64; // Rough bytes per choice
        Self {
            num_choices,
            log_weight,
            is_valid,
            memory_size_estimate,
        }
    }
}

// Example: monitor a complex model's performance.
let complex_model = || {
    prob!(
        let components <- plate!(c in 0..5 => {
            sample(addr!("weight", c), Gamma::new(1.0, 1.0).unwrap())
                .bind(move |weight| {
                    sample(addr!("mu", c), Normal::new(0.0, 2.0).unwrap())
                        .map(move |mu| (weight, mu))
                })
        });
        let selector <- sample(addr!("selector"),
            Categorical::new(vec![0.2, 0.2, 0.2, 0.2, 0.2]).unwrap());
        pure((components, selector))
    )
};

let (_result, trace) = runtime::handler::run(
    PriorHandler {
        rng: &mut rng,
        trace: Trace::default(),
    },
    complex_model(),
);

let metrics = TraceMetrics::from_trace(&trace);
println!("✅ Performance monitoring active");
println!(" - Trace choices: {}", metrics.num_choices);
println!(" - Log weight: {:.2}", metrics.log_weight);
println!(" - Valid: {}", metrics.is_valid);
println!(
    " - Memory estimate: {} bytes",
    metrics.memory_size_estimate
);
```
Monitoring Approach:
- Collect trace characteristics for optimization insights
- Track memory usage patterns
- Validate numerical stability
- Profile execution bottlenecks
Batch Processing
Batch processing amortizes setup costs and exploits hardware parallelism. The optimal batch size $b^*$ balances memory usage and throughput:

$$b^* = \arg\min_b \left( \frac{C_{\text{init}}}{b} + C_{\text{mem}} \cdot b + C_{\text{sync}} \right)$$

where:
- $C_{\text{init}}$ is the per-batch initialization cost
- $C_{\text{mem}}$ is the per-sample memory cost
- $C_{\text{sync}}$ is the synchronization overhead

A small scan over this cost model follows the diagram below.
```mermaid
graph LR
    subgraph "Performance vs Batch Size"
        A["Small Batches<br/>b → 1"] --> B["High Setup<br/>Overhead"]
        C["Large Batches<br/>b → ∞"] --> D["Memory<br/>Pressure"]
        E["Optimal Batch<br/>b*"] --> F["Balanced<br/>Performance"]
    end
```
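Under that cost model the optimum can be found with a direct scan. The constants below are illustrative placeholders you would replace with measured values:

```rust
/// Per-sample cost under the model above: amortized setup, plus memory
/// pressure growing with batch size, plus fixed synchronization overhead.
fn per_sample_cost(b: f64, c_init: f64, c_mem: f64, c_sync: f64) -> f64 {
    c_init / b + c_mem * b + c_sync
}

// Scan candidate batch sizes with made-up constants; with these numbers
// the minimum lands at b* = sqrt(c_init / c_mem) = 100.
let (c_init, c_mem, c_sync) = (500.0, 0.05, 1.0);
let best = (1u32..=1024)
    .map(|b| (b, per_sample_cost(b as f64, c_init, c_mem, c_sync)))
    .min_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
    .unwrap();
println!("b* ≈ {} (cost {:.2} per sample)", best.0, best.1);
```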
```rust
// Efficient batch inference using memory pooling.
let batch_size = 100;
let mut batch_pool = TracePool::new(batch_size);

let start = Instant::now();
let mut batch_results = Vec::with_capacity(batch_size);
for _batch in 0..batch_size {
    let (result, trace) = runtime::handler::run(
        PooledPriorHandler::new(&mut rng, &mut batch_pool),
        make_model(),
    );
    // Return the trace to the pool for reuse.
    batch_pool.return_trace(trace);
    batch_results.push(result);
}
let batch_time = start.elapsed();

let batch_stats = batch_pool.stats();
println!("✅ Batch processing complete");
println!(" - Processed {} samples in {:?}", batch_size, batch_time);
println!(
    " - Average time per sample: {:?}",
    batch_time / batch_size as u32
);
println!(
    " - Memory efficiency: {:.1}% hit ratio",
    batch_stats.hit_ratio()
);
```
Batch Benefits:
- Amortized setup costs across samples
- Memory pool reuse for consistent performance
- Scalable to large sample counts
- Predictable memory footprint
Numerical Precision Testing
Validate stability across different computational scales:
```rust
// Test numerical stability across different scales.
let test_scales = vec![1e-10, 1e-5, 1.0, 1e5, 1e10];

for &scale in &test_scales {
    let scale: f64 = scale;
    let log_vals = vec![scale.ln() + 1.0, scale.ln() + 2.0, scale.ln() + 0.5];
    let stable_sum = log_sum_exp(&log_vals);
    let log1p_result = log1p_exp(scale.ln());
    println!(
        " Scale {:.0e}: log_sum_exp={:.4}, log1p_exp={:.4}",
        scale, stable_sum, log1p_result
    );
}
println!("✅ Numerical stability verified across scales");
```
Testing Strategy:
- Verify stability across extreme value ranges
- Test edge cases and boundary conditions
- Validate consistency of numerical operations
- Profile precision vs. performance trade-offs
Performance Testing
Implement systematic performance validation:
```rust
#[test]
fn test_memory_pool_efficiency() {
    let mut pool = TracePool::new(10);
    let mut rng = thread_rng();

    // Exercise pool reuse with PooledPriorHandler.
    for _i in 0..20 {
        let (_, trace) = runtime::handler::run(
            PooledPriorHandler::new(&mut rng, &mut pool),
            sample(addr!("test"), Normal::new(0.0, 1.0).unwrap()),
        );
        // Return the trace to the pool for reuse.
        pool.return_trace(trace);
    }

    let stats = pool.stats();
    assert!(
        stats.hit_ratio() > 50.0,
        "Pool should have good hit ratio, got {:.1}%",
        stats.hit_ratio()
    );
    assert!(stats.hits + stats.misses > 0, "Pool should have been used");
}

#[test]
fn test_numerical_stability() {
    // Test log_sum_exp with extreme values.
    let extreme_vals = vec![700.0, 701.0, 699.0];
    let result = log_sum_exp(&extreme_vals);
    assert!(
        result.is_finite(),
        "log_sum_exp should handle extreme values"
    );

    // Test normalization.
    let normalized = normalize_log_probs(&extreme_vals);
    let sum: f64 = normalized.iter().sum();
    assert!(
        (sum - 1.0).abs() < 1e-10,
        "Normalized probabilities should sum to 1"
    );

    // Test weighted computation.
    let weights = vec![0.5, 0.3, 0.2];
    let weighted_result = weighted_log_sum_exp(&extreme_vals, &weights);
    assert!(
        weighted_result.is_finite(),
        "Weighted log_sum_exp should be finite"
    );
}

#[test]
fn test_trace_builder_efficiency() {
    let mut builder = TraceBuilder::new();

    // Add many choices efficiently.
    for i in 0..100 {
        builder.add_sample(addr!("param", i), i as f64, -0.5);
    }

    let trace = builder.build();
    assert_eq!(trace.choices.len(), 100);
    assert!(trace.total_log_weight().is_finite());
}

#[test]
fn test_cow_trace_sharing() {
    // Create a base trace using the builder.
    let mut builder = TraceBuilder::new();
    builder.add_sample(addr!("x"), 1.0, -0.5);
    let base = builder.build();

    let cow_trace = CowTrace::from_trace(base);

    // Cloning should be fast.
    let clone1 = cow_trace.clone();
    let clone2 = cow_trace.clone();

    // Clones share data until modification; convert to regular traces to check values.
    let trace1 = clone1.to_trace();
    let trace2 = clone2.to_trace();
    assert_eq!(trace1.get_f64(&addr!("x")), Some(1.0));
    assert_eq!(trace2.get_f64(&addr!("x")), Some(1.0));
}
```
Testing Framework:
- Memory pool efficiency validation
- Numerical stability regression tests
- Trace construction benchmarking
- COW sharing verification
Production Deployment
Memory Configuration
- Size `TracePool` based on peak concurrent inference
- Monitor hit ratios to validate pool efficiency
- Use COW traces for MCMC workloads
- Pre-warm pools before production traffic (see the sketch below)
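Pre-warming just runs the model against the pool before real traffic arrives, so the warm-up misses happen off the request path. A sketch reusing the same API and `make_model` closure from the examples above:

```rust
// Fill the pool before serving production traffic: the initial
// pool misses happen here instead of on live requests.
let capacity = 50;
let mut pool = TracePool::new(capacity);
let mut rng = thread_rng();
for _ in 0..capacity {
    let (_result, trace) =
        runtime::handler::run(PooledPriorHandler::new(&mut rng, &mut pool), make_model());
    pool.return_trace(trace);
}
let stats = pool.stats();
println!("Pre-warm complete: {} hits, {} misses", stats.hits, stats.misses);
```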
Numerical Strategies
- Always use log-space for probability computations
- Validate extreme value handling in testing
- Monitor for numerical instabilities in production
- Use stable algorithms for critical computations
Monitoring and Alerting
- Track inference latency and memory usage
- Monitor pool statistics and efficiency metrics
- Alert on numerical instabilities or performance degradation (see the sketch below)
- Profile hot paths for optimization opportunities
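A minimal alert hook can reuse the `TraceMetrics` struct from the monitoring example above; the thresholds here are illustrative placeholders to tune against your own latency and memory budgets:

```rust
use std::time::Duration;

/// Illustrative production check: returns alert messages for a single
/// inference run. Thresholds are placeholders, not Fugue defaults.
fn check_alerts(metrics: &TraceMetrics, latency: Duration) -> Vec<String> {
    let mut alerts = Vec::new();
    if !metrics.is_valid {
        alerts.push("non-finite log weight: numerical instability".to_string());
    }
    if metrics.memory_size_estimate > 1_000_000 {
        alerts.push("trace memory estimate exceeds 1 MB".to_string());
    }
    if latency > Duration::from_millis(100) {
        alerts.push(format!("inference latency {:?} over budget", latency));
    }
    alerts
}
```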
Common Performance Patterns
- Pool First: Use `TracePool` for any repeated inference
- Log Always: Work in log-space for numerical stability
- Batch Everything: Amortize costs across multiple samples
- Monitor Continuously: Track performance metrics in production
- Test Extremes: Validate stability with extreme values
These optimization strategies enable Fugue to handle production-scale probabilistic programming workloads with consistent performance and numerical reliability.