Optimizing Performance

Performance optimization in probabilistic programming requires understanding both computational complexity and numerical analysis. This guide explores Fugue's systematic approach to memory optimization, numerical stability, and algorithmic efficiency for production-scale probabilistic workloads.

Computational Complexity Framework

Probabilistic programs exhibit multi-dimensional complexity:

  • Time complexity: $O(n \cdot d \cdot k)$ for $n$ samples, $d$ parameters, and $k$ iterations
  • Space complexity: $O(d)$ with memory pooling (amortized)
  • Numerical complexity: the condition number $\kappa$ of the computation affects convergence

Fugue's optimization framework addresses each dimension systematically.
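
To make the time-complexity term above concrete (illustrative numbers, not a measured benchmark): a modest run with $n = 10^4$ samples, $d = 10^2$ parameters, and $k = 10$ iterations already performs

$$n \cdot d \cdot k = 10^4 \cdot 10^2 \cdot 10 = 10^7$$

primitive operations, so per-operation constant factors such as allocation and transcendental-function cost dominate wall-clock time.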

Memory-Optimized Inference

Memory allocation becomes the computational bottleneck in high-throughput scenarios due to garbage-collection overhead. The allocation rate for naive inference scales as

$$R_{\text{alloc}} = O(n \cdot |T| \cdot f_{\text{GC}})$$

where $n$ is the sample count, $|T|$ is the trace size, and $f_{\text{GC}}$ is the GC frequency. Fugue's object pooling reduces this to $O(1)$ after warmup:

graph TD
    subgraph "Traditional Allocation"
        A1["Sample 1"] --> B1["Allocate Trace"]
        B1 --> C1["GC Pressure"]
        A2["Sample 2"] --> B2["Allocate Trace"]  
        B2 --> C2["GC Pressure"]
        A3["Sample n"] --> B3["Allocate Trace"]
        B3 --> C3["GC Pressure"]
    end
    
    subgraph "Pooled Allocation"  
        D1["Sample 1"] --> E1["Reuse from Pool"]
        D2["Sample 2"] --> E1
        D3["Sample n"] --> E1
        E1 --> F["Zero GC Pressure"]
    end

    // Create trace pool for zero-allocation inference
    let mut pool = TracePool::new(50); // Pool up to 50 traces
    let mut rng = thread_rng();

    // Define a model that would normally cause many allocations
    let make_model = || {
        prob!(
            let x <- sample(addr!("x"), Normal::new(0.0, 1.0).unwrap());
            let y <- sample(addr!("y"), Normal::new(x, 0.5).unwrap());
            observe(addr!("obs"), Normal::new(y, 0.1).unwrap(), 1.5);
            pure(x)
        )
    };

    // Time pooled vs non-pooled execution
    let start = Instant::now();
    for _iteration in 0..1000 {
        // Use pooled handler for efficient memory reuse
        let (_result, trace) =
            runtime::handler::run(PooledPriorHandler::new(&mut rng, &mut pool), make_model());
        // Return trace to pool for reuse
        pool.return_trace(trace);
    }
    let pooled_time = start.elapsed();

    let stats = pool.stats();
    println!("✅ Completed 1000 iterations with memory pooling");
    println!("   - Execution time: {:?}", pooled_time);
    println!("   - Hit ratio: {:.1}%", stats.hit_ratio());
    println!(
        "   - Pool stats - hits: {}, misses: {}",
        stats.hits, stats.misses
    );

Key Benefits:

  • Zero-allocation execution after warm-up
  • Configurable pool size for memory control
  • Automatic trace recycling and cleanup
  • Built-in performance monitoring with hit ratios

Numerical Stability

Numerical stability in probabilistic computing requires careful analysis of condition numbers and floating-point precision. The log-sum-exp (LSE) operation is fundamental:

$$\operatorname{LSE}(x_1, \ldots, x_n) = m + \log \sum_{i=1}^{n} e^{x_i - m}, \qquad m = \max_i x_i$$

Stability Analysis: Direct computation of $\log \sum_i e^{x_i}$ has a condition number that grows with $\max_i |x_i|$, so it becomes ill-conditioned when $|x_i| \gg 1$; moreover, $e^{x_i}$ overflows f64 once $x_i > \ln(\texttt{f64::MAX}) \approx 709.78$. The max-shifted form keeps every exponent non-positive.

Catastrophic Cancellation

When the $x_i$ are large and similar, direct computation in linear space suffers from catastrophic cancellation and overflow before the final logarithm can rescale the result. The LSE formulation maintains relative precision regardless of scale.
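
A quick standalone illustration of why the max shift matters (local helper functions, not Fugue's implementations):

    // Naive vs. max-shifted log-sum-exp. Inputs above ~709 make exp()
    // overflow f64, so the naive form returns inf while the shifted
    // form stays finite.
    fn naive_lse(xs: &[f64]) -> f64 {
        xs.iter().map(|&x| x.exp()).sum::<f64>().ln()
    }

    fn shifted_lse(xs: &[f64]) -> f64 {
        let m = xs.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
        m + xs.iter().map(|&x| (x - m).exp()).sum::<f64>().ln()
    }

    fn main() {
        let xs = [710.0, 711.0, 709.0];
        println!("naive:   {}", naive_lse(&xs));      // inf
        println!("shifted: {:.4}", shifted_lse(&xs)); // ~711.4076
    }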

    // Demonstrate stable log-probability computations
    let extreme_log_probs = vec![700.0, 701.0, 699.0, 698.0]; // exp() of these is near the f64 overflow threshold

    // Safe log-sum-exp prevents overflow
    let log_normalizer = log_sum_exp(&extreme_log_probs);
    let normalized_probs = normalize_log_probs(&extreme_log_probs);

    println!("✅ Stable computation with extreme log-probabilities");
    println!("   - Log normalizer: {:.2}", log_normalizer);
    println!(
        "   - Probabilities sum to: {:.10}",
        normalized_probs.iter().sum::<f64>()
    );

    // Weighted log-sum-exp for importance sampling
    let log_values = vec![-1.0, -2.0, -3.0, -4.0];
    let weights = vec![0.4, 0.3, 0.2, 0.1];
    let weighted_result = weighted_log_sum_exp(&log_values, &weights);

    println!("   - Weighted log-sum-exp: {:.4}", weighted_result);

    // Safe logarithm handling
    let safe_results: Vec<f64> = [1.0, 0.0, -1.0].iter().map(|&x| safe_ln(x)).collect();
    println!("   - Safe ln results: {:?}", safe_results);

Stability Features:

  • log_sum_exp prevents overflow in mixture computations
  • weighted_log_sum_exp for importance sampling
  • safe_ln handles edge cases gracefully
  • All operations maintain numerical precision across scales

Efficient Trace Construction

When building traces programmatically, use TraceBuilder for optimal performance:

    // Use TraceBuilder for efficient trace creation
    let mut builder = TraceBuilder::new();

    let start = Instant::now();
    for i in 0..100 {
        // Add choices efficiently without reallocations
        builder.add_sample(
            addr!("param", i),
            i as f64,
            0.0, // log_prob
        );
    }

    // Build final trace efficiently
    let constructed_trace = builder.build();
    let construction_time = start.elapsed();

    println!("✅ Efficient trace construction");
    println!(
        "   - Built trace with {} choices in {:?}",
        constructed_trace.choices.len(),
        construction_time
    );
    println!(
        "   - Total log weight: {:.2}",
        constructed_trace.total_log_weight()
    );

Construction Benefits:

  • Pre-allocated data structures minimize reallocations
  • Type-specific insertion methods avoid boxing overhead
  • Batch operations for multiple choices
  • Efficient conversion to immutable traces

Copy-on-Write for MCMC

MCMC algorithms exhibit temporal locality in parameter updates, modifying only $k \ll d$ parameters per iteration, where $d$ is the total dimensionality. Copy-on-Write (COW) data structures exploit this pattern:

graph TD
    subgraph "MCMC Iteration Structure"
        A["Base Trace T₀"] --> B{"Proposal Step"}
        B --> C["Modified Parameters δ"]
        C --> D{"Small Changes?"}
        D -->|Yes| E["COW: Share + Δ"]
        D -->|No| F["Full Copy"]
        E --> G["O(1) Memory"]
        F --> H["O(d) Memory"]
    end

Complexity Analysis: Traditional MCMC requires $O(d)$ space per sample. COW reduces this to $O(\delta)$, where $\delta$ is the edit distance between successive traces.

    // Create base trace manually for MCMC
    let mut builder = TraceBuilder::new();
    builder.add_sample(addr!("mu"), 0.5, -0.5);
    builder.add_sample(addr!("sigma"), 1.0, -1.0);
    builder.add_sample_bool(addr!("component"), true, -0.69);
    let base_trace = builder.build();

    // Create COW trace for efficient copying
    let cow_base = CowTrace::from_trace(base_trace);

    let start = Instant::now();
    let mut mcmc_traces = Vec::new();

    for _proposal in 0..1000 {
        // Clone is O(1) until modification
        let mut proposal_trace = cow_base.clone();

        // Modify only one parameter (triggers COW)
        proposal_trace.insert_choice(
            addr!("mu"),
            Choice {
                addr: addr!("mu"),
                value: ChoiceValue::F64(0.6),
                logp: -0.4,
            },
        );

        mcmc_traces.push(proposal_trace);
    }
    let cow_time = start.elapsed();

    println!("✅ Copy-on-write MCMC proposals");
    println!("   - Created 1000 proposal traces in {:?}", cow_time);
    println!("   - Memory sharing until modification");

MCMC Optimizations:

  • O(1) trace cloning until modification
  • Shared memory for unchanged parameters
  • Lazy copying only when traces diverge
  • Perfect for Metropolis-Hastings and Gibbs sampling

Vectorized Model Patterns

Structure models for efficient batch processing:

    // Pre-allocate data structures for repeated use
    let observations: Vec<f64> = (0..100).map(|i| i as f64 * 0.1).collect();
    let n = observations.len();

    // Efficient vectorized model
    let vectorized_model = || {
        prob!(
            let mu <- sample(addr!("global_mu"), Normal::new(0.0, 10.0).unwrap());
            let precision <- sample(addr!("precision"), Gamma::new(2.0, 1.0).unwrap());
            let sigma = (1.0 / precision).sqrt();

            // Use plate for efficient vectorized operations
            let _likelihoods <- plate!(i in 0..n => {
                observe(addr!("obs", i), Normal::new(mu, sigma).unwrap(), observations[i])
            });

            pure((mu, sigma))
        )
    };

    let start = Instant::now();
    let (_result, _trace) = runtime::handler::run(
        PriorHandler {
            rng: &mut rng,
            trace: Trace::default(),
        },
        vectorized_model(),
    );
    let vectorized_time = start.elapsed();

    println!("✅ Optimized vectorized model");
    println!("   - Processed {} observations in {:?}", n, vectorized_time);

Vectorization Strategy:

  • Pre-allocate data collections
  • Use plate! for independent parallel operations
  • Minimize dynamic allocations in hot paths
  • Leverage compiler optimizations with static sizing

Performance Monitoring

Systematic performance monitoring requires tracking each performance metric against its theoretical bound, e.g. the per-sample efficiency

$$\eta = \frac{t_{\min}}{t_{\text{observed}}} \le 1$$

where $t_{\min}$ is the theoretical minimum execution time per sample.

Amdahl's Law for MCMC

Even with perfect parallelization, MCMC exhibits sequential dependencies that limit speedup:

$$S(p) = \frac{1}{s + \frac{1 - s}{p}}$$

where $s$ is the fraction of sequential computation and $p$ is the number of processors.
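
To get a feel for the bound (plain arithmetic, independent of any Fugue API), a 10% sequential fraction caps speedup below 10x no matter how many cores are added:

    // Amdahl's law: S(p) = 1 / (s + (1 - s) / p).
    // Assumes s = 0.1, i.e. 10% of each iteration is inherently sequential.
    fn main() {
        let s = 0.1;
        for p in [1u32, 2, 4, 8, 16, 64, 1024] {
            let speedup = 1.0 / (s + (1.0 - s) / p as f64);
            println!("p = {:>4}: speedup = {:.2}x", p, speedup);
        }
    }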

    // Monitor trace characteristics for optimization insights
    #[derive(Debug)]
    struct TraceMetrics {
        num_choices: usize,
        log_weight: f64,
        is_valid: bool,
        memory_size_estimate: usize,
    }

    impl TraceMetrics {
        fn from_trace(trace: &Trace) -> Self {
            let num_choices = trace.choices.len();
            let log_weight = trace.total_log_weight();
            let is_valid = log_weight.is_finite();

            // Rough memory estimate (actual implementation would be more precise)
            let memory_size_estimate = num_choices * 64; // Rough bytes per choice

            Self {
                num_choices,
                log_weight,
                is_valid,
                memory_size_estimate,
            }
        }
    }

    // Example: Monitor a complex model's performance
    let complex_model = || {
        prob!(
            let components <- plate!(c in 0..5 => {
                sample(addr!("weight", c), Gamma::new(1.0, 1.0).unwrap())
                    .bind(move |weight| {
                        sample(addr!("mu", c), Normal::new(0.0, 2.0).unwrap())
                            .map(move |mu| (weight, mu))
                    })
            });

            let selector <- sample(addr!("selector"),
                                  Categorical::new(vec![0.2, 0.2, 0.2, 0.2, 0.2]).unwrap());

            pure((components, selector))
        )
    };

    let (_result, trace) = runtime::handler::run(
        PriorHandler {
            rng: &mut rng,
            trace: Trace::default(),
        },
        complex_model(),
    );

    let metrics = TraceMetrics::from_trace(&trace);
    println!("✅ Performance monitoring active");
    println!("   - Trace choices: {}", metrics.num_choices);
    println!("   - Log weight: {:.2}", metrics.log_weight);
    println!("   - Valid: {}", metrics.is_valid);
    println!(
        "   - Memory estimate: {} bytes",
        metrics.memory_size_estimate
    );

Monitoring Approach:

  • Collect trace characteristics for optimization insights
  • Track memory usage patterns
  • Validate numerical stability
  • Profile execution bottlenecks

Batch Processing

Batch processing amortizes setup costs and exploits hardware parallelism. The optimal batch size balances memory usage and throughput:

$$b^* = \arg\min_b \left( \frac{C_{\text{init}}}{b} + c_{\text{mem}} \, b + c_{\text{sync}} \right)$$

where:

  • $C_{\text{init}}$ is the per-batch initialization cost
  • $c_{\text{mem}}$ is the per-sample memory cost
  • $c_{\text{sync}}$ is the synchronization overhead

graph LR
    subgraph "Performance vs Batch Size"
        A["Small Batches<br/>b → 1"] --> B["High Setup<br/>Overhead"]
        C["Large Batches<br/>b → ∞"] --> D["Memory<br/>Pressure"]  
        E["Optimal Batch<br/>b*"] --> F["Balanced<br/>Performance"]
    end
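
Under that cost model (the constants below are illustrative assumptions, not measured Fugue numbers), the optimum can be found by a direct sweep:

    // Hypothetical per-sample cost: C_init / b + c_mem * b + c_sync.
    // The minimum falls at b* = sqrt(C_init / c_mem) = sqrt(500 / 0.05) = 100.
    fn main() {
        let (c_init, c_mem, c_sync) = (500.0_f64, 0.05, 2.0); // assumed costs in µs
        let (best_b, best_cost) = (1u32..=1024)
            .map(|b| (b, c_init / b as f64 + c_mem * b as f64 + c_sync))
            .min_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
            .unwrap();
        println!("b* = {} ({:.2} µs/sample)", best_b, best_cost);
    }

With the batch size chosen, Fugue's pooled handler amortizes allocation across the whole batch:
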
    // Efficient batch inference using memory pooling
    let batch_size = 100;
    let mut batch_pool = TracePool::new(batch_size);

    let start = Instant::now();
    let mut batch_results = Vec::with_capacity(batch_size);

    for _batch in 0..batch_size {
        let (result, trace) = runtime::handler::run(
            PooledPriorHandler::new(&mut rng, &mut batch_pool),
            make_model(),
        );
        // Return trace to pool for reuse
        batch_pool.return_trace(trace);
        batch_results.push(result);
    }

    let batch_time = start.elapsed();
    let batch_stats = batch_pool.stats();

    println!("✅ Batch processing complete");
    println!("   - Processed {} samples in {:?}", batch_size, batch_time);
    println!(
        "   - Average time per sample: {:?}",
        batch_time / batch_size as u32
    );
    println!(
        "   - Memory efficiency: {:.1}% hit ratio",
        batch_stats.hit_ratio()
    );

Batch Benefits:

  • Amortized setup costs across samples
  • Memory pool reuse for consistent performance
  • Scalable to large sample counts
  • Predictable memory footprint

Numerical Precision Testing

Validate stability across different computational scales:

    // Test numerical stability across different scales
    let test_scales = vec![1e-10, 1e-5, 1.0, 1e5, 1e10];

    for &scale in &test_scales {
        let scale: f64 = scale;
        let log_vals = vec![scale.ln() + 1.0, scale.ln() + 2.0, scale.ln() + 0.5];

        let stable_sum = log_sum_exp(&log_vals);
        let log1p_result = log1p_exp(scale.ln());

        println!(
            "   Scale {:.0e}: log_sum_exp={:.4}, log1p_exp={:.4}",
            scale, stable_sum, log1p_result
        );
    }
    println!("✅ Numerical stability verified across scales");

Testing Strategy:

  • Verify stability across extreme value ranges
  • Test edge cases and boundary conditions
  • Validate consistency of numerical operations
  • Profile precision vs. performance trade-offs

Performance Testing

Implement systematic performance validation:

    #[test]
    fn test_memory_pool_efficiency() {
        let mut pool = TracePool::new(10);
        let mut rng = thread_rng();

        // Test pool reuse with PooledPriorHandler
        for _i in 0..20 {
            let (_, trace) = runtime::handler::run(
                PooledPriorHandler::new(&mut rng, &mut pool),
                sample(addr!("test"), Normal::new(0.0, 1.0).unwrap()),
            );
            // Return trace to pool for reuse
            pool.return_trace(trace);
        }

        let stats = pool.stats();
        assert!(
            stats.hit_ratio() > 50.0,
            "Pool should have good hit ratio, got {:.1}%",
            stats.hit_ratio()
        );
        assert!(stats.hits + stats.misses > 0, "Pool should have been used");
    }

    #[test]
    fn test_numerical_stability() {
        // Test log_sum_exp with extreme values
        let extreme_vals = vec![700.0, 701.0, 699.0];
        let result = log_sum_exp(&extreme_vals);
        assert!(
            result.is_finite(),
            "log_sum_exp should handle extreme values"
        );

        // Test normalization
        let normalized = normalize_log_probs(&extreme_vals);
        let sum: f64 = normalized.iter().sum();
        assert!(
            (sum - 1.0).abs() < 1e-10,
            "Normalized probabilities should sum to 1"
        );

        // Test weighted computation
        let weights = vec![0.5, 0.3, 0.2];
        let weighted_result = weighted_log_sum_exp(&extreme_vals, &weights);
        assert!(
            weighted_result.is_finite(),
            "Weighted log_sum_exp should be finite"
        );
    }

    #[test]
    fn test_trace_builder_efficiency() {
        let mut builder = TraceBuilder::new();

        // Add many choices efficiently
        for i in 0..100 {
            builder.add_sample(addr!("param", i), i as f64, -0.5);
        }

        let trace = builder.build();
        assert_eq!(trace.choices.len(), 100);
        assert!(trace.total_log_weight().is_finite());
    }

    #[test]
    fn test_cow_trace_sharing() {
        // Create base trace using builder
        let mut builder = TraceBuilder::new();
        builder.add_sample(addr!("x"), 1.0, -0.5);
        let base = builder.build();
        let cow_trace = CowTrace::from_trace(base);

        // Clone should be fast
        let clone1 = cow_trace.clone();
        let clone2 = cow_trace.clone();

        // Should share data until modification - convert to regular trace to test
        let trace1 = clone1.to_trace();
        let trace2 = clone2.to_trace();
        assert_eq!(trace1.get_f64(&addr!("x")), Some(1.0));
        assert_eq!(trace2.get_f64(&addr!("x")), Some(1.0));
    }

Testing Framework:

  • Memory pool efficiency validation
  • Numerical stability regression tests
  • Trace construction benchmarking
  • COW sharing verification

Production Deployment

Memory Configuration

  • Size TracePool based on peak concurrent inference
  • Monitor hit ratios to validate pool efficiency
  • Use COW traces for MCMC workloads
  • Pre-warm pools before production traffic (see the sketch below)
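
One way to pre-warm, reusing only the pooled-handler pattern shown earlier (`make_model` is the model factory from the memory-pooling example; snippet-style like the examples above, so imports are omitted):

    // Run enough iterations to populate the pool before real traffic
    // arrives, so steady-state requests are served from recycled traces.
    let mut pool = TracePool::new(50);
    let mut rng = thread_rng();
    for _ in 0..50 {
        let (_result, trace) =
            runtime::handler::run(PooledPriorHandler::new(&mut rng, &mut pool), make_model());
        pool.return_trace(trace);
    }
    println!("warmup hit ratio: {:.1}%", pool.stats().hit_ratio());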

Numerical Strategies

  • Always use log-space for probability computations (see the sketch after this list)
  • Validate extreme value handling in testing
  • Monitor for numerical instabilities in production
  • Use stable algorithms for critical computations
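
Why log-space is non-negotiable, as a standalone sketch (plain Rust, no Fugue API): multiplying many small likelihoods underflows f64 to zero, while summing their logarithms preserves the value.

    // 100 observations with likelihood 1e-5 each: the linear product is
    // 1e-500, far below f64's smallest subnormal (~5e-324), so it
    // underflows to 0. The log-space sum stays finite and accurate.
    fn main() {
        let likelihoods = vec![1e-5_f64; 100];
        let linear: f64 = likelihoods.iter().product();
        let log_space: f64 = likelihoods.iter().map(|p| p.ln()).sum();
        println!("linear product: {}", linear);       // 0
        println!("log-space sum:  {:.2}", log_space); // about -1151.29
    }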

Monitoring and Alerting

  • Track inference latency and memory usage
  • Monitor pool statistics and efficiency metrics
  • Alert on numerical instabilities or performance degradation
  • Profile hot paths for optimization opportunities

Common Performance Patterns

  1. Pool First: Use TracePool for any repeated inference
  2. Log Always: Work in log-space for numerical stability
  3. Batch Everything: Amortize costs across multiple samples
  4. Monitor Continuously: Track performance metrics in production
  5. Test Extremes: Validate stability with extreme values

These optimization strategies enable Fugue to handle production-scale probabilistic programming workloads with consistent performance and numerical reliability.