Optimizing Performance

Performance optimization in probabilistic programming requires understanding both computational complexity and numerical analysis. This guide explores Fugue's systematic approach to memory optimization, numerical stability, and algorithmic efficiency for production-scale probabilistic workloads.

Computational Complexity Framework

Probabilistic programs exhibit multi-dimensional complexity:

  • Time complexity: $O(n \cdot d \cdot k)$ for $n$ samples, $d$ parameters, and $k$ iterations
  • Space complexity: $O(d)$ with memory pooling (amortized)
  • Numerical complexity: the condition number $\kappa$ of the computation affects convergence

Fugue's optimization framework addresses each dimension systematically.
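
To make the time-complexity term above concrete (illustrative numbers, not a measured benchmark): a modest run with $n = 10^4$ samples, $d = 10^2$ parameters, and $k = 10$ iterations already performs

$$n \cdot d \cdot k = 10^4 \cdot 10^2 \cdot 10 = 10^7$$

primitive operations, so per-operation constant factors such as allocation and transcendental-function cost dominate wall-clock time.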

Memory-Optimized Inference

Memory allocation becomes the computational bottleneck in high-throughput scenarios due to garbage-collection overhead. The allocation rate for naive inference scales as

$$R_{\text{alloc}} = O(n \cdot |T| \cdot f_{\text{GC}})$$

where $n$ is the sample count, $|T|$ is the trace size, and $f_{\text{GC}}$ is the GC frequency. Fugue's object pooling reduces this to $O(1)$ after warmup:

graph TD
    subgraph "Traditional Allocation"
        A1["Sample 1"] --> B1["Allocate Trace"]
        B1 --> C1["GC Pressure"]
        A2["Sample 2"] --> B2["Allocate Trace"]  
        B2 --> C2["GC Pressure"]
        A3["Sample n"] --> B3["Allocate Trace"]
        B3 --> C3["GC Pressure"]
    end
    
    subgraph "Pooled Allocation"  
        D1["Sample 1"] --> E1["Reuse from Pool"]
        D2["Sample 2"] --> E1
        D3["Sample n"] --> E1
        E1 --> F["Zero GC Pressure"]
    end

    // Create trace pool for zero-allocation inference
    let mut pool = TracePool::new(50); // Pool up to 50 traces
    let mut rng = thread_rng();

    // Define a model that would normally cause many allocations
    let make_model = || {
        prob!(
            let x <- sample(addr!("x"), Normal::new(0.0, 1.0).unwrap());
            let y <- sample(addr!("y"), Normal::new(x, 0.5).unwrap());
            observe(addr!("obs"), Normal::new(y, 0.1).unwrap(), 1.5);
            pure(x)
        )
    };

    // Time pooled vs non-pooled execution
    let start = Instant::now();
    for _iteration in 0..1000 {
        // Use pooled handler for efficient memory reuse
        let (_result, trace) =
            runtime::handler::run(PooledPriorHandler::new(&mut rng, &mut pool), make_model());
        // Return trace to pool for reuse
        pool.return_trace(trace);
    }
    let pooled_time = start.elapsed();

    let stats = pool.stats();
    println!("✅ Completed 1000 iterations with memory pooling");
    println!("   - Execution time: {:?}", pooled_time);
    println!("   - Hit ratio: {:.1}%", stats.hit_ratio());
    println!(
        "   - Pool stats - hits: {}, misses: {}",
        stats.hits, stats.misses
    );

Key Benefits:

  • Zero-allocation execution after warm-up
  • Configurable pool size for memory control
  • Automatic trace recycling and cleanup
  • Built-in performance monitoring with hit ratios

Numerical Stability

Numerical stability in probabilistic computing requires careful analysis of condition numbers and floating-point precision. The log-sum-exp (LSE) operation is fundamental:

$$\operatorname{LSE}(x_1, \ldots, x_n) = m + \log \sum_{i=1}^{n} e^{x_i - m}, \qquad m = \max_i x_i$$

Stability Analysis: Direct computation of $\log \sum_i e^{x_i}$ has a condition number that grows with $\max_i |x_i|$, so it becomes ill-conditioned when $|x_i| \gg 1$; moreover, $e^{x_i}$ overflows f64 once $x_i > \ln(\texttt{f64::MAX}) \approx 709.78$. The max-shifted form keeps every exponent non-positive.

Catastrophic Cancellation

When the $x_i$ are large and similar, direct computation in linear space suffers from catastrophic cancellation and overflow before the final logarithm can rescale the result. The LSE formulation maintains relative precision regardless of scale.
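
A quick standalone illustration of why the max shift matters (local helper functions, not Fugue's implementations):

    // Naive vs. max-shifted log-sum-exp. Inputs above ~709 make exp()
    // overflow f64, so the naive form returns inf while the shifted
    // form stays finite.
    fn naive_lse(xs: &[f64]) -> f64 {
        xs.iter().map(|&x| x.exp()).sum::<f64>().ln()
    }

    fn shifted_lse(xs: &[f64]) -> f64 {
        let m = xs.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
        m + xs.iter().map(|&x| (x - m).exp()).sum::<f64>().ln()
    }

    fn main() {
        let xs = [710.0, 711.0, 709.0];
        println!("naive:   {}", naive_lse(&xs));      // inf
        println!("shifted: {:.4}", shifted_lse(&xs)); // ~711.4076
    }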

    // Demonstrate stable log-probability computations
    let extreme_log_probs = vec![700.0, 701.0, 699.0, 698.0]; // exp() of these is near the f64 overflow threshold

    // Safe log-sum-exp prevents overflow
    let log_normalizer = log_sum_exp(&extreme_log_probs);
    let normalized_probs = normalize_log_probs(&extreme_log_probs);

    println!("✅ Stable computation with extreme log-probabilities");
    println!("   - Log normalizer: {:.2}", log_normalizer);
    println!(
        "   - Probabilities sum to: {:.10}",
        normalized_probs.iter().sum::<f64>()
    );

    // Weighted log-sum-exp for importance sampling
    let log_values = vec![-1.0, -2.0, -3.0, -4.0];
    let weights = vec![0.4, 0.3, 0.2, 0.1];
    let weighted_result = weighted_log_sum_exp(&log_values, &weights);

    println!("   - Weighted log-sum-exp: {:.4}", weighted_result);

    // Safe logarithm handling
    let safe_results: Vec<f64> = [1.0, 0.0, -1.0].iter().map(|&x| safe_ln(x)).collect();
    println!("   - Safe ln results: {:?}", safe_results);

Stability Features:

  • log_sum_exp prevents overflow in mixture computations
  • weighted_log_sum_exp for importance sampling
  • safe_ln handles edge cases gracefully
  • All operations maintain numerical precision across scales

Efficient Trace Construction

When building traces programmatically, use TraceBuilder for optimal performance:

    // Use TraceBuilder for efficient trace creation
    let mut builder = TraceBuilder::new();

    let start = Instant::now();
    for i in 0..100 {
        // Add choices efficiently without reallocations
        builder.add_sample(
            addr!("param", i),
            i as f64,
            0.0, // log_prob
        );
    }

    // Build final trace efficiently
    let constructed_trace = builder.build();
    let construction_time = start.elapsed();

    println!("✅ Efficient trace construction");
    println!(
        "   - Built trace with {} choices in {:?}",
        constructed_trace.choices.len(),
        construction_time
    );
    println!(
        "   - Total log weight: {:.2}",
        constructed_trace.total_log_weight()
    );

Construction Benefits:

  • Pre-allocated data structures minimize reallocations
  • Type-specific insertion methods avoid boxing overhead
  • Batch operations for multiple choices
  • Efficient conversion to immutable traces

Copy-on-Write for MCMC

MCMC algorithms exhibit temporal locality in parameter updates, modifying only $k \ll d$ parameters per iteration, where $d$ is the total dimensionality. Copy-on-Write (COW) data structures exploit this pattern:

graph TD
    subgraph "MCMC Iteration Structure"
        A["Base Trace T₀"] --> B{"Proposal Step"}
        B --> C["Modified Parameters δ"]
        C --> D{"Small Changes?"}
        D -->|Yes| E["COW: Share + Δ"]
        D -->|No| F["Full Copy"]
        E --> G["O(1) Memory"]
        F --> H["O(d) Memory"]
    end

Complexity Analysis: Traditional MCMC requires $O(d)$ space per sample. COW reduces this to $O(\delta)$, where $\delta$ is the edit distance between successive traces.

    // Create base trace manually for MCMC
    let mut builder = TraceBuilder::new();
    builder.add_sample(addr!("mu"), 0.5, -0.5);
    builder.add_sample(addr!("sigma"), 1.0, -1.0);
    builder.add_sample_bool(addr!("component"), true, -0.69);
    let base_trace = builder.build();

    // Create COW trace for efficient copying
    let cow_base = CowTrace::from_trace(base_trace);

    let start = Instant::now();
    let mut mcmc_traces = Vec::new();

    for _proposal in 0..1000 {
        // Clone is O(1) until modification
        let mut proposal_trace = cow_base.clone();

        // Modify only one parameter (triggers COW)
        proposal_trace.insert_choice(
            addr!("mu"),
            Choice {
                addr: addr!("mu"),
                value: ChoiceValue::F64(0.6),
                logp: -0.4,
            },
        );

        mcmc_traces.push(proposal_trace);
    }
    let cow_time = start.elapsed();

    println!("✅ Copy-on-write MCMC proposals");
    println!("   - Created 1000 proposal traces in {:?}", cow_time);
    println!("   - Memory sharing until modification");

MCMC Optimizations:

  • O(1) trace cloning until modification
  • Shared memory for unchanged parameters
  • Lazy copying only when traces diverge
  • Perfect for Metropolis-Hastings and Gibbs sampling

Vectorized Model Patterns

Structure models for efficient batch processing:

    // Pre-allocate data structures for repeated use
    let observations: Vec<f64> = (0..100).map(|i| i as f64 * 0.1).collect();
    let n = observations.len();

    // Efficient vectorized model
    let vectorized_model = || {
        prob!(
            let mu <- sample(addr!("global_mu"), Normal::new(0.0, 10.0).unwrap());
            let precision <- sample(addr!("precision"), Gamma::new(2.0, 1.0).unwrap());
            let sigma = (1.0 / precision).sqrt();

            // Use plate for efficient vectorized operations
            let _likelihoods <- plate!(i in 0..n => {
                observe(addr!("obs", i), Normal::new(mu, sigma).unwrap(), observations[i])
            });

            pure((mu, sigma))
        )
    };

    let start = Instant::now();
    let (_result, _trace) = runtime::handler::run(
        PriorHandler {
            rng: &mut rng,
            trace: Trace::default(),
        },
        vectorized_model(),
    );
    let vectorized_time = start.elapsed();

    println!("✅ Optimized vectorized model");
    println!("   - Processed {} observations in {:?}", n, vectorized_time);

Vectorization Strategy:

  • Pre-allocate data collections
  • Use plate! for independent parallel operations
  • Minimize dynamic allocations in hot paths
  • Leverage compiler optimizations with static sizing

Performance Monitoring

Systematic performance monitoring requires tracking each performance metric against its theoretical bound, e.g. the per-sample efficiency

$$\eta = \frac{t_{\min}}{t_{\text{observed}}} \le 1$$

where $t_{\min}$ is the theoretical minimum execution time per sample.

Amdahl's Law for MCMC

Even with perfect parallelization, MCMC exhibits sequential dependencies that limit speedup:

$$S(p) = \frac{1}{s + \frac{1 - s}{p}}$$

where $s$ is the fraction of sequential computation and $p$ is the number of processors.
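
To get a feel for the bound (plain arithmetic, independent of any Fugue API), a 10% sequential fraction caps speedup below 10x no matter how many cores are added:

    // Amdahl's law: S(p) = 1 / (s + (1 - s) / p).
    // Assumes s = 0.1, i.e. 10% of each iteration is inherently sequential.
    fn main() {
        let s = 0.1;
        for p in [1u32, 2, 4, 8, 16, 64, 1024] {
            let speedup = 1.0 / (s + (1.0 - s) / p as f64);
            println!("p = {:>4}: speedup = {:.2}x", p, speedup);
        }
    }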

    // Monitor trace characteristics for optimization insights
    #[derive(Debug)]
    struct TraceMetrics {
        num_choices: usize,
        log_weight: f64,
        is_valid: bool,
        memory_size_estimate: usize,
    }

    impl TraceMetrics {
        fn from_trace(trace: &Trace) -> Self {
            let num_choices = trace.choices.len();
            let log_weight = trace.total_log_weight();
            let is_valid = log_weight.is_finite();

            // Rough memory estimate (actual implementation would be more precise)
            let memory_size_estimate = num_choices * 64; // Rough bytes per choice

            Self {
                num_choices,
                log_weight,
                is_valid,
                memory_size_estimate,
            }
        }
    }

    // Example: Monitor a complex model's performance
    let complex_model = || {
        prob!(
            let components <- plate!(c in 0..5 => {
                sample(addr!("weight", c), Gamma::new(1.0, 1.0).unwrap())
                    .bind(move |weight| {
                        sample(addr!("mu", c), Normal::new(0.0, 2.0).unwrap())
                            .map(move |mu| (weight, mu))
                    })
            });

            let selector <- sample(addr!("selector"),
                                  Categorical::new(vec![0.2, 0.2, 0.2, 0.2, 0.2]).unwrap());

            pure((components, selector))
        )
    };

    let (_result, trace) = runtime::handler::run(
        PriorHandler {
            rng: &mut rng,
            trace: Trace::default(),
        },
        complex_model(),
    );

    let metrics = TraceMetrics::from_trace(&trace);
    println!("✅ Performance monitoring active");
    println!("   - Trace choices: {}", metrics.num_choices);
    println!("   - Log weight: {:.2}", metrics.log_weight);
    println!("   - Valid: {}", metrics.is_valid);
    println!(
        "   - Memory estimate: {} bytes",
        metrics.memory_size_estimate
    );

Monitoring Approach:

  • Collect trace characteristics for optimization insights
  • Track memory usage patterns
  • Validate numerical stability
  • Profile execution bottlenecks

Batch Processing

Batch processing amortizes setup costs and exploits hardware parallelism. The optimal batch size balances memory usage and throughput:

$$b^* = \arg\min_b \left( \frac{C_{\text{init}}}{b} + c_{\text{mem}} \, b + c_{\text{sync}} \right)$$

where:

  • $C_{\text{init}}$ is the per-batch initialization cost
  • $c_{\text{mem}}$ is the per-sample memory cost
  • $c_{\text{sync}}$ is the synchronization overhead

graph LR
    subgraph "Performance vs Batch Size"
        A["Small Batches<br/>b → 1"] --> B["High Setup<br/>Overhead"]
        C["Large Batches<br/>b → ∞"] --> D["Memory<br/>Pressure"]  
        E["Optimal Batch<br/>b*"] --> F["Balanced<br/>Performance"]
    end
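
Under that cost model (the constants below are illustrative assumptions, not measured Fugue numbers), the optimum can be found by a direct sweep:

    // Hypothetical per-sample cost: C_init / b + c_mem * b + c_sync.
    // The minimum falls at b* = sqrt(C_init / c_mem) = sqrt(500 / 0.05) = 100.
    fn main() {
        let (c_init, c_mem, c_sync) = (500.0_f64, 0.05, 2.0); // assumed costs in µs
        let (best_b, best_cost) = (1u32..=1024)
            .map(|b| (b, c_init / b as f64 + c_mem * b as f64 + c_sync))
            .min_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
            .unwrap();
        println!("b* = {} ({:.2} µs/sample)", best_b, best_cost);
    }

With the batch size chosen, Fugue's pooled handler amortizes allocation across the whole batch:
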
    // Efficient batch inference using memory pooling
    let batch_size = 100;
    let mut batch_pool = TracePool::new(batch_size);

    let start = Instant::now();
    let mut batch_results = Vec::with_capacity(batch_size);

    for _batch in 0..batch_size {
        let (result, trace) = runtime::handler::run(
            PooledPriorHandler::new(&mut rng, &mut batch_pool),
            make_model(),
        );
        // Return trace to pool for reuse
        batch_pool.return_trace(trace);
        batch_results.push(result);
    }

    let batch_time = start.elapsed();
    let batch_stats = batch_pool.stats();

    println!("✅ Batch processing complete");
    println!("   - Processed {} samples in {:?}", batch_size, batch_time);
    println!(
        "   - Average time per sample: {:?}",
        batch_time / batch_size as u32
    );
    println!(
        "   - Memory efficiency: {:.1}% hit ratio",
        batch_stats.hit_ratio()
    );

Batch Benefits:

  • Amortized setup costs across samples
  • Memory pool reuse for consistent performance
  • Scalable to large sample counts
  • Predictable memory footprint

Numerical Precision Testing

Validate stability across different computational scales:

    // Test numerical stability across different scales
    let test_scales = vec![1e-10, 1e-5, 1.0, 1e5, 1e10];

    for &scale in &test_scales {
        let scale: f64 = scale;
        let log_vals = vec![scale.ln() + 1.0, scale.ln() + 2.0, scale.ln() + 0.5];

        let stable_sum = log_sum_exp(&log_vals);
        let log1p_result = log1p_exp(scale.ln());

        println!(
            "   Scale {:.0e}: log_sum_exp={:.4}, log1p_exp={:.4}",
            scale, stable_sum, log1p_result
        );
    }
    println!("✅ Numerical stability verified across scales");

Testing Strategy:

  • Verify stability across extreme value ranges
  • Test edge cases and boundary conditions
  • Validate consistency of numerical operations
  • Profile precision vs. performance trade-offs

Performance Testing

Implement systematic performance validation:

    #[test]
    fn test_memory_pool_efficiency() {
        let mut pool = TracePool::new(10);
        let mut rng = thread_rng();

        // Test pool reuse with PooledPriorHandler
        for _i in 0..20 {
            let (_, trace) = runtime::handler::run(
                PooledPriorHandler::new(&mut rng, &mut pool),
                sample(addr!("test"), Normal::new(0.0, 1.0).unwrap()),
            );
            // Return trace to pool for reuse
            pool.return_trace(trace);
        }

        let stats = pool.stats();
        assert!(
            stats.hit_ratio() > 50.0,
            "Pool should have good hit ratio, got {:.1}%",
            stats.hit_ratio()
        );
        assert!(stats.hits + stats.misses > 0, "Pool should have been used");
    }

    #[test]
    fn test_numerical_stability() {
        // Test log_sum_exp with extreme values
        let extreme_vals = vec![700.0, 701.0, 699.0];
        let result = log_sum_exp(&extreme_vals);
        assert!(
            result.is_finite(),
            "log_sum_exp should handle extreme values"
        );

        // Test normalization
        let normalized = normalize_log_probs(&extreme_vals);
        let sum: f64 = normalized.iter().sum();
        assert!(
            (sum - 1.0).abs() < 1e-10,
            "Normalized probabilities should sum to 1"
        );

        // Test weighted computation
        let weights = vec![0.5, 0.3, 0.2];
        let weighted_result = weighted_log_sum_exp(&extreme_vals, &weights);
        assert!(
            weighted_result.is_finite(),
            "Weighted log_sum_exp should be finite"
        );
    }

    #[test]
    fn test_trace_builder_efficiency() {
        let mut builder = TraceBuilder::new();

        // Add many choices efficiently
        for i in 0..100 {
            builder.add_sample(addr!("param", i), i as f64, -0.5);
        }

        let trace = builder.build();
        assert_eq!(trace.choices.len(), 100);
        assert!(trace.total_log_weight().is_finite());
    }

    #[test]
    fn test_cow_trace_sharing() {
        // Create base trace using builder
        let mut builder = TraceBuilder::new();
        builder.add_sample(addr!("x"), 1.0, -0.5);
        let base = builder.build();
        let cow_trace = CowTrace::from_trace(base);

        // Clone should be fast
        let clone1 = cow_trace.clone();
        let clone2 = cow_trace.clone();

        // Should share data until modification - convert to regular trace to test
        let trace1 = clone1.to_trace();
        let trace2 = clone2.to_trace();
        assert_eq!(trace1.get_f64(&addr!("x")), Some(1.0));
        assert_eq!(trace2.get_f64(&addr!("x")), Some(1.0));
    }

Testing Framework:

  • Memory pool efficiency validation
  • Numerical stability regression tests
  • Trace construction benchmarking
  • COW sharing verification

Production Deployment

Memory Configuration

  • Size TracePool based on peak concurrent inference
  • Monitor hit ratios to validate pool efficiency
  • Use COW traces for MCMC workloads
  • Pre-warm pools before production traffic (see the sketch below)
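
One way to pre-warm, reusing only the pooled-handler pattern shown earlier (`make_model` is the model factory from the memory-pooling example; snippet-style like the examples above, so imports are omitted):

    // Run enough iterations to populate the pool before real traffic
    // arrives, so steady-state requests are served from recycled traces.
    let mut pool = TracePool::new(50);
    let mut rng = thread_rng();
    for _ in 0..50 {
        let (_result, trace) =
            runtime::handler::run(PooledPriorHandler::new(&mut rng, &mut pool), make_model());
        pool.return_trace(trace);
    }
    println!("warmup hit ratio: {:.1}%", pool.stats().hit_ratio());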

Numerical Strategies

  • Always use log-space for probability computations (see the sketch after this list)
  • Validate extreme value handling in testing
  • Monitor for numerical instabilities in production
  • Use stable algorithms for critical computations
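
Why log-space is non-negotiable, as a standalone sketch (plain Rust, no Fugue API): multiplying many small likelihoods underflows f64 to zero, while summing their logarithms preserves the value.

    // 100 observations with likelihood 1e-5 each: the linear product is
    // 1e-500, far below f64's smallest subnormal (~5e-324), so it
    // underflows to 0. The log-space sum stays finite and accurate.
    fn main() {
        let likelihoods = vec![1e-5_f64; 100];
        let linear: f64 = likelihoods.iter().product();
        let log_space: f64 = likelihoods.iter().map(|p| p.ln()).sum();
        println!("linear product: {}", linear);       // 0
        println!("log-space sum:  {:.2}", log_space); // about -1151.29
    }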

Monitoring and Alerting

  • Track inference latency and memory usage
  • Monitor pool statistics and efficiency metrics
  • Alert on numerical instabilities or performance degradation
  • Profile hot paths for optimization opportunities

Common Performance Patterns

  1. Pool First: Use TracePool for any repeated inference
  2. Log Always: Work in log-space for numerical stability
  3. Batch Everything: Amortize costs across multiple samples
  4. Monitor Continuously: Track performance metrics in production
  5. Test Extremes: Validate stability with extreme values

These optimization strategies enable Fugue to handle production-scale probabilistic programming workloads with consistent performance and numerical reliability.