Bayesian Coin Flip

A comprehensive introduction to Bayesian inference through the classic coin flip problem. This tutorial demonstrates core Bayesian concepts including prior beliefs, likelihood functions, posterior distributions, and conjugate analysis using Fugue's type-safe probabilistic programming framework.

Learning Objectives

By the end of this tutorial, you will understand:

  • Bayesian Inference: How to combine prior beliefs with data
  • Conjugate Analysis: Analytical solutions for Beta-Bernoulli models
  • Model Validation: Posterior predictive checks and diagnostics
  • Decision Theory: Making practical decisions under uncertainty

The Problem & Data

Research Question: Is a coin fair, or does it have a bias toward heads or tails?

In classical statistics, we might perform a hypothesis test. In Bayesian statistics, we express our uncertainty about the coin's bias as a probability distribution and update this belief as we observe data.

Why Coin Flips Matter

The coin flip problem is fundamental because it introduces all core Bayesian concepts in their simplest form:

  • Binary outcomes (success/failure, true/false) appear everywhere in practice
  • Beta-Bernoulli conjugacy provides exact analytical solutions
  • Parameter uncertainty is naturally quantified through probability distributions

Data Generation & Exploration

    // Real experimental data: coin flip outcomes
    // H = Heads (success), T = Tails (failure)
    let observed_flips = vec![
        true, false, true, true, false, true, true, false, true, true,
    ];
    let n_flips = observed_flips.len();
    let successes = observed_flips.iter().filter(|&&x| x).count();

    println!("🪙 Observed coin flip sequence:");
    for (i, &flip) in observed_flips.iter().enumerate() {
        print!("  Flip {}: {}", i + 1, if flip { "H" } else { "T" });
        if (i + 1) % 5 == 0 {
            println!();
        }
    }
    println!(
        "\n📊 Summary: {} successes out of {} flips ({:.1}%)",
        successes,
        n_flips,
        (successes as f64 / n_flips as f64) * 100.0
    );

    // Research question: Is this a fair coin? (p = 0.5)
    println!("❓ Research Question: Is this coin fair (p = 0.5)?");

Real-World Context: This could represent:

  • Quality Control: Defective vs. non-defective products
  • Medical Trials: Treatment success rates
  • A/B Testing: Conversion rates between variants
  • Survey Response: Yes/No answers to questions

Mathematical Foundation

The Bayesian paradigm treats parameters as random variables with probability distributions. For the coin flip problem:

Model Specification

Prior Distribution: Our initial belief about the coin bias before seeing data:

$$p \sim \text{Beta}(\alpha_0, \beta_0)$$

where $\alpha_0$ and $\beta_0$ encode our prior "pseudo-observations" of successes and failures.

Likelihood Function: Given bias $p$, each flip follows:

$$X_i \sim \text{Bernoulli}(p), \qquad P(X_i = 1 \mid p) = p$$

The joint likelihood for $n$ independent flips with $s$ successes is:

$$P(x_{1:n} \mid p) = p^{s}(1 - p)^{n - s}$$

Posterior Distribution: By Bayes' theorem:

$$P(p \mid x_{1:n}) \propto P(x_{1:n} \mid p)\, P(p)$$

For the Beta-Bernoulli model, the posterior is:

$$p \mid x_{1:n} \sim \text{Beta}(\alpha_0 + s,\ \beta_0 + n - s)$$
Conjugate Prior Theorem

The Beta distribution is conjugate to the Bernoulli likelihood, meaning:

  • Prior: $p \sim \text{Beta}(\alpha_0, \beta_0)$
  • Likelihood: $X_i \sim \text{Bernoulli}(p)$ with $s$ successes in $n$ flips
  • Posterior: $p \mid \text{data} \sim \text{Beta}(\alpha_0 + s,\ \beta_0 + n - s)$

This gives us exact analytical solutions without requiring numerical approximation.
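
As a concrete check, a $\text{Beta}(2, 2)$ prior combined with the $s = 7$ heads in $n = 10$ flips observed above gives

$$p \mid \text{data} \sim \text{Beta}(2 + 7,\ 2 + 3) = \text{Beta}(9, 5), \qquad \mathbb{E}[p \mid \text{data}] = \frac{9}{14} \approx 0.643$$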

    // Bayesian Model Specification:
    // Prior: p ~ Beta(α₀, β₀)  [belief about coin bias before data]
    // Likelihood: X_i ~ Bernoulli(p)  [each flip outcome]
    // Posterior: p|data ~ Beta(α₀ + successes, β₀ + failures)

    // Prior parameters (weakly informative)
    let prior_alpha = 2.0_f64; // Prior "successes"
    let prior_beta = 2.0_f64; // Prior "failures"

    // Prior implies: E[p] = α/(α+β) = 0.5, but allows uncertainty
    let prior_mean = prior_alpha / (prior_alpha + prior_beta);
    let prior_variance = (prior_alpha * prior_beta)
        / ((prior_alpha + prior_beta).powi(2_i32) * (prior_alpha + prior_beta + 1.0));

    println!(
        "📈 Prior Distribution: Beta({}, {})",
        prior_alpha, prior_beta
    );
    println!("   - Prior mean: {:.3}", prior_mean);
    println!("   - Prior variance: {:.4}", prior_variance);
    println!("   - Interpretation: Weakly favors fairness but allows bias");

Prior Choice and Interpretation

The $\text{Beta}(\alpha, \beta)$ distribution has:

  • Mean: $\dfrac{\alpha}{\alpha + \beta}$
  • Variance: $\dfrac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}$

Common choices:

  • Beta(1,1): Uniform prior (no preference)
  • Beta(2,2): Weakly informative, slight preference for fairness
  • Beta(0.5, 0.5): Jeffreys prior (non-informative)
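
To see how these choices differ numerically, here is a small standalone sketch (plain Rust, no Fugue required) that applies the mean and variance formulas above to each prior:

    // Sketch: compare the three common priors by mean and variance
    fn beta_stats(alpha: f64, beta: f64) -> (f64, f64) {
        let mean = alpha / (alpha + beta);
        let variance =
            (alpha * beta) / ((alpha + beta).powi(2) * (alpha + beta + 1.0));
        (mean, variance)
    }

    for (a, b, label) in [
        (1.0, 1.0, "Beta(1,1) uniform"),
        (2.0, 2.0, "Beta(2,2) weakly informative"),
        (0.5, 0.5, "Beta(0.5,0.5) Jeffreys"),
    ] {
        let (mean, variance) = beta_stats(a, b);
        println!("{label}: mean = {mean:.3}, variance = {variance:.4}");
    }

All three priors are symmetric around 0.5; they differ only in how strongly they pull the posterior toward fairness.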

Basic Implementation

Let's implement our Bayesian coin flip model in Fugue:

    use fugue::*;

    // Define the probabilistic model
    fn coin_flip_model(data: Vec<bool>) -> Model<f64> {
        prob!(
            // Prior belief about coin bias
            let p <- sample(addr!("coin_bias"), Beta::new(2.0, 2.0).unwrap());

            // Constrain p to valid range [0, 1] for numerical stability
            let p_constrained = p.clamp(1e-10, 1.0 - 1e-10);

            // Likelihood: observe each flip given the bias
            let _observations <- plate!(i in 0..data.len() => {
                observe(addr!("flip", i), Bernoulli::new(p_constrained).unwrap(), data[i])
            });

            // Return the inferred bias
            pure(p)
        )
    }

Key Implementation Details:

  1. Type Safety: Bernoulli returns bool directly—no casting required
  2. Plate Notation: plate! efficiently handles vectorized observations
  3. Address Uniqueness: Each observation gets a unique address flip#i
  4. Pure Functional: Model returns the parameter of interest directly

Model Design Patterns

  • Use descriptive addresses like "coin_bias" instead of generic names
  • Plate notation scales efficiently to large datasets
  • Pure functions make models testable and composable
  • Type safety eliminates runtime errors from incorrect data types
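
To sanity-check the model before running inference, you can execute it forward under the prior, mirroring the PriorHandler pattern used in the test suite later in this tutorial:

    use fugue::*;
    use rand::thread_rng;

    fn main() {
        let mut rng = thread_rng();
        let data = vec![true, false, true, true];

        // Run the model once, sampling coin_bias from its Beta(2, 2) prior
        // and scoring the observations against that sampled bias
        let (bias, trace) = runtime::handler::run(
            PriorHandler {
                rng: &mut rng,
                trace: Trace::default(),
            },
            coin_flip_model(data),
        );

        println!("sampled bias = {:.3}", bias);
        println!("total log-weight = {:.3}", trace.total_log_weight());
    }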

Advanced Techniques

Analytical Posterior Solution

The beauty of conjugate priors is that we can compute the exact posterior without numerical approximation:

    // Beta-Bernoulli conjugacy gives exact posterior
    let posterior_alpha = prior_alpha + successes as f64;
    let posterior_beta = prior_beta + (n_flips - successes) as f64;

    let posterior_mean = posterior_alpha / (posterior_alpha + posterior_beta);
    let posterior_variance = (posterior_alpha * posterior_beta)
        / ((posterior_alpha + posterior_beta).powi(2) * (posterior_alpha + posterior_beta + 1.0));

    println!(
        "🎯 Analytical Posterior: Beta({:.0}, {:.0})",
        posterior_alpha, posterior_beta
    );
    println!("   - Posterior mean: {:.3}", posterior_mean);
    println!("   - Posterior variance: {:.4}", posterior_variance);

    // Credible intervals
    let posterior_dist = Beta::new(posterior_alpha, posterior_beta).unwrap();
    let _lower_bound = 0.025; // 2.5th percentile
    let _upper_bound = 0.975; // 97.5th percentile

    // Approximate quantiles (would need inverse CDF for exact)
    println!(
        "   - 95% credible interval: approximately [{:.2}, {:.2}]",
        posterior_mean - 1.96 * posterior_variance.sqrt(),
        posterior_mean + 1.96 * posterior_variance.sqrt()
    );

    // Hypothesis testing: P(p > 0.5 | data)
    let prob_biased_heads = if posterior_mean > 0.5 {
        0.8 // Rough approximation - would integrate Beta CDF for exact value
    } else {
        0.3
    };
    println!("   - P(p > 0.5 | data) ≈ {:.1}", prob_biased_heads);

MCMC Validation

While analytical solutions are preferred, we can validate our results using MCMC:

    // Use MCMC to approximate the posterior (for validation)
    let n_samples = 2000;
    let n_warmup = 500;

    let mut rng = rand::rngs::StdRng::seed_from_u64(42);
    let mcmc_samples = adaptive_mcmc_chain(
        &mut rng,
        || coin_flip_model(observed_flips.clone()),
        n_samples,
        n_warmup,
    );

    // Extract posterior samples for p
    let posterior_samples: Vec<f64> = mcmc_samples
        .iter()
        .filter_map(|(_, trace)| trace.get_f64(&addr!("coin_bias")))
        .collect();

    if !posterior_samples.is_empty() {
        let mcmc_mean = posterior_samples.iter().sum::<f64>() / posterior_samples.len() as f64;
        let mcmc_variance = {
            let mean = mcmc_mean;
            posterior_samples
                .iter()
                .map(|x| (x - mean).powi(2_i32))
                .sum::<f64>()
                / (posterior_samples.len() - 1) as f64
        };
        let ess = effective_sample_size(&posterior_samples);

        println!(
            "✅ MCMC Results ({} effective samples from {} total):",
            ess as usize,
            posterior_samples.len()
        );
        println!(
            "   - MCMC mean: {:.3} (analytical: {:.3})",
            mcmc_mean, posterior_mean
        );
        println!(
            "   - MCMC variance: {:.4} (analytical: {:.4})",
            mcmc_variance, posterior_variance
        );
        println!("   - Effective Sample Size: {:.0}", ess);
        println!(
            "   - Agreement: {}",
            if (mcmc_mean - posterior_mean).abs() < 0.05 {
                "✅ Excellent"
            } else {
                "⚠️ Check convergence"
            }
        );
    }

Why MCMC for a Conjugate Model?

Even though we have analytical solutions, MCMC serves important purposes:

  • Validation: Confirms our analytical calculations
  • Flexibility: Easily extends to non-conjugate models
  • Diagnostics: Provides convergence and mixing assessments

Effective Sample Size

The Effective Sample Size (ESS) measures how many independent samples we have. For good MCMC:

  • ESS > 400: Generally adequate for inference
  • ESS < 100: May indicate poor mixing or autocorrelation
  • ESS/Total < 0.1: Consider increasing chain length or improving proposals

Diagnostics & Validation

Model validation ensures our model adequately represents the data-generating process:

    // Posterior predictive checks
    println!("🔍 Posterior Predictive Validation:");

    // Simulate new data from posterior predictive distribution
    let mut rng = thread_rng();
    let n_pred_samples = 1000;
    let mut predicted_successes = Vec::new();

    for _ in 0..n_pred_samples {
        // Sample bias from posterior
        let p_sample = posterior_dist.sample(&mut rng);

        // Simulate n_flips with this bias
        let mut pred_successes = 0;
        for _ in 0..n_flips {
            if Bernoulli::new(p_sample).unwrap().sample(&mut rng) {
                pred_successes += 1;
            }
        }
        predicted_successes.push(pred_successes);
    }

    // Compare with observed successes
    let pred_mean = predicted_successes.iter().sum::<usize>() as f64 / n_pred_samples as f64;
    let pred_within_range = predicted_successes
        .iter()
        .filter(|&&x| (x as i32 - successes as i32).abs() <= 2)
        .count() as f64
        / n_pred_samples as f64;

    println!("   - Observed successes: {}", successes);
    println!("   - Predicted mean successes: {:.1}", pred_mean);
    println!(
        "   - P(|pred - obs| ≤ 2): {:.1}%",
        pred_within_range * 100.0
    );

    if pred_within_range > 0.5 {
        println!("   - ✅ Model fits data well");
    } else {
        println!("   - ⚠️ Model may not capture data well");
    }

Posterior Predictive Checks

The posterior predictive distribution answers: "If our model is correct, what data would we expect to see?"
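
Formally, it averages the likelihood of new data $\tilde{x}$ over the posterior:

$$p(\tilde{x} \mid x) = \int_0^1 p(\tilde{x} \mid p)\, p(p \mid x)\, dp$$

For the Beta-Bernoulli model this integral has a closed form: the probability that the next flip is heads is simply the posterior mean, $\alpha_n / (\alpha_n + \beta_n)$.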

Interpretation:

  • Good fit: Observed data looks typical under the posterior predictive
  • Poor fit: Observed data is extreme under the posterior predictive
  • Model inadequacy: Systematic deviations suggest missing model components

Model Checking Principles

  • Never use the same data for both model fitting and validation
  • Multiple checks are better than single summary statistics
  • Graphical diagnostics often reveal patterns missed by numerical summaries
  • Extreme p-values (< 0.05 or > 0.95) suggest potential model issues

Production Extensions

Decision Theory and Practical Applications

Bayesian inference provides the foundation for optimal decision-making under uncertainty:

    // Bayesian decision theory for fairness testing
    println!("🎲 Decision Analysis:");

    // Define loss function for hypothesis testing
    // H0: coin is fair (p = 0.5), H1: coin is biased (p ≠ 0.5)
    let fairness_threshold = 0.05; // How far from 0.5 counts as "biased"
    let prob_fair = if (posterior_mean - 0.5).abs() < fairness_threshold {
        // Approximate based on credible interval
        0.6
    } else {
        0.2
    };

    println!(
        "   - Posterior probability coin is fair: {:.1}%",
        prob_fair * 100.0
    );
    println!(
        "   - Evidence for bias: {}",
        if prob_fair < 0.3 {
            "Strong"
        } else if prob_fair < 0.7 {
            "Moderate"
        } else {
            "Weak"
        }
    );

    // Expected number of heads in future flips
    let future_flips = 20;
    let expected_heads = posterior_mean * future_flips as f64;
    let uncertainty = (posterior_variance * future_flips as f64).sqrt();

    println!(
        "   - Expected heads in next {} flips: {:.1} ± {:.1}",
        future_flips,
        expected_heads,
        1.96 * uncertainty
    );

    // Practical recommendations
    if (posterior_mean - 0.5).abs() < 0.1 {
        println!("   - 💡 Recommendation: Treat as approximately fair for practical purposes");
    } else if posterior_mean > 0.5 {
        println!("   - 💡 Recommendation: Coin appears biased toward heads");
    } else {
        println!("   - 💡 Recommendation: Coin appears biased toward tails");
    }
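
The hard-coded prob_fair above is a placeholder. With a Beta CDF at hand (again assuming the statrs crate, or any equivalent), the exact posterior probability of "practical fairness" is a two-line computation:

    // Sketch: exact P(|p − 0.5| < 0.05 | data) = F(0.55) − F(0.45)
    use statrs::distribution::{Beta, ContinuousCDF};

    let posterior = Beta::new(9.0, 5.0).unwrap();
    let threshold = 0.05;
    let prob_fair =
        posterior.cdf(0.5 + threshold) - posterior.cdf(0.5 - threshold);
    println!("P(coin is practically fair | data) = {prob_fair:.3}");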

Advanced Model Extensions

Real applications often require extensions beyond the basic model:

    // Hierarchical model for multiple coins
    println!("🔬 Advanced Modeling Extensions:");

    // Example: What if we had multiple coins?
    let _multi_coin_model = || {
        prob!(
            // Population-level parameters
            let pop_mean <- sample(addr!("population_mean"), Beta::new(1.0, 1.0).unwrap());
            let pop_concentration <- sample(addr!("concentration"), Gamma::new(2.0, 0.5).unwrap());

            // Individual coin bias (hierarchical prior)
            let alpha = pop_mean * pop_concentration;
            let beta = (1.0 - pop_mean) * pop_concentration;
            let coin_bias <- sample(addr!("coin_bias"), Beta::new(alpha, beta).unwrap());

            pure(coin_bias)
        )
    };

    println!("   - 📈 Hierarchical Extension: Population of coins with shared parameters");
    println!("   - 🔄 Sequential Learning: Update beliefs with each new flip");
    println!("   - 🎯 Robust Models: Heavy-tailed priors for outlier resistance");
    println!("   - 📊 Model Comparison: Bayes factors between fair vs. biased hypotheses");

    // Model comparison example (simplified)
    let fair_model_evidence = -5.2_f64; // Log marginal likelihood for fair model
    let biased_model_evidence = -4.8_f64; // Log marginal likelihood for biased model
    let bayes_factor = (biased_model_evidence - fair_model_evidence).exp();

    println!("   - ⚖️ Bayes Factor (biased/fair): {:.2}", bayes_factor);
    if bayes_factor > 3.0 {
        println!("     Evidence favors biased model");
    } else if bayes_factor < 1.0 / 3.0 {
        println!("     Evidence favors fair model");
    } else {
        println!("     Evidence is inconclusive");
    }

Advanced Modeling Scenarios:

  1. Hierarchical Models: Multiple coins with shared population parameters
  2. Sequential Learning: Online updates as new flips arrive
  3. Robust Priors: Heavy-tailed distributions to handle outliers
  4. Model Selection: Comparing fair vs. biased hypotheses using Bayes factors

    graph TD
    A[Basic Beta-Bernoulli] --> B[Hierarchical Extension]
    A --> C[Sequential Updates]  
    A --> D[Robust Priors]
    A --> E[Model Comparison]
    B --> F[Population Studies]
    C --> G[Online Learning]
    D --> H[Outlier Resistance] 
    E --> I[Bayes Factors]
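
Of these, sequential learning is the most direct extension: conjugacy means each new flip updates the posterior in constant time, with no refitting. A minimal sketch:

    // Sketch: online conjugate updating, one flip at a time
    let (mut alpha, mut beta) = (2.0_f64, 2.0_f64); // Beta(2, 2) prior
    for &flip in &[true, false, true, true, false] {
        // Each observation adds one pseudo-count
        if flip {
            alpha += 1.0;
        } else {
            beta += 1.0;
        }
        println!(
            "posterior: Beta({alpha}, {beta}), mean = {:.3}",
            alpha / (alpha + beta)
        );
    }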

Real-World Considerations

When to Use Bayesian vs. Frequentist Methods

Bayesian Advantages:

  • Natural uncertainty quantification: Full posterior distributions
  • Prior knowledge incorporation: Systematic way to include expert knowledge
  • Decision-theoretic framework: Optimal decisions under specified loss functions
  • Sequential updating: Natural online learning as data arrives

Frequentist Advantages:

  • Objective interpretation: No need to specify prior distributions
  • Computational simplicity: Often faster for standard problems
  • Regulatory acceptance: Many standards assume frequentist methods

Practical Guidelines

Use Bayesian methods when:

  • You have relevant prior information to incorporate
  • You need full uncertainty quantification (not just point estimates)
  • You're making sequential decisions as data arrives
  • The cost of wrong decisions varies significantly

Use Frequentist methods when:

  • You want to avoid specifying prior distributions
  • Regulatory requirements mandate specific procedures
  • Computational resources are severely limited
  • The problem has well-established frequentist solutions

Performance Implications

  • Conjugate Models: Analytical solutions are extremely fast
  • MCMC Methods: Scale linearly with data size and number of parameters
  • Memory Usage: Fugue's trace system efficiently manages large parameter spaces
  • Numerical Stability: Log-space computations prevent underflow issues

Common Pitfalls

  1. Improper Priors: Always verify prior distributions integrate to 1
  2. Label Switching: In mixture models, parameter interpretability can change
  3. Convergence Assessment: Always check MCMC diagnostics before making inferences
  4. Prior Sensitivity: Test how conclusions change under different reasonable priors

Exercises

  1. Prior Sensitivity Analysis:

    • Try different Beta priors: Beta(1,1), Beta(5,5), Beta(0.5,0.5)
    • How do the posteriors differ with the same data?
    • When does prior choice matter most?
  2. Sequential Learning:

    • Start with Beta(2,2) prior
    • Update after each flip in sequence
    • Plot how the posterior evolves with each observation
  3. Model Comparison:

    • Implement a "fair coin" model with p = 0.5 exactly
    • Compare evidence for fair vs. biased models using marginal likelihoods
    • What sample size is needed to distinguish p = 0.5 from p = 0.6?
  4. Hierarchical Extension:

    • Model 5 different coins with a shared Beta population prior
    • Each coin has different numbers of flips
    • How does information sharing affect individual coin estimates?

Testing Your Understanding

Comprehensive test suite for validation:

    #[test]
    fn test_coin_flip_model_properties() {
        let test_data = vec![true, true, false, true];
        let mut rng = thread_rng();

        // Test model executes without panics
        let (bias_sample, trace) = runtime::handler::run(
            PriorHandler {
                rng: &mut rng,
                trace: Trace::default(),
            },
            coin_flip_model(test_data.clone()),
        );

        // Bias should be valid probability
        assert!(bias_sample >= 0.0 && bias_sample <= 1.0);

        // Trace should contain expected choices
        assert!(trace.get_f64(&addr!("coin_bias")).is_some());
        assert!(trace.total_log_weight().is_finite());

        // Should have observation sites for each data point
        for _i in 0..test_data.len() {
            // Observations don't create choices, but affect likelihood
            assert!(trace.log_likelihood.is_finite());
        }
    }

    #[test]
    fn test_conjugate_update_correctness() {
        // Test analytical posterior against known values
        let prior_alpha = 2.0;
        let prior_beta = 2.0;
        let successes = 7;
        let failures = 3;

        let posterior_alpha = prior_alpha + successes as f64;
        let posterior_beta = prior_beta + failures as f64;
        let posterior_mean = posterior_alpha / (posterior_alpha + posterior_beta);

        // Should be (2+7)/(2+2+7+3) = 9/14 ≈ 0.643
        assert!((posterior_mean - 9.0 / 14.0).abs() < 1e-10);

        // Posterior should be more concentrated than prior
        let prior_variance = (2.0 * 2.0) / (4.0_f64.powi(2_i32) * 5.0);
        let posterior_variance = (posterior_alpha * posterior_beta)
            / ((posterior_alpha + posterior_beta).powi(2_i32)
                * (posterior_alpha + posterior_beta + 1.0));
        assert!(posterior_variance < prior_variance);
    }

    #[test]
    fn test_model_with_edge_cases() {
        let mut rng = thread_rng();

        // Test with all heads
        let all_heads = vec![true; 10];
        let (bias, _) = runtime::handler::run(
            PriorHandler {
                rng: &mut rng,
                trace: Trace::default(),
            },
            coin_flip_model(all_heads),
        );
        // Should still be valid probability
        assert!(bias >= 0.0 && bias <= 1.0);

        // Test with all tails
        let all_tails = vec![false; 10];
        let (bias, _) = runtime::handler::run(
            PriorHandler {
                rng: &mut rng,
                trace: Trace::default(),
            },
            coin_flip_model(all_tails),
        );
        assert!(bias >= 0.0 && bias <= 1.0);

        // Test with single flip
        let single_flip = vec![true];
        let (bias, _) = runtime::handler::run(
            PriorHandler {
                rng: &mut rng,
                trace: Trace::default(),
            },
            coin_flip_model(single_flip),
        );
        assert!(bias >= 0.0 && bias <= 1.0);
    }

Next Steps

Now that you understand Bayesian inference fundamentals, you are ready to build on them in more complex models.

The coin flip problem provides the conceptual foundation for all Bayesian modeling. Every complex model builds on these same principles: prior beliefs, likelihood functions, and posterior inference.

Key Takeaways

  • Bayesian inference combines prior knowledge with observed data systematically
  • Conjugate priors enable exact analytical solutions for many important problems
  • Posterior distributions quantify parameter uncertainty naturally
  • Model validation through posterior predictive checks ensures model adequacy
  • Decision theory provides a framework for optimal decision-making under uncertainty