Brute force

A popular approach to mashing data into hypotheses, easy both to imagine and to construct, is grid approximation. We imagine 11 possible, and quite hypothetical, probabilities of a positive result in, say, testing for a novel virus. This is our grid of hypotheses. In this scenario we sample 8 zip codes in the Bronx. Suppose we find 2 positive zip codes, the rest negative. We will call a positive a success. The binary-choice binomial model will be the observational probabilistic model. We tibble into a solution.

#library(tidyverse)
#library(rethinking)
h <- 11 # number of grid points for the probability of a positive test
n_success <- 2 # 2 remarkable results - yikes
n_trials  <- 8 # 8 Bronx zip codes
d <-
  tibble(
    p_grid = seq( from = 0, 
                  to = 1, 
                  length.out = h ),
    # note we use a flat not so informative uniform prior
    prior  = 1
    ) |>
  mutate(likelihood = dbinom(n_success, 
                             size = n_trials, 
                             prob = p_grid)) |> 
  mutate(posterior = (likelihood*prior)/sum(likelihood*prior))

summary( d )
##      p_grid         prior     likelihood          posterior       
##  Min.   :0.00   Min.   :1   Min.   :0.0000000   Min.   :0.000000  
##  1st Qu.:0.25   1st Qu.:1   1st Qu.:0.0005848   1st Qu.:0.000527  
##  Median :0.50   Median :1   Median :0.0412877   Median :0.037205  
##  Mean   :0.50   Mean   :1   Mean   :0.1008848   Mean   :0.090909  
##  3rd Qu.:0.75   3rd Qu.:1   3rd Qu.:0.1789112   3rd Qu.:0.161220  
##  Max.   :1.00   Max.   :1   Max.   :0.2964755   Max.   :0.267159

The job is almost done, and now for the core reason we are here: we sample these hypothetical values of the probability of a single positive test and visualize our handiwork.

# how many samples would you like?
n_samples <- 10000
# make it reproducible
set.seed(42)
samples <-
  d |> 
  sample_n( size = n_samples, 
            weight = posterior, replace = T )
glimpse(samples)
## Rows: 10,000
## Columns: 4
## $ p_grid     <dbl> 0.5, 0.5, 0.2, 0.1, 0.4, 0.2, 0.1, 0.3, 0.4, 0.4, 0.2, 0.4,…
## $ prior      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ likelihood <dbl> 0.10937500, 0.10937500, 0.29360128, 0.14880348, 0.20901888,…
## $ posterior  <dbl> 0.09855972, 0.09855972, 0.26456924, 0.13408941, 0.18835056,…
#
y_label <- "h = proportion of positive tests"
x_label <- "sample index"
samples |> 
  mutate(sample_number = 1:n()) |>
# Here's the cloud of unknowing
  ggplot(aes( x = sample_number, y = p_grid) ) +
  geom_point( alpha = 1/10 ) +
  scale_y_continuous( y_label, limits = c(0, 1) ) +
  xlab( x_label )

Let’s transform the cloud of unknowing into the frequency of ways in which 8 zip codes can plausibly have 2 positive results. Let’s also add a vertical line to indicate the mode of this posterior distribution. The tidybayes package has a slew of useful summarizing statistics, including Mode(), which we use below to find the proportion of tests that corresponds to the posterior’s mode.1 The Maximum A Posteriori (aka MAP) hypothesis is just the mode, that is, the most frequently occurring value of the parameter \(p\). This will be one point estimate we can report.

#library(plotly)
library(tidybayes)
p_MAP <- Mode(samples$p_grid)
title <- "Bronx Zip Code Tests"
x_label <- "proportion of zip codes testing positive"
y_label <- "posterior density"
p <- samples |> 
  ggplot(aes(x = p_grid)) +
  geom_histogram(fill = "blue", alpha = 0.3) +
  scale_x_continuous(x_label, limits = c(0, 1)) +
  geom_vline(xintercept=p_MAP, color = "orange") +
  annotate( "text", x = 0.50, y = 2000, 
            label = concat("Maximum a Posteriori = ", round(p_MAP, 4)) ) +
  ylab(y_label) + xlab(x_label) +
  ggtitle(title)
plotly::ggplotly(p) # for interactive graphics

Other useful summaries include quantiles. We can summarize our testing results in a number of ways, depending on the purpose of the report and the desires of the client.
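
For example, here is a minimal sketch of one such summary, assuming the samples tibble from above is still in scope: the 5%, 50%, and 95% quantiles of the sampled values of p_grid (the cut points are our choice, nothing more).

samples |> 
  summarise( q_05   = quantile(p_grid, 0.05),   # lower 5% cut
             median = quantile(p_grid, 0.50),   # posterior median
             q_95   = quantile(p_grid, 0.95) )  # upper 95% cut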

Look: up in the sky it’s a …

Instead of our brute force grid approximation technique we can deploy Laplace, Truscott, and Emory’s (1902) quadratic approximation technique. This approach takes advantage of the observation that, around its mode, the posterior distribution is approximately Gaussian-shaped – the bell curve. The logarithm of a Gaussian density is a quadratic function of the hypotheses, hence the name. The technique works well whenever the posterior is unimodal and roughly symmetric, with most of its density concentrated around the mode. Other observational models, such as heavy-tailed extreme-value, Pareto, and power-law distributions, will require a very different treatment than this quadratic expansion.
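
As a quick illustration of that quadratic claim, here is a sketch with arbitrary values for the mean and standard deviation: the log of a Gaussian density evaluated on an evenly spaced grid has constant second differences, the fingerprint of a parabola.

mu_grid  <- seq(4, 8, length.out = 5)                  # an evenly spaced grid
log_dens <- dnorm(mu_grid, mean = 6, sd = 2, log = TRUE)
diff(diff(log_dens))                                   # constant -0.25: a parabola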

Here is a version for the simplest model that is not binomial (or Poisson, for that matter): estimating the mean and standard deviation of Gaussian data. The Laplace quadratic approximation technique has two steps.

  1. Find the optimal values of posterior parameters (MAP).

  2. Given the optimal values of the posterior parameters and the curvature of the log posterior around them (how the parameters interact across the data), simulate posterior values of the parameters and predictions of the data.

We will begin to control underflow and overflow issues (really small and really large numbers) by using the following relationship between the product of two likelihoods, \(pr_1\) and \(pr_2\), and the exponential of the sum of their logarithms.

\[\begin{equation} pr_1 \, pr_2 = \exp\left[\log(pr_1)+\log(pr_2)\right] \end{equation}\]
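
A quick sanity check with made-up, tiny likelihoods shows why we bother: the direct product underflows to zero in double precision, while the sum of logarithms remains a perfectly ordinary number we can keep computing with.

pr_1 <- 1e-200
pr_2 <- 1e-200
pr_1 * pr_2              # underflows to 0
log(pr_1) + log(pr_2)    # about -921, no trouble at all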

We shift gears out of the first gear of binomial models into the second gear of a simple mean and standard deviation for a set of continuous data, like temperature, electrical usage, or even hectares of land versus water. We suppose our data looks like this (simulated to protect the innocent) and pretend we are marketing managers reviewing a price predator’s poaching of our customers, measured in the amount of time customers spend with the predator each month. The dimensions of \(y\) are customer-hours per month.

set.seed(4242)
y <- rnorm(n = 1000, mean = 6 , sd = 2) 
# customer-hours per month 
# (2 hours/visit*3 visits/month=6 customer-hours)
c(mean = mean(y), sd = sd(y))
##     mean       sd 
## 5.939485 2.049180

That’s the data: \(y_i\) for \(i=1, \ldots, 1000\), and its specification in R. Generating synthetic data like this helps us be sure our more realistic models are working correctly. By working correctly we mean that the parameters we estimate recover the parameters we used to simulate the data.

For the prior on the mean we will use a normal (Gaussian) distribution, centered at 0 with a generous standard deviation of 100, since we do not know any differently at this stage. We could regularize this range by considering the data itself: a reasonable range might be a mean of 6 plus or minus 2 standard deviations. Normal distributions are symmetric and look almost triangular if we squint at them.

We let the standard deviation take on a log-Gaussian shape, which is bounded below at 0, since standard deviations are always greater than or equal to zero. The lognormal distribution has a shape that feels a bit exponential, almost chi-squared, but that’s a consideration for later. There’s an assumption we can build in! The full model looks like this:

\[ \begin{align} y_i & \sim \operatorname{Normal}(\mu, \sigma) \\ \mu & \sim \operatorname{Normal}(0,100) \\ \sigma & \sim \operatorname{LogNormal}(0,10) \\ \end{align} \]
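
If we would rather look at those prior shapes than squint at them, a short sketch of the two densities does the job (no data involved; the plotting ranges are arbitrary).

curve( dnorm(x, 0, 100), from = -300, to = 300,
       xlab = "mu", ylab = "prior density" )      # a wide, gentle bell
curve( dlnorm(x, 0, 10), from = 0, to = 20,
       xlab = "sigma", ylab = "prior density" )   # bounded below at zero, heavily right-skewed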

The likelihood function for data \(y_i\) looks like this:

\[ \begin{equation} \operatorname{Pr}(y_i \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\operatorname{exp}\left[\frac{-(y_i - \mu)^2}{2\sigma^2}\right] \end{equation} \] Ye gads! But not to worry, this density function is computed in R with dnorm(). This beast of a formula has \(\pi\) in it, so it very likely has something to do with circles and rational trigonometry, again for later, some day, perhaps, maybe. It is definitely easier to use the ~ language above to specify models.
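
To convince ourselves that dnorm() really is this formula, a one-point check with arbitrarily chosen values does nicely.

y_1 <- 5; mu_0 <- 6; sigma_0 <- 2                             # arbitrary values
dnorm(y_1, mu_0, sigma_0)                                     # R's built-in density
1/(sigma_0*sqrt(2*pi)) * exp(-(y_1 - mu_0)^2/(2*sigma_0^2))   # the formula by hand
# both return 0.1760327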

We ask dnorm() to return the logarithm of this computation by setting its log = T argument. The function below computes, for the two parameters in parm and the data in y, the log posterior density. If we operate on this number with exp() we get back the original product of likelihood and priors, effectively “un-logging” it. Note how we incorporate the two assumptions about the parameters as priors in the computation.

log_lik_norm <- function(parm, y) {
    # log likelihood of all the data given mu and sigma
    log_lik <- sum(dnorm(y, 
                         parm["mu"], 
                         parm["sigma"], log = T))  
    # add the log priors for mu and sigma
    log_post <- log_lik + 
      dnorm(parm["mu"], 0, 100, log = T) + 
      dlnorm(parm["sigma"], 0, 10, log = T)
    return(log_post)
}
parm <- c(mu=9, sigma=6)
log_lik_norm( parm, y)
##        mu 
## -2909.612

Yes, a big negative number ensues. A negative log corresponds to a number between 0 and 1, so it stands in for the \(Pr(d \mid h)\) numbers we have become accustomed to. In this case, the exponential of the log posterior is very, very close to zero.
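
We can check that very, very close to zero claim directly; this is just a sketch, reusing parm and y from above.

exp( log_lik_norm(parm, y) )   # so small it underflows to exactly 0
log(.Machine$double.xmin)      # anything below about -708 underflows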

So the quadratic approximation estimator works by handing the data \(y_i\) and initial guesses about parameter values to an optimizer that searches across values of mu and sigma for the maximum log posterior of seeing all of the data y, given a mu and sigma. We hand the job to R’s optim(), with control = list(fnscale = -1) to turn its default minimizer into a maximizer and hessian = TRUE to return the curvature at the optimum. This is a bit different than calculating the arithmetic mean and standard deviation, isn’t it? But it is not at all different from finding the maximum \(Pr(h \mid d)\) when we counted the ways.

inits <- c(mu = 0, sigma = 1)
fit <- optim(inits, 
             log_lik_norm, 
             control = list(fnscale = -1), 
             hessian = TRUE, 
             y = y)
fit$par
##       mu    sigma 
## 5.939872 2.047886
fit$hessian
##                  mu         sigma
## mu    -238.44521570    0.09018004
## sigma    0.09018004 -477.07903900

We get two hard-fought-for (in the algorithm at least) results of immediate interest. The first is the pair of point estimates of \(\mu\) and \(\sigma\) that maximize the posterior, the values that best match data with hypotheses. The second is a matrix of second derivatives of the log posterior with respect to \(\mu\) and \(\sigma\), called the Hessian matrix, which records how the two parameters interact.

What we get from this exercise is a lot more than arithmetic means and standard deviations. We are skeptical analysts, so we want more than point estimates; knowing, after all, is the compound of experience, understanding, and judgment. We need to learn how variation in the data shows up as variation in the mean and standard deviation, and how the mean and standard deviation end up interacting with one another.

The interactions of the mean and standard deviation come from the Hessian matrix, which records the curvature of the log posterior surface: how steeply it falls away from the optimum in each parameter direction and how the two parameters co-vary. The optimizer climbs this surface by steepest ascent to the best \(\mu, \sigma\) hypothesis pair, here \((5.9398722, 2.047886)\), and the same matrix will give us the variance-covariance of the solution. We are not done!

Next we use the fitted parameters, the Maximum A Posteriori values \(\mu = 5.9398722\) and \(\sigma = 2.047886\), to generate posterior predictive samples along with samples of \(\mu\) and \(\sigma\); we plot them, talk about them, and then we are done (for the moment!).

Mathematically, if we invert the negative of the Hessian matrix, we get back the variance-covariance matrix of the two parameters. Frequentist-(mean)while, the square roots of the diagonal of this inverted matrix are the standard errors that form the basis of t-tests of the null hypothesis that \(\mu = 0\) or some other target value \(\mu_0\). But we are not frequentists, so Probabilist-(mean)while we return to sampling.

par_mean <- fit$par
par_varcov <- solve(-fit$hessian)
round( par_mean, 2 )
##    mu sigma 
##  5.94  2.05
round( par_varcov, 4 )
##           mu  sigma
## mu    0.0042 0.0000
## sigma 0.0000 0.0021

Just to put a point on it all, the standard errors of the estimates are the square roots of this diagonal.

round(diag(par_varcov)^0.5, 4)
##     mu  sigma 
## 0.0648 0.0458

In order to sample posterior predictive views, which are really hypotheses of what might be, of \(\mu\) and \(\sigma\) as Gaussian variates, we need to take into account not only the means and variances of the \(\mu\) and \(\sigma\) parameters, but also how these parameters co-vary with one another (here the covariances are nearly zero). The mvtnorm package will come to our aid. First we draw samples of \(\mu\) and \(\sigma\) from the posterior probability urn, then attach a simulation of predictions for \(y\).

We use the coda package to display the distributions this time, since it is a popular way to visualize Monte Carlo simulations of the Markov chain species. It is also a package that McElreath’s rethinking package, home of quap(), uses under the hood.

library(mvtnorm)
samples <- rmvnorm(10000, par_mean, par_varcov)
samples <- as_tibble(samples) |> 
  mutate(prediction = rnorm(n = n(), mu, sigma))
# or simplehist(samples |> select(prediction))
# coda plots
library(coda)
samples_mcmc <- mcmc(samples)
densplot(samples_mcmc)

Note the rug of simulated instances along the x-axis. Here is a high density interval view of the three sets of samples. An 89% interval width allows us to say that the values of the posterior parameters and predictions lie between lower and upper bounds with 89% compatibility with the data. These are compatibility, plausibility, posterior probability, credibility intervals. That last adjective is for the insurance analysts in the gang. We build a simple helper function to complete our initial analysis.

library( tidybayes )
library( rethinking )
library( coda )
prob <- c(0.89)
#
# a helper function to make a table of lower and upper
# estimates of parameters and predictions
#
hdpi_tbl <- function( samples, prob=0.89 ){
  # coda's HPDinterval() works one column at a time
  x <- sapply( samples, 
               function(col) HPDinterval( mcmc(col), 
                                           prob = prob ) )
  rownames(x) <- c( concat("|lower ", prob), 
                    concat("upper ", prob, "|") )
  return( t(x) )
}
hdpi_samples <- hdpi_tbl( samples, prob=0.89)
hdpi_samples
##            |lower 0.89 upper 0.89|
## mu            5.841273    6.046985
## sigma         1.972836    2.119193
## prediction    2.639179    9.170438

We can expect, with 89% compatibility, a wide range of predictions for our data even though the intervals for the mean and standard deviation are much tighter. The tightness of the parameter intervals owes less to regularization than to the 1,000 observations pinning down \(\mu\) and \(\sigma\); each individual prediction still carries the full observation noise of \(\sigma \approx 2\).
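
One way to see this, sketched below, is to compare the widths of the three intervals directly, taking upper minus lower across the rows of the HPDI table we just built.

apply( hdpi_samples, 1, diff )   # interval width for mu, sigma, and the predictions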

Here is a posterior predictive plot for customer base hours in our example for the marketing manager.

# plot the posterior predictive simulation
samples |> 
  ggplot(aes(x = prediction)) +
  geom_histogram(color = "blue", alpha=0.7) +
  geom_vline( xintercept=hdpi_samples[3,1], 
              color="red") +
  geom_vline( xintercept=hdpi_samples[3,2], 
              color="red") +
  labs(title = "Posterior predictive distribution") +
  theme(panel.grid = element_blank())

We just built a visualization of the lows and highs of our model’s ability to predict, with principled, intelligible plausibilities as probabilities. The lower (upper) vertical line is the lower (upper) threshold of the 89% predictive high density probability interval we estimated with our bespoke function and the coda package.

Is there an easier way?

Yes, and fortunately for us Richard McElreath’s quap() will replicate what we have just produced with a minimum of strokes.

The model looks much the same, but with slightly different priors: a tighter Normal(1, 10) prior for \(\mu\) and a Uniform(0, 5) prior for \(\sigma\).

\[ \begin{align} y_i & \sim \operatorname{Normal}(\mu, \sigma) \\ \mu & \sim \operatorname{Normal}(1,10) \\ \sigma & \sim \operatorname{Uniform}(0,5) \\ \end{align} \]

Let’s translate this into rethinking-ese.

m_0 <- quap(
  alist(
    y ~ dnorm( mu, sigma ),
    mu ~ dnorm( 1, 10),
    sigma ~ dunif(0,5)
  ), data = list( y = y )
  )
summary( m_0 )
##           mean         sd     5.5%    94.5%
## mu    5.939603 0.06476846 5.836090 6.043115
## sigma 2.048201 0.04580069 1.975003 2.121400

So it looks like we have a nearly exact match!
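
As a quick side-by-side check (a sketch; coef() on the quap fit returns its point estimates), we can stack the two sets of estimates and eyeball the agreement.

rbind( optim_fit = fit$par,       # our hand-rolled optimizer
       quap_fit  = coef(m_0) )    # rethinking's quadratic approximation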

Was that enough? We should try just one more model for good measure. Let’s retrieve some data from a Food and Agriculture Organization survey from 2018 through 2022 for bean prices in two eastern provinces of the Democratic Republic of the Congo.

beans <- read.csv( "data/maniema-skivu-bean-inflation-sd-cor.csv")
summary( beans )
##      date           inflation_maniema   inflation_skivu       sd_maniema      
##  Length:128         Min.   :-1.711717   Min.   :-0.693432   Min.   :0.006694  
##  Class :character   1st Qu.:-0.014935   1st Qu.: 0.000000   1st Qu.:0.029829  
##  Mode  :character   Median : 0.002640   Median : 0.000000   Median :0.051483  
##                     Mean   : 0.003414   Mean   : 0.002825   Mean   :0.091475  
##                     3rd Qu.: 0.030766   3rd Qu.: 0.013151   3rd Qu.:0.098014  
##                     Max.   : 0.405465   Max.   : 0.588015   Max.   :0.905625  
##     sd_skivu       maniema_skivu     
##  Min.   :0.00000   Min.   :-0.99969  
##  1st Qu.:0.01034   1st Qu.:-0.42990  
##  Median :0.03069   Median :-0.04155  
##  Mean   :0.08707   Mean   :-0.09762  
##  3rd Qu.:0.12725   3rd Qu.: 0.19213  
##  Max.   :0.54981   Max.   : 0.87248
d_0 <- beans |> 
  mutate(
    M = inflation_maniema,
    S = inflation_skivu,
    week = 1:n()
  )
d_0 |> ggplot( ) +
  geom_point( aes( x=week, y= M), color="blue" ) +
  geom_point( aes( x=week, y= S), color="red" ) +
  ylim(-0.3,0.3)

Skivu inflation seems neatly nestled between 0.0 and 0.02 around a median of 0.00. There are strong negative and positive outliers in both series.

A scatter plot will inform us about some very interesting anomalies in this data.

d_0 |> ggplot( ) +
  geom_point( aes( x=S, y= M), color="blue" ) +
  ylim(-0.3,0.3)

There is not only a lot of variability; those 0.00 Skivu inflation readings almost form a line of demarcation between positive and negative Maniema inflation. We will need to address this, but not right at this moment.
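
A quick count, sketched here, puts a number on just how many of those exact zeroes sit in the Skivu series.

d_0 |> summarise( n_zero = sum(S == 0),     # weeks with exactly zero Skivu inflation
                  share  = mean(S == 0) )   # as a share of all 128 weeks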

We first pose this question: what is the impact of bean price movements in one province on the other? Next we hypothesize a model like this one.

\[ \begin{align} M & \sim \operatorname{Normal}(\mu, \sigma) \\ \mu & = \alpha + \beta S \\ \alpha & \sim \operatorname{Normal}(0,0.25) \\ \beta & \sim \operatorname{Normal}(0,0.5) \\ \sigma & \sim \operatorname{Exponential}(3) \\ \end{align} \]

We already have our sample of data, so the next step is to count the ways with an exploratory regression. But one of the first questions we should ask is about the shape and range of the priors for the regression of M on S, so we simulate from the priors before we fit.

n <- 50
tibble(group = seq_len(n),
       alpha = rnorm(n, 0, 0.25),
       beta = rnorm(n, 0, 0.5),
       sigma = rexp(n, 3)
       ) |>
  expand(nesting(group, alpha, beta, sigma), 
         skivu = rnorm( n, 0, 0.25) ) |>
  mutate(maniema = rnorm(n(), 
                         alpha + beta * skivu, sigma)) |>
  ggplot(aes(x = skivu, y = maniema, group = group)) +
  geom_line() +
  labs( title = "Checking the Priors",
        x = "skivu", 
        y = "maniema")

There is little question that these priors are wide enough to pick up plausible variations in Maniema inflation without generating absurd ones.

m_1 <- quap(
  alist(
    M ~ dnorm( mu, sigma ),
    mu <- aM + bM*S,
    aM ~ dnorm( 0, 0.25),
    bM ~ dnorm( 0, 0.5),
    sigma ~ dexp(3)
  ), data = d_0
  )
precis( m_1, hist=FALSE )
##               mean         sd        5.5%       94.5%
## aM     0.004189073 0.01537465 -0.02038259  0.02876074
## bM    -0.279613555 0.11132746 -0.45753634 -0.10169077
## sigma  0.174237881 0.01085602  0.15688786  0.19158790

How can we read this? We look at the probability intervals. When a parameter’s interval crosses zero, as aM’s does, the parameter may well be zero. The other two intervals sit firmly on one side of zero, so their signs are well determined. The sigma interval is fairly tight.

Two more comments supported with some visuals should finish this initial discussion. First, let’s look at the relationship among the parameters.

#library(rethinking)
pairs( m_1 )

We get the off-diagonal correlation clouds for the three parameters and on the diagonal the marginal probability distributions of the parameters.
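
A numerical counterpart to those clouds, sketched below, is the parameter correlation matrix implied by the quadratic approximation’s variance-covariance matrix.

round( cov2cor( vcov(m_1) ), 2 )   # correlations among aM, bM, and sigma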

Second, we generate predictive intervals around the regression line itself. We will use McElreath’s (2020, p. 107) manual method to learn a bit more R and to build model-building capability into our analytical bones.

post <- extract.samples( m_1 ) # pull 3 columns of sampled parms
mu_link <- function(S) post$aM + post$bM*S 
                                        # calculator for mu line
S_seq <- seq( from=-0.7, to=0.7, length.out=length(d_0$S))   
                                        # evenly spaced S
mu <- sapply( S_seq, mu_link )          # calculated line
mu_mean <- apply( mu, 2, mean )         # make averages
mu_PI <- apply( mu, 2, PI, prob=0.91)   # make credibility intervals
#summary( mu_PI ) # why 128 columns? one for each value of S_seq

All of this is buried in the rethinking::link() function which works for simple linear models. We might need to expand our capabilities a bit when we get to more realistic models like our interactive ODEs, thus the manual method.
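
For comparison, here is a sketch of the same calculation with link(); it should agree with the manual version above up to simulation noise.

mu_link2 <- link( m_1, data = data.frame(S = S_seq) )   # samples of mu at each S_seq value
mu_mean2 <- apply( mu_link2, 2, mean )                  # posterior mean line
mu_PI2   <- apply( mu_link2, 2, PI, prob = 0.91 )       # 91% compatibility intervals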

post_tbl <- tibble(
  S_seq = S_seq,
  lower = t(mu_PI)[,1],
  upper = t(mu_PI)[,2],
  mean = mu_mean,
  M = d_0$M,
  S = d_0$S
)
# plot
p <- post_tbl |>
  ggplot( aes( x=S, y=M ) )+
  geom_point() +
  geom_line( aes( x=S_seq, y=mean ), 
             color="red" ) +
  geom_ribbon( aes( x=S_seq, ymin=lower, ymax=upper ), 
               alpha=0.1, fill = "blue",  
               color = "red", 
               linetype = "dotted" ) +
  ylim(-0.5, 0.5) + 
  labs( x = "Skivu bean price inflation",
        y = "Maniema bean price inflation",
        title = "Impact of Skivu bean prices on Maniema",
        subtitle = "91% Predictive Probability Intervals"
        )
p

We are quite obviously not capturing the dynamics of the relationship between Maniema and Skivu bean prices! And look at all of those 0.00 Skivu bean price inflations. Much to learn as we proceed.

Enough for now!

For now at least. We accomplished a lot. But perhaps the biggest takeaway is our newfound ability to deploy simulation with probability to index the compatibility of hypotheses with data, and so to learn from the data.

We learned something because we spent some quality time inferring.

References

Laplace, P. S. de, F. W. Truscott, and F. L. Emory. 1902. A Philosophical Essay on Probabilities. Wiley. https://books.google.com/books?id=WxoPAAAAIAAJ.
McElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second edition. CRC Press. https://xcelab.net/rm/statistical-rethinking/.

  1. We recall from statistics and any good dictionary that the mode is the most frequently occurring element in a series of events.↩︎