Two rivals

Over the past 100 years the Fordham Rams and the Manhattan Jaspers have rivaled each other in sports, debating, innovation challenges, you name it. This year they attempt to forecast the future: a 5-day forecast for rain in the Bronx. The weather will be rain or shine. Strict rules as to what counts as rain or shine are available from the rivals upon request.

Fordham Ram Jill faces off with Manhattan Jasper Will in this contest. Here are their 5-day forecasts of the chance of rain (e.g., 1 = 100% chance of rain, 0 = 0% chance of rain) along with the observations \(i\), rain \(R\) or shine \(S\). Our job is to evaluate their performance.

| Day | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Jill | 1 | 1 | 0.6 | 0.6 | 0.6 |
| Will | 0 | 0 | 0 | 0 | 0 |
| \(i\) | R | R | S | S | S |

The competition organizers tried two methods of evaluation. The first is an average forecast hit rate.

| Forecaster | Calculation | Hits per 5 days |
|---|---|---|
| Jill | \((2 \times 1) + (3 \times (1-0.6))\) | \(3.2\) |
| Will | \((2 \times 0) + (3 \times (1-0.0))\) | \(3.0\) |

Most of the competition assessors agree that this measure of performance is too close to call. The informational surprise between the two rivals is just not there.
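A minimal R sketch of that hit-rate arithmetic (the vectors simply transcribe the table above; hit_rate() is our own throwaway helper, not a package function):

# forecast probabilities of rain and the observed outcomes (1 = rain, 0 = shine)
jill     <- c(1, 1, 0.6, 0.6, 0.6)
will     <- c(0, 0, 0, 0, 0)
rain_obs <- c(1, 1, 0, 0, 0)
# a "hit" credits the probability assigned to what actually happened each day
hit_rate <- function(forecast, obs) sum( ifelse(obs == 1, forecast, 1 - forecast) )
hit_rate(jill, rain_obs)  # 3.2 hits over 5 days
hit_rate(will, rain_obs)  # 3.0 hits over 5 days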

Jill and Will certainly do not like to get wet! Umbrellas are still in vogue in the Bronx. Both Jill and Will agree that there is some grief experienced. They are given a choice of the level of grief they are willing to suffer if they are wrong about their forecasts, given they can carry an umbrella. The amount of grief associated with carrying an umbrella is \(-1\) on the Bronx grief index. Worse is to get wet! That’s \(-5\) on the Bronx grief index. When the judges apply this rubric to the forecasts, here is the analysis that supports their verdict. Getting wet is 5 times more lossy than carrying an umbrella. It is this relative “times” or “ways” which matters in the rubric.

| Day | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Jill | -1 | -1 | -0.6 | -0.6 | -0.6 |
| Will | -5 | -5 | 0 | 0 | 0 |
| \(i\) | R | R | S | S | S |

Jill carries her umbrella throughout her forecast. When it rains she grieves at a rate of -1. When it is sunny her expected grief is \(0.6 \times (-1) = -0.6\), owing to her forecast of 60% chance of rain.

Will, on the other hand, thinks it will be sunny, does not carry an umbrella, mainly to support his hubris, and suffers much grief, on the order of \(-5\), on the two rainy days. The rest of the time he is free and clear.

Here are the judges’ calculations.

| Forecaster | Calculation | Grief | \(-log(grief)\) |
|---|---|---|---|
| Jill | \((2 \times -1) + (3 \times -0.6)\) | \(-3.8\) | 2.74 |
| Will | \((2 \times -5) + (3 \times 0)\) | \(-10.0\) | 9.21 |

But who’s counting? The \(-log(grief)\) is a measure of the information, the surprise, in each of the two contestants’ forecasts. We are back to including some extra-statistical skin in the game, namely, a Bruno de Finetti bet.1 Why a \(log()\) transformation? We will soon see why.
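Where do 2.74 and 9.21 come from? A hedged sketch: score each contestant by the joint probability they assigned to the observed rain-rain-shine-shine-shine sequence and take \(-log\) of it. Will’s zeros would make that surprise infinite, so we floor his daily probabilities at 0.01, an assumption on our part that happens to reproduce the judges’ 9.21.

# probability each contestant assigned to what actually happened each day
jill_obs <- c(1, 1, 0.4, 0.4, 0.4)   # shine days get 1 - 0.6
will_obs <- c(0, 0, 1, 1, 1)
will_obs <- pmax(will_obs, 0.01)     # assumed floor: -log(0) would be infinite
# surprise of the whole 5-day sequence
log_score <- function(p_obs) -log( prod(p_obs) )
log_score(jill_obs)   # about 2.75 (reported above as 2.74)
log_score(will_obs)   # about 9.21 = -log(0.01 * 0.01)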

Information

Suppose we communicate through a channel called a relay. This is a physical device for sending and receiving a simple yes or no between two persons. There are thus only two semantic words, “yes” and “no,” represented by two symbols, “1” and “0.” Instead of the typical electrical circuitry used to describe this process, we will use at least the biological manifold in a physical medium of air.2 We imagine that air waves link the mouth anatomy of the Homo sapiens sender (labial, alveolar, laryngeal, dental, lingual apparatus, trigeminal and vagus nerves) and the ear anatomy of the receiver (pinna, tympanum, stapes, malleus, semicircular canals, cochlea, 7th and 8th cranial nerves).3 The simplest sequence would begin with a yes followed by perhaps another yes or perhaps a no. So the bare information sequence is not enough! The “perhaps” will raise a further question of a plausible sequence.

The physicist, philosopher, and theologian John Polkinghorne defines information as “the specification of dynamical patterns of behavior and energy flow” (Ibid., p. 78). Aristotelian as he is, for Polkinghorne a pattern is an intelligible, understandable artifact which is the formal cause itself within persons. Manifolds (top) of existing nature, ideas, polities guide, influence, sometimes dictate, the components (down) subsumed by the manifolds. A form, as cause and finality, expressed in a pattern carries content, like the genetic code represented by the set of DNA molecular nucleotide bases: adenine (A), cytosine (C), guanine (G), and thymine (T). As important as these components are, the manifold intelligibility of DNA, as DNA, is the order of these four nucleotide bases. The manifold is more than the sum of the parts; it is the order of these bases called the DNA sequence. Skepticism abounds in our culture, one that reduces all inference, and information within inference, to the parts, not the whole. Lonergan (1957) argues for a critical realism where both scientific and metascientific conclusions interact actively, dynamically at any particular time and space and across time, matter, and spirit. Moreover, we are able to find reasons for our reasons, that is, to be reflexively self-critical. This cannot be possible without a non-reductionist science which must point to an overarching frame, form, pattern, manifold of the reason for the reasons. The bottom-up inferences of the parts, the flow of information, dynamically interact with the top-down transcending manifold of the font of wisdom.4

We consider two ways a set of information components \(\{A, B, C\}\) can form a two-step communication. If our sender transmits \(A\), the outcomes are: \(A\) again, that is \(AA\); \(B\), that is \(AB\); or \(C\), that is \(AC\). We need only 1 bit, one element of the information set, to transmit \(A\) again, and 2 bits each to transmit \(AB\) or \(AC\).

\usetikzlibrary{arrows}
\usetikzlibrary {positioning}
\begin{tikzpicture}[node distance=2cm, auto,>=latex', thick, scale = 0.5]
\node (P) {$A$};
\node (A) [right of=P] {$AB:\,\frac{1}{3}\cdot 2\,bits$};
\node (B) [above of=A] {$AA:\,\frac{1}{2}\cdot 1\,bit$};
\node (C) [below of=A] {$AC:\,\frac{1}{6}\cdot 2\,bits$};
\draw[->] (P) to node {$\frac{1}{2}$} (B);
\draw[->] (P) to node {$\frac{1}{3}$} (A);
\draw[->] (P) to node {$\frac{1}{6}$} (C);
\end{tikzpicture}
Funky tikz

Here is a more complicated version of the same information outcomes.

\usetikzlibrary{arrows}
\usetikzlibrary {positioning}
\begin{tikzpicture}[node distance=2cm, auto,>=latex', thick, scale = 0.5]
\node (P) {$A$};
\node (D) [below right of=P] {};
\node (B) [above right of=D] {$AB:\,\frac{1}{2}\cdot\frac{2}{3}\cdot 2\,bits$};
\node (A) [above of=B] {$AA:\,\frac{1}{2}\cdot 1\,bit$};
\node (C) [below right of=D] {$AC:\,\frac{1}{2}\cdot\frac{1}{3}\cdot 2\,bits$};
\draw[->] (D) to node {$\frac{2}{3}$} (B);
\draw[->] (P) to node {$\frac{1}{2}$} (A);
\draw[->] (D) to node {$\frac{1}{3}$} (C);
\draw[->] (P) to node {$\frac{1}{2}$} (D);
\end{tikzpicture}
Funky tikz

The average information, the expected value \(<I>\), of each information structure is the weighted average of the bits. We note that they have the same three outcomes but get at them by different means.

\[ <I_1> = (1/2) \times 1 + (1/3) \times 2 + (1/6) \times 2 = 1.5 \]

For the second structure, which we recall yields the same outcomes we get an unsurprising result.

\[ <I_2> = (1/2) \times 1 + (1/2) \times [(2/3) \times 2 + (1/3) \times 2] = 1.5 \]
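A quick computational check of both weighted averages in plain R:

# structure 1: three direct branches with their bit costs
p_1    <- c(1/2, 1/3, 1/6)
bits_1 <- c(1, 2, 2)
sum( p_1 * bits_1 )                           # 1.5 bits on average
# structure 2: AA directly, then AB and AC behind a second branch
(1/2) * 1 + (1/2) * sum( c(2/3, 1/3) * 2 )    # also 1.5 bits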

Their average information is the same, as it should be, since both structures get to the same place but with different complexity. It is interesting that when we replace the bits with the logarithm of the probability, we get a similar result. But this is no longer average information.

\[ log_2 (exp( 1 )) = 1.4 \]

What is the scale of difficulty when we focus on the way bits might evolve instead of the results of the evolution? This is the stuff of entropy. We now continue our search.

Which structure is more uncertain? Which one is more difficult to predict the outcomes? Do we need any knowledge of the outcomes at all?

Uncertainty

We want to use our information structures to measure the level of uncertainty, that is, the amount of uninformativeness, of surprise, one possible indicator of awe and wonder. With data and systems we translate surprise into difficulty to predict. The most difficult system to predict is one whose outputs are equally likely. Why? Flipping an unbiased coin can go either way. Which way? We cannot predict this. So any measure of difficulty, of uncertainty of the outcome, should give such a system the highest amount. Thus a probability split of (1/2, 1/2) between two choices would be the most difficult. Much less difficult would be a (1/10, 9/10) or (9/10, 1/10) split.

Four desiderata emerge from Shannon (1948): “Can we find a measure of how much ‘choice’ is involved in the selection of the event or of how uncertain we are of the outcome?” (Ibid., p. 10). What we are looking for we will call \(H(p_i)\).

  1. \(H\) shall be continuous. This means no holes or singularities (which can be resolved if necessary!). Operationally this requirement means we will prepare sequences of probabilities and outcomes both aligned or matched with one another, mutually exclusive and exhaustive of the space of events we investigate.

  2. Equal \(p_i\) will allow the most choice, meaning the most uncertainty in an informational context.

  3. There is no uncertainty when there is no chance at all or when something always happens. Thus we ask that \(H(0) = H(1) = 0\).

  4. In our second informational structure the outcome \(AA\) occurs with probability 1/2, but the number of ways either the \(AB\) or the \(AC\) outcome can occur depends on whether we first branch away from \(AA\). Yikes! This means that the ways an outcome occurs must be allowed to be decomposed into the ways its successive outcomes (\(AB\) or \(AC\)) might occur.

For our examples this means that the decomposition requirement (number 4 above), together with requirements 1 and 2, imposes this constraint.

\[ H\left(\frac{1}{2},\frac{1}{3},\frac{1}{6}\right) = H\left(\frac{1}{2},\frac{1}{2}\right) + \frac{1}{2}H\left(\frac{2}{3},\frac{1}{3}\right) \] The deceptive simplicity and elegance of this constraint demands an exploration. In our search for an \(H\) which fulfills our requirements we will see why \(log()\) scores might “work.”

What \(H\) preserves this obvious equality? Let’s start with what is almost obvious. Let’s take the expected value of the 1 bit result \(AA\). Its expected value is

\[ <I(AA)> = \frac{1}{2} \times 1 = \frac{1}{2} \]

What is that 1? We have a 2 bit communication system. The number of ways we can get 1 bit in this model is simply \(2^1 = 2\) in total. Here we have 1 way, or \(1/2\) of the ways. We need only 1 bit, have 2 to use, so we have 1 left over, thus \(1-2=-1\). Yes, just simple counting, and we overlook the obvious! The exponential form tells us that if we have two bits, nats, things, and have 1 left over, that is \(-1\), then this is \(2^{(1-2)}=2^{-1}=1/2\). The amount left over is the number of bits we don’t know what to do with, and this represents our lack of understanding of what will happen next, a state of affairs we call uncertain. But the rate of getting this information is just 1 bit. What expression returns the rate of growth of, in this case, information? The logarithm does, by its very definition. If the exponential operator gives us the amount of growth at a given rate, its inverse returns the given rate. We can get that leftover bit by this simple operation.

\[ log_2(2^{-1}) = log_2\left(\frac{1}{2}\right) = -1 \] To get the positive statement of how much is leftover, we just multiply by \(-1\). Now, let’s clear the decks again and instead of looking at bit outcomes let’s view the probabilities of a message as the actual carrier of the bits. In this projection a probability of \(1/3\) means we have the equivalent of 3 bits and use only one to form a piece of information. If this projection makes that kind of sense, then we have

\[ \frac{1}{3} \rightarrow 1-3=-2 \]

bits left over. More bits left over means less predictable, more difficult, to guess at what the next piece of information will be. Isn’t it more difficult than a probability of \(1/2\)?

\[ \frac{2}{3} \rightarrow 1-2=-1 \]

But let’s hold on! Is this the same degree of difficulty as a probability of \(1/2\)? We look for the number of ways a 2 bit set can generate a \(2\) bit message out of \(3\) bits, and it looks like a bit from the set will be repeating. A 2 bit message from a 2 bit set would be \(AB\), a 2 bit piece of information in a 2 bit world of \(\{A,B\}\). A 3 bit example in a 2 bit world would be \(ABB\), 3 bits long, which consumes 2 bits with one of the 2 bits repeating. Simple multiplication is at work here.

\[ log_2(2 \times 3^{-1}) = log_2\left(\frac{2}{3}\right) = -0.58 \] Thus a 3 bit long message which consumes 2 bits leaves \(-0.58\) bits of information behind. What about the degree of predictability with a probability of \(2/3\)? Our arithmetic should be consonant with our intuition. Higher probability means less difficulty in predicting, in guessing at the next move, so we should have fewer bits left over. And yes, with the proper interpretation of bits, information as compounds of bits, and a useful way to ferret out exponents with logarithms, we might be onto something here. The result, stated in positive leftover bits, is that \(0.58 < 1\), so that, let’s call it \(U\) (for uncertainty for the moment),

\[ U\left(\frac{2}{3}\right) < U\left(\frac{1}{2}\right). \] In mathematical language, with \(p_i\) and \(p_j\) for two probabilities, we would say that \(U(p)\) is monotonically decreasing in \(p\): whenever \(p_j < p_i\), we have \(U(p_j) > U(p_i)\).
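Before plotting, a quick numerical check that \(-log_2(p)\) behaves this way, and that the weighted average \(-\sum p_i log_2(p_i)\) honors the decomposition constraint stated earlier. We use base-2 logs here so the units are bits; the plots below use natural logs, which only rescales the units.

-log2(2/3)   # about 0.58 bits: the likelier event, easier to predict
-log2(1/2)   # exactly 1 bit: harder to predict
# the decomposition constraint, checked numerically
H <- function(p) -sum( p * log2(p) )
H( c(1/2, 1/3, 1/6) )                          # about 1.46
H( c(1/2, 1/2) ) + (1/2) * H( c(2/3, 1/3) )    # the same 1.46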

library(tidyverse)   # tibble() and ggplot() live here

p <- seq(0.001, 0.999, 0.1) # avoids NaNs and -Inf: 0 < p < 1
p_rev <- 1 - p 
U <- -log(p)          # uncertainty of p
U_rev <- -log(p_rev)  # uncertainty of 1 - p
tibble(
  p = p,
  U = U,
  U_rev = U_rev
) |>
  ggplot( aes( x=p, y=U)) +
  geom_line( color="blue") +
  geom_point( shape=21, color="red") +
  geom_line( aes( x=p, y=U_rev), color="red") +
  geom_point( aes( x=p, y=U_rev), shape=21, color="blue") +
  labs( title = "Ups and downs of Uncertainty",
        subtitle = "-log(p) and -log(1-p)"
        )

The blue line runs the \(-log(p)\) relationship with \(0<p<1\) computationally. In reverse the red line runs the \(-log(1-p)\) relationship. They meet at \(p=1-p=0.5\). Both are a bit convex.

We finally get to the end of our journey. We can now define uncertainty across all of the probabilities which help us guess about outcomes as the average uncertainty for all particular outcomes. And we do not even need to know the outcomes! We only need to specify the probabilities, the chance of an outcome. This measure we call \(H(p)\), perhaps after Hartley who started this discussion in 1928 and is cited in Shannon (1948).

\[ H(p) = -\sum_{i=1}^N p_i log(p_i) \]

For the case of \(N=2\), a binary choice situation, we have \(p_1\) and \(p_2\). We let \(p=p_1\); since \(p_1+p_2=1\), we have \(p_2=1-p\). Now we can compute and visualize.

\[ H(p) = -\sum_{i=1}^N p_i log(p_i) = -[p log(p) + (1-p) log(1-p)] \]

If \(p=0.3\), then \(1-p=0.7\), so that

\[ \begin{align} H(p) &= -[p log(p) + (1-p) log(1-p) ] \\ &= -[(0.3)log(0.3)+ (0.7)log(0.7)] \\ &= -[(0.3)(-1.21) + (0.7)(-0.36)] \\ &= 0.61 \end{align} \]

The result is symmetric in \(p\) and \(1-p\), so it does not matter which branch of the information structure tree a probability hangs from. We check with R.

p <- 0.3
H <- -(p*log(p) + (1-p)*log(1-p))
H
## [1] 0.6108643

Isn’t that wonderful? Even more splendor, clarity, and perhaps some insight can be gained from running possible pairings of \(p\) and \(1-p\), graphing them as well. We reuse some code.

options( digits=2 )
p <- seq(0.001, 0.999, 0.1) # avoids NANS: 0<p<1
p_rev<- 1 - p 
U <- -log(p)
U_rev <- -log(p_rev)
H <- p*U + p_rev*U_rev
max_H <- max( H )
max_p <- p[ which.max(H) ]
max_label <- paste0( "maximum H(p) = ", round(max_H, 2), "\nat p = ", round(max_p, 2 ) ) # paste0(), not concat(); "\n" makes the line break
scale <- 10 # to get the units in range of one another
            # the secondary axis will display original unscaled units 
tibble(
  p=p,
  U=U,
  H=H
) |>
  ggplot( aes( x=p, y=U)) +
  geom_line( color="blue", size=1.5) +
  geom_point( shape=21, color="red", size=1.5) +
  geom_line( aes( x=p, y=U_rev), color="red", size=1.5) +
  geom_point( aes( x=p, y=U_rev), shape=21, color="blue", size=1.5) +
  geom_line( aes( x=p, y=H*scale), color="grey10", size=1.25) +
  geom_point( aes( x=p, y=H*scale), shape=21, color="darkgreen", size=1.5) +
  geom_vline( xintercept=max_p, color="grey75", linetype="dashed", size=0.75) +
  geom_point( aes(x=max_p, y=max_H*scale), color="red", shape=22, size=4.0) +
  scale_y_continuous(sec.axis = sec_axis(~./scale, name="Average Uncertainty H(p) bits")) +
  annotate( "text", x=0.77, y=6.99, label=max_label)+
  labs( title = "Ups and downs of Uncertainty",
        subtitle = "-log(p), -log(1-p), H(p)",
        y = "Uncertainty -log(p) bits"
        )

There, we have it all! We can read Shannon (1948) Information Entropy \(H(p)\) off the secondary axis. The maximum entropy is approximately 0.69 when the odds are equal, \(p=0.5\). The relationship of \(H(p)\) to \(p\) is approximately quadratic. From the definition of the logarithm operator for \(0<p<2\) we have

\[ \operatorname{Log}(p) = (p-1) - \frac{1}{2}(p-1)^2 + \ldots\,\,. \] We can use this definition of the ongoing power series, an approximation, to derive in our binary case up to the second power, that

\[ H(p) = \frac{5}{2}{(p - p^2)} \]
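For the record, here is the short algebra behind that claim. Substitute the truncated series \(\operatorname{Log}(p) \approx (p-1) - \frac{1}{2}(p-1)^2\) and \(\operatorname{Log}(1-p) \approx -p - \frac{1}{2}p^2\) into the definition of \(H(p)\):

\[ \begin{align} H(p) &= -[p\operatorname{Log}(p) + (1-p)\operatorname{Log}(1-p)] \\ &\approx -\left[p(p-1) - \tfrac{p}{2}(p-1)^2 - p(1-p) - \tfrac{p^2}{2}(1-p)\right] \\ &= -\left[2p(p-1) + \tfrac{p(p-1)}{2}\big(p-(p-1)\big)\right] \\ &= -\tfrac{5}{2}\,p(p-1) = \tfrac{5}{2}(p - p^2) \end{align} \]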

We can easily check that \(H(0)=H(1)=0\), so that this function, unlike our \(log()\) associate, is defined without singularities on \(0 \le p \le 1\). Singularities generate NaNs (not a number) in R.

We can calculate the optimal \(p\) and \(H(p)\) by taking the (Faulhaber) first derivative \(D_1(p)\), setting this derivative equal to zero, and solving for \(p\).5 Here are some details.

\[ D_1(H(p)) = \frac{5}{2}(1 - 2p) = 0 \] only when \(p=0.5\). \(H(p)\) describes a concave (cupped down) quadratic relationship with \(p\) centered at \(p=0.5\), which is also where the optimum sits: \(H^*(p=0.5) = 0.62\), close to the exact value of \(0.69\) computed above. The second derivative is negative, indicating a local maximum.

What do we do with all of this? Produce another graph.

options( digits=2 )
p <- seq(0.001, 0.999, 0.1) # avoids NANS: 0<p<1
p_rev<- 1 - p 
U <- -log(p)
U_rev <- -log(p_rev)
H_comp <- p*U + p_rev*U_rev
H_quad <- (5/2)*( p - p^2 ) # derived from the power series 
                            # definition of Log(p) taken to a 
                            # quadratic approximation: and no need for
                            # o(n)!
max_H_comp <- max( H_comp )
max_p_comp <- p[ which.max(H_comp) ]
max_H_quad <- max( H_quad )
max_p_quad <- p[ which.max(H_quad) ]
max_label_comp <- paste0( "max computer H(p) = ", round(max_H_comp, 2), " at p = ", round(max_p_comp, 2 ) ) # paste0(), not concat()
max_label_quad <- paste0( "max quadratic H(p) = ", round(max_H_quad, 2), " at p = ", round(max_p_quad, 2 ) )
scale <- 10 # to get the units in range of one another
            # the secondary axis will display original unscaled units 
tibble(
  p = p,
  U = U,
  U_rev = U_rev,
  H_comp = H_comp,
  H_quad = H_quad
) |>
  ggplot( aes( x=p, y=U)) +
  geom_line( color="blue", size=1.5) +
  geom_point( shape=21, color="red", size=1.5) +
  geom_line( aes( x=p, y=U_rev), color="red", size=1.5) +
  geom_point( aes( x=p, y=U_rev), shape=21, color="blue", size=1.5) +
  geom_line( aes( x=p, y=H_comp*scale), color="grey10", size=1.25) +
  geom_point( aes( x=p, y=H_comp*scale), shape=21, color="darkgreen", size=1.5) +
  geom_line( aes( x=p, y=H_quad*scale), color="grey10", size=1.25) +
  geom_point( aes( x=p, y=H_quad*scale), shape=21, color="darkgreen", size=1.5) +
  geom_vline( xintercept=max_p_comp, color="grey75", linetype="dashed", size=0.75) +
  geom_point( aes(x=max_p_comp, y=max_H_comp*scale), color="red", shape=22, size=4.0) +
    geom_point( aes(x=max_p_quad, y=max_H_quad*scale), color="red", shape=22, size=4.0) +
  scale_y_continuous(sec.axis = sec_axis(~./scale, name="Average Uncertainty H(p) bits")) +
  annotate( "text", x=0.77, y=6.99, label=max_label_comp)+
    annotate( "text", x=0.23, y=6.99, label=max_label_quad)+
  labs( title = "Ups and downs of Uncertainty",
        subtitle = "-log(p), -log(1-p), H(p)",
        y = "Uncertainty -log(p) bits"
        )

At this moment we might ascribe the differences to algorithmic precision. The shapes appear similar and both are symmetric about \(p=0.5\). Both approach an optimum at this point on the \(p\)-axis. An algebraic geometry of \(H(p) = (5/2)(p-p^2)\) will confirm a horizontal tangent line at \(p=0.5\), where the first (sub-)derivative is exactly zero.6
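A quick symbolic check with base R’s D() function (a sketch; D() differentiates simple expressions symbolically) agrees with the algebra:

dH <- D( expression( (5/2)*(p - p^2) ), "p" )   # symbolic first derivative
dH                                              # prints 5/2 * (1 - 2 * p)
eval( dH, list(p = 0.5) )                       # 0: a horizontal tangent at p = 0.5
(5/2)*(0.5 - 0.5^2)                             # 0.625, the quadratic approximation's maximum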

Next we put all of this uncovering of the nature of uncertainty to work, applying our hard-fought knowledge to move from uncertainty to accuracy.

Diverging Deviances

Kullback and Leibler (1951) show us how to use information entropy to measure the additional uncertainty we impose on our analysis when we use one probability distribution to describe another probability distribution. In effect we were trying to figure this out with Jill and Will. The Kullback-Leibler divergence \(D_{KL}\), for \(N\) occurrences with probabilities \(p_i\) from one distribution and \(q_i\) from another, is

\[ D_{KL} = \sum_{i=1}^N p_i log\left(\frac{p_i}{q_i}\right) = \sum_{i=1}^N p_i [log(p_i)- log (q_i)] \]

The last expression takes advantage of the quotient property of logarithms, one of the reasons we use logarithms: they reduce measurement to yardstick (meter stick) comparisons.
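Before working the examples, a minimal helper of our own (a plain R function, not from any package) makes the definition concrete:

# Kullback-Leibler divergence of q from p for a discrete distribution
kl_div <- function(p, q) sum( p * (log(p) - log(q)) )
kl_div( c(0.3, 0.7), c(0.6, 0.4) )   # about 0.18, matching the tibble calculation below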

Suppose we ask about how different these two binary distributions are from one another: \(p_i\) and \(q_i\).

tibble(p_1 = 0.3,
       p_2 = 0.7,
       q_1 = 0.60,
       q_2 = 0.40
       ) |>
  mutate(D_KL = p_1*(log(p_1) - log(q_1)) + p_2 * (log(p_2) -log(q_2)) ) 
## # A tibble: 1 × 5
##     p_1   p_2   q_1   q_2  D_KL
##   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1   0.3   0.7   0.6   0.4 0.184

One more ensemble begs our attention.

tibble( p_1 = 0.7,
        p_2 = 0.3,
        q_1 = 0.40,
        q_2 = 0.60
       ) |>
  mutate( D_KL = p_1*(log(p_1) - log(q_1)) + p_2 * (log(p_2) -log(q_2)) ) 
## # A tibble: 1 × 5
##     p_1   p_2   q_1   q_2  D_KL
##   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1   0.7   0.3   0.4   0.6 0.184

But this arrangement is just as divergent as the last. We notice that when we change the information structure of \(q_i\) relative to \(p_i\) we leave more bits behind, and thus more information uncertainty is induced in the arrangement. Yet, as here, it is possible for two ensembles of information to differ in structure and still have the same information entropy difference.

In this next experiment we let \(p_i\) stay the same but vibrate \(q_i\) a lot, and of course we want to plot this thing. The tibble will this time require two mutate() moves, and we still remember that for computational purposes with log() we need to specify probabilities such that \(0 < q_i < 1\) to avoid NaNs.

d_D_KL <- tibble( p_1 = 0.7,
                  p_2 = 0.3,
                  q_1 = seq(from = 0.001, to = 0.999, by = 0.01)
       ) |>
  mutate( q_2 = 1- q_1) |> 
  mutate( D_KL = p_1*(log(p_1) - log(q_1)) + p_2 * (log(p_2) -log(q_2)) )
  min_D_KL <- min( d_D_KL$D_KL )
  min_q <- d_D_KL$q_1[ which.min( d_D_KL$D_KL )]
  min_label <- paste0( "minimum Dkl(p,q) = ", round(min_D_KL, 2), " at q_1 = ", round(min_q, 2 ) ) # paste0(), not concat()
d_D_KL |>
  ggplot( aes( x=q_1, y=D_KL)) +
  geom_line( color="blue", size=1.5) +
  geom_point( shape=21, color="red", size=1.5) +
  geom_vline( xintercept=min_q, color="grey30", linetype="dashed", size=0.75) +
  geom_point( aes(x=min_q, y=min_D_KL), color="red", shape=22, size=4.0) +
  annotate( "text", x=0.4, y=2.0, label=min_label)+
  labs( title = "Divergence",
        subtitle = "Kullback-Leibler",
        y = "Divergence bits"
        )

We are now armed with the basics for understanding information criteria to compare, and even select, models based on their divergence from one another.

WAIC stands for ____ ____ ? Yes, it is an information criterion (IC) and it dutifully computes a statistic that tests the ability of a model to predict. Log odds ratios are involved, as well as the number of parameters, and thus a rudimentary measure of the complexity of a model. This criterion approximates the out-of-sample deviance, which converges to the cross-validation approximation in a large sample.

A little bit of background is in order. The W and the A in \(WAIC\) can easily refer to Sumio Watanabe (2010), as well as to the more colloquial Widely Applicable Information Criterion. 7

The cross-validation approximation, especially using the \(N\)-fold leave-one-out (LOO) strategy, results in the Pareto-smoothed Importance Sampling (PSIS) cross-validation approach. Both WAIC and PSIS go beyond simple goodness of fit, such as \(R^2\) and ANOVA with its \(F\) distribution. The measure attempts to weight observations by their importance to the predictive accuracy of the model.

WAIC combines the log-pointwise-predictive density (\(lppd\), that is, the Bayesian deviance also used in the PSIS cross-validation approach) and a penalty proportional to the variance in posterior predictions. The penalty might be analogous to a ridge regression parameter. The penalty resolves the problem of singularities in methods like mixture distribution and neural network models. We are still only concerned with predictive accuracy here, not at all with confounding variables and causal inference, which are local-variable and global modeling issues, respectively.

Here is \(WAIC\) in all its glory. We will be unpacking this measure very soon with live data.

\[ \begin{equation} WAIC(y, \Theta) = -2\left(lppd - \underbrace{\Sigma_i var_{\theta}\,log \,\,p(y_i|\theta)}_{penalty}\right) \end{equation} \]
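As a preview of the step-by-step calculation below, here is a hedged sketch of the whole formula wrapped into one function of a log-likelihood matrix (rows are observations, columns are posterior samples). Both lse() and waic_from_logprob() are names of our own invention; lse() stands in for rethinking::log_sum_exp.

# numerically stable log(sum(exp(x)))
lse <- function(x) { m <- max(x); m + log( sum( exp(x - m) ) ) }
waic_from_logprob <- function(log_prob) {
  n_samples <- ncol(log_prob)
  lppd    <- apply( log_prob, 1, function(row) lse(row) - log(n_samples) ) # log of the average likelihood per observation
  penalty <- apply( log_prob, 1, var )                                     # posterior variance of the log-likelihood
  -2 * ( sum(lppd) - sum(penalty) )
}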

The \(lppd\) comes from the notion of the Kullback-Leibler (KL) model of divergence: the additional uncertainty induced by using probabilities from one distribution to describe another distribution. As we have seen, KL divergence is the distance in log odds ratios between two models. The two models in our framework are these: model 1 leaves one observation out of the calculation of posterior probabilities; model 2 uses all of the observations. We must iterate model 1, leaving one observation out at a time, the N-fold approach to cross-validation. We can expect lots of computing to make our day now.8

The Bayesian version of the log-probability score is called the log-pointwise-predictive density for some data \(y_i\) and posterior distribution of parameters \(\Theta\). With \(N\) observations, and fitting the model \(N\) times while dropping a single observation \(y_i\) each time, the out-of-sample lppd is the sum of the average accuracy for each omitted \(y_i\).

\[ \begin{equation} lppd(y, \Theta) = \Sigma_i \, log \frac{1}{S} \Sigma_s p(y_i | \Theta_s) \end{equation} \]

where \(S\) is the number of samples and \(\Theta_s\) is the \(s\)-th set of sampled parameter values in the posterior distribution, the hypotheses used with sample \(s\) in the leave-one-out cross-validation scheme.

\[ \begin{equation} S(q) = \Sigma_i log(q_i) \end{equation} \]

Whenever we see logs we know, or at least suspect, that we have scores.

The penalty measure in WAIC is part and parcel of a correction known in non-Bayesian models as ridge regression. The ridge is a parameter that tunes the fit in a regularizing way, much like choosing a fairly tight prior to squeeze only enough information out of the data and deposit only enough of the likelihood mass into the posterior probability distribution. We can shorten the penalty into an effective number of parameters statistic, \(p_{WAIC}\), much like how the measurement of full-time equivalent employees relates to head count.

We finally get to calculate something!

Taxes anyone?

Let’s use the Laffer dataset loaded with the rethinking package to illustrate what we have so far. We begin a regularization process by standardizing the variables.

library(rethinking)
library(tidyverse)
data(Laffer)
summary(Laffer)
##     tax_rate   tax_revenue  
##  Min.   : 0   Min.   :-0.1  
##  1st Qu.:22   1st Qu.: 2.2  
##  Median :28   Median : 3.1  
##  Mean   :26   Mean   : 3.3  
##  3rd Qu.:33   3rd Qu.: 3.6  
##  Max.   :35   Max.   :10.0
d <- Laffer
d |>
  ggplot( aes( x=tax_rate, y=tax_revenue)) +
  geom_point( color = "blue" ) +
  geom_density2d( color = "red" )

d$T <- (d$tax_rate - mean(d$tax_rate) ) / sd(d$tax_rate)
d$R <- (d$tax_revenue - mean(d$tax_revenue) ) / sd(d$tax_revenue)
m <- quap(
alist(
  R ~ dnorm(mu,sigma),
  mu <- aR + bR*T,
  aR ~ dnorm(0,0.1), # regularized priors
  bR ~ dnorm(0,0.5),
  sigma ~ dexp(1)
) , data = d )
precis( m )
##           mean    sd   5.5% 94.5%
## aR    -8.7e-08 0.086 -0.138  0.14
## bR     2.9e-01 0.164  0.023  0.55
## sigma  9.2e-01 0.118  0.729  1.11
set.seed(42)
post <- extract.samples(m,n=1000)
summary(post)
##        aR               bR            sigma     
##  Min.   :-0.276   Min.   :-0.28   Min.   :0.58  
##  1st Qu.:-0.055   1st Qu.: 0.17   1st Qu.:0.84  
##  Median : 0.001   Median : 0.28   Median :0.91  
##  Mean   : 0.000   Mean   : 0.28   Mean   :0.92  
##  3rd Qu.: 0.056   3rd Qu.: 0.40   3rd Qu.:1.00  
##  Max.   : 0.299   Max.   : 0.85   Max.   :1.34
head(post)
##       aR   bR sigma
## 1  0.022 0.53  1.17
## 2 -0.024 0.20  0.99
## 3 -0.149 0.35  1.03
## 4 -0.173 0.39  0.95
## 5 -0.111 0.34  0.80
## 6  0.032 0.26  0.85

We compare this with the OLS regression.

m_OLS <- lm(R ~ 1 + T, data = d)
summary(m_OLS)
## 
## Call:
## lm(formula = R ~ 1 + T, data = d)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.958 -0.553 -0.203  0.171  3.625 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -7.91e-17   1.79e-01    0.00    1.000  
## T            3.20e-01   1.82e-01    1.75    0.091 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.96 on 27 degrees of freedom
## Multiple R-squared:  0.102,  Adjusted R-squared:  0.069 
## F-statistic: 3.07 on 1 and 27 DF,  p-value: 0.0909

Much the same story emerges here. The F-statistic, with its p-value of about 0.09, offers only weak, stargazing-grade evidence against the null hypothesis that the intercept and slope are both zero. Ahh, but this routine does not at all help us identify those observations which might more, or less, influence our over-fitting issues. The two models do seem to be in some sort of agreement on the size and direction of the point estimates.

What about our quap model? First, we compute the log-likelihood of each observation \(i\) at each sample \(s\) from the posterior:9

n_samples <- 1000
log_prob <- sapply( 1:n_samples ,
function(s) {
  mu <- post$aR[s] + post$bR[s]*d$T   # the quap parameters are named aR and bR
  dnorm( d$R , mu , post$sigma[s] , log=TRUE )
} )
str(log_prob)
##  num [1:29, 1:1000] -1.11 -1.2 -1.41 -1.08 -1.08 ...

The str() view shows a 29 × 1000 matrix: a row of log-probability scores for each observation, and a column for each of the 1000 posterior samples.

Computing WAIC

With all of this derived data available, we now compute \(lppd\), the Bayesian deviance, by averaging the likelihoods of the samples in each row, taking the log, and adding all of the logs together (sums of logs are like products of probabilities, all logical both-ands). We have to use the function log_sum_exp to compute the log of a sum of exponentiated terms, else we might not be as precise as we need to be, let alone experience underflow and overflow arithmetical headaches. After all of that we subtract the log of the number of samples, and thus get the log of the average.
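To see why the numerically stable route matters, a toy check (the extreme values are chosen only to force underflow; log_sum_exp() comes from the rethinking package already loaded above):

x <- c(-1000, -1001)                # two extreme log-probabilities
log( mean( exp(x) ) )               # -Inf: exp() underflows to zero
log_sum_exp(x) - log( length(x) )   # about -1000.4, the stable log of the average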

n_obs <- nrow( d )
lppd <- sapply( 1:n_obs , function(i) log_sum_exp(log_prob[i,]) - log(n_samples) )
sum(lppd)
## [1] -38

We will need sum(lppd) for the total lppd in the WAIC formula. The penalty term \(p_{WAIC}\) adds up the variance across samples for each and every observation.

pWAIC <- sapply( 1:n_obs , function(i) var(log_prob[i,]) )
sum(pWAIC)
## [1] 6
summary(pWAIC)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     0.0     0.2     0.0     4.7

Again sum(pWAIC) returns total pWAIC in the formula. At last we compute WAIC like this.

 -2*( sum(lppd) - sum(pWAIC) )
## [1] 89

We can compute the standard error of WAIC as the square root of the number of cases multiplied by the variance over the individual observation terms in WAIC. Why the -2? We get a positive statistic. The 2 apocryphally came from the base 2 number system used by Shannon in developing information theory. The traditional reason is that the sum of \(lppd\) and the sum of \(p_{WAIC}\) act like a likelihood ratio, and in logs a difference, thus a deviance. The numerator and denominator are chi-squared distributed with so many degrees of freedom in parameters and data. Scaling by 2 helped with the construction of likelihood ratio distribution tables. So says the received tradition. We recall that we multiplied the information entropy calculation by -1 to get bits, not just bits left over.

waic_vec <- -2*( lppd - pWAIC )
sqrt( n_obs*var(waic_vec) )
## [1] 23

Influence

Since each observation has a penalty in the \(p_{WAIC}\) vector, we attempt to identify those observations that contribute to overfitting using the pointwise=TRUE feature in the rethinking::WAIC() function.

WAIC(m, pointwise = TRUE)
##    WAIC  lppd penalty std_err
## 1   3.8 -1.44   0.477      24
## 2   1.8 -0.89   0.024      24
## 3   2.2 -1.06   0.051      24
## 4   1.8 -0.89   0.020      24
## 5   1.8 -0.86   0.017      24
## 6   2.0 -1.00   0.019      24
## 7   2.7 -1.31   0.042      24
## 8   2.8 -1.40   0.015      24
## 9   1.7 -0.84   0.017      24
## 10  1.7 -0.85   0.016      24
## 11  4.5 -2.14   0.115      24
## 12 26.4 -7.70   5.492      24
## 13  3.5 -1.72   0.052      24
## 14  2.4 -1.20   0.010      24
## 15  1.7 -0.82   0.017      24
## 16  1.7 -0.83   0.017      24
## 17  1.7 -0.82   0.017      24
## 18  1.7 -0.83   0.017      24
## 19  2.9 -1.43   0.044      24
## 20  1.7 -0.85   0.017      24
## 21  1.8 -0.90   0.015      24
## 22  1.7 -0.84   0.018      24
## 23  1.8 -0.87   0.019      24
## 24  1.9 -0.93   0.020      24
## 25  2.0 -1.00   0.019      24
## 26  2.5 -1.24   0.027      24
## 27  2.6 -1.28   0.037      24
## 28  2.8 -1.34   0.052      24
## 29  2.1 -1.04   0.026      24

Yes, the 12th observation has the highest deviance (WAIC version thereof) and the highest variance of the posterior distribution, that is, the highest penalty. Influential? Informative? Surprising? Again, yes, yes and yes.

Another model?

Of course! Let’s estimate another model and then compare it with the basic Laffer model above. We compare the Gaussian model with a so-called robust model using a Student’s-t distribution with somewhat thicker tails. The 2 degrees of freedom will definitely thicken those tails.

# linear model with student-t
m_t <- quap( alist(
  R ~ dstudent( 2 , mu , sigma ),
  mu <- a + b*T,
  a ~ dnorm( 0 , 0.2 ), # regularized priors
  b ~ dnorm( 0 , 0.5 ),
  sigma ~ dexp(1)
) , data=d )
precis( m_t )
##        mean    sd    5.5% 94.5%
## a     -0.17 0.095 -0.3245 -0.02
## b      0.19 0.123 -0.0051  0.39
## sigma  0.44 0.095  0.2928  0.60
WAIC( m_t )
##   WAIC lppd penalty std_err
## 1   75  -33     3.9      14
WAIC(m)
##   WAIC lppd penalty std_err
## 1   92  -38     7.8      26

Now let’s compare the two models.

# the default comparison is our new friend WAIC
compare( m, m_t )
##     WAIC SE dWAIC dSE pWAIC  weight
## m_t   74 13     0  NA   3.8 0.99968
## m     91 24    16  16   6.9 0.00032

Which would you choose? Here is where we can stand:

  • Goodness of fit always favors more complex models.

  • Information divergence is the right measure of model accuracy, but it too leads to more complex models.

  • Regularizing priors skeptical of extreme parameter values will begin to trade off complexity with accuracy and thus reduce overfitting.

  • Use of multiple criteria: cross-validation (LOO), PSIS, WAIC will complement regularizing priors.

So onward to model selection. The first right thing we did was contrive two different models. One delivers mesokurtic tails, the other leptokurtic tails. So which one? The one with the lower of the criterion values, because these are deviance, distance, divergence measures. Less divergence means better predictive capability. Less deviance from the data means more plausible hypotheses that are compatible with, consistent with, the data. On this basis we choose the robust Student’s-t model. But will this help us determine the best predictive model?

Anticipating the best predictive model we can compare the two as the difference in WAIC scores.

waic1 <- WAIC( m , pointwise=TRUE )
waic2 <- WAIC( m_t , pointwise=TRUE )
dwaic <- waic1 - waic2
plt_title <- "Taxes: model comparison"
waic_effort <- tibble( dwaic = dwaic$WAIC, T = d$T, tid = paste0("tax id ", 1:29) )
p <- waic_effort %>% 
  ggplot(aes(x = dwaic, y = T, label = tid)) +
  geom_vline( xintercept = 0, color = "red") +
  geom_point(color = "blue") +
  geom_text(aes( label = ifelse(dwaic > 10, as.character(tid),'')), hjust = 1, vjust = 1) +
  xlab("WAIC difference: the tails have it") + ylab("tax rate") +
  ggtitle( plt_title )
p

We might investigate tax id number 12. We also notice that the rest of the observations fall close to zero difference in their WAIC’s. Perhaps the two models are not that different after all?

The prediction-causality tradeoff

What about statistical causality and inference? And how do we get to causality? Another day and time will help point the way. In the meantime, it is important to realize that causally incorrect models often, in practice, produce seemingly accurate predictions. For accurate causal inference we really need a higher viewpoint, that of the heuristic structure of the problem we are trying to solve. What does that all mean? Statistical knowledge is simply inferior to, subsumed by, immanent in a higher integration of statistical knowledge with the knowledge which comes from development (think history and genetics) and dialectic (think alternative world views and horizons).

We will take a causal model of interaction between two customer bases to data, but it still might not provide better predictive power, that is, lower deviance or \(D_{KL}\) or WAIC or log likelihood, or PSIS. We are now into analyst horizons and constructive models of behavior that prescind from the data, but will ultimately be applied to the data for at least predictive validation.

Stargazing with Cassiopeia and Cepheus

The purportedly vain and beautiful Cassiopeia reigned as queen of Aethiopia with her husband King Cepheus. She allegedly committed the sin of hubris by saying that she and her daughter Andromeda were more beautiful than the daughters of the sea god Nereus, known as the Nereids. An enraged Poseidon sent the sea monster Cetus to plague the coasts of Aethiopia. An oracle bade Cepheus and Cassiopeia to sacrifice Andromeda in order to appease Poseidon. They chained their daughter to a rock on the shore as a sacrifice to Cetus. Perseus saved Andromeda, killed the beast, and married her.

Poseidon would not be outdone by the mortal Perseus. He still punished Cassiopeia. He tied her to a chair in the heavens forever. She would revolve upside down half of the time as the earth moved through time and the cosmos. One inferential moral of this ancient story for analysts is that we should never assume that appeasing our model with multiple sources of causal influences will result in a happy ending of results.

The Cassiopeia Effect is the result of our hubris, our possible arrogance that our models are sacrosanct. They are not. They are indeed very fragile. Also, in the story, no one expected Perseus to arrive on the scene. We might consider as the Perseus Effect the amazingly surprising result of an unexpected intervention, also known as a counterfactual condition.

Cassiopeia labors under the stars

Let’s investigate labor market gender discrimination to build perhaps a more realistic cause and effect model. Here \(D\) is gender discrimination, something not measured, and thus acts as an unobserved variable. We do observe gender with \(F\) as female, and some sort of measure of ability, such as years of schooling or scores on a test, with \(A\) as ability. We observe labor market outcomes as \(W\) for wages.

The theory posits that wages depend on discrimination. The model assumes that gender (\(F\) indicates reporting as female) influences both occupational decisions by labor suppliers and wage determination through discriminatory sorting of labor sources. We observe occupational quality, gender, ability, and wages. We do not directly observe discrimination in the data. Both wages and occupational choices depend on ability. This all makes a lot of theoretical sense. But do any of these forces confuse and confound the path from discrimination to wages?

Let’s map our story so far.

library(ggdag)   # for dagify(), tidy_dagitty(), ggdag(), and theme_dag()
tidy_ggdag <- dagify(
  D ~ F,
  O ~ A,
  O ~ F,
  O ~ D,
  W ~ D,
  W ~ O,
  W ~ A
) %>% 
  tidy_dagitty()
ggdag( tidy_ggdag ) + 
  theme_dag()
Causal impact on wages: occupation and ability impact wage sensitivity to gender.

The shape of this causal map mimics the constellation of Cassiopeia. We might be encouraged by conjecture, and the so-called hubris of Cassiopeia, to regress wages on all of the other influences which collide into wages. However, we have two forked confounders, \(D \leftarrow F \rightarrow O\) and \(O \leftarrow A \rightarrow W\), and one collider, \(F \rightarrow O \leftarrow D\), on the way to wages. These amount to three very sneaky backdoor passes to confusion about how discrimination can cause wages.
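We can ask the dagitty machinery to enumerate those paths for us. A sketch, with the same graph restated directly in dagitty syntax (a plain transcription of the dagify() call above); we do not reproduce the output here:

library(dagitty)
g <- dagitty("dag {
  F -> D
  F -> O
  A -> O
  D -> O
  D -> W
  O -> W
  A -> W
}")
paths( g, from = "D", to = "W" )   # every path from discrimination to wages, open or blocked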

A generative model will prod us

We will form and examine generative models to challenge our causal assumption. Here is a generative model of our discrimination analysis. Occupational decisions are plausibly abridged by discrimination, thus a negative relationship. Discrimination also negatively impacts wages through channels of sub-optimized labor choices. All other relationships are positive. The influence of being female on occupational choice has been eliminated here since this is one backdoor confounder on the path to wages.10

\[\begin{align} F &\sim \operatorname{Bernoulli}( p ) \\ D &= F \\ A &\sim \operatorname{Normal}( 0,\, 1) \\ O &\sim \operatorname{Normal}( 1+2A+0F-2D, \sigma) \\ W &\sim \operatorname{Normal}( 1-1D+1O+2A, \sigma) \end{align}\]

The 1-coefficients indicate a single arrow, and thus influence; the 2-coefficients indicate two arrows on paths into \(W\). For example there are 2 paths to \(W\) from \(A\), one through \(O\) and one direct, not through \(O\). There is only 1 direct path from \(D\) to \(W\), but 2 paths to \(W\) if we also include \(O\). There are 4 paths in total each into \(W\) and \(O\). Finally, since \(F\) is the indicator of \(D\), we might think we can detect \(D\) using just \(F\). This should be true; however, is \(F \rightarrow D \rightarrow W\) going to be magnified or diminished by \(O\) and \(A\), otherwise known as bias?

library(tidyverse)
library(rethinking)
library(stargazer)
#
N <- 100
data <- tibble(
  F = rbern( N ),
  D = F,
  A = rnorm( N ),
  O = 1 + 2*A + 0*F - 2*D + rnorm( N ),
  W = 1 - 1*D + 1*O + 2*A + rnorm( N ) 
)
#
lm_1 <- lm(W ~ F, data)
lm_2 <- lm(W ~ F + O, data)
lm_3 <- lm(W ~ F + O + A, data)
#
stargazer( lm_1, lm_2, lm_3, type = "text", column.labels = c("Biased Unconditional", "Biased", "Unbiased Conditional") )
## 
## ==========================================================================================
##                                              Dependent variable:                          
##                     ----------------------------------------------------------------------
##                                                       W                                   
##                      Biased Unconditional          Biased           Unbiased Conditional  
##                              (1)                     (2)                     (3)          
## ------------------------------------------------------------------------------------------
## F                         -3.000***                0.640*                 -0.950***       
##                            (0.960)                 (0.320)                 (0.310)        
##                                                                                           
## O                                                 1.900***                1.000***        
##                                                    (0.063)                 (0.110)        
##                                                                                           
## A                                                                         2.200***        
##                                                                            (0.250)        
##                                                                                           
## Constant                   2.700***                0.390*                 1.200***        
##                            (0.660)                 (0.220)                 (0.190)        
##                                                                                           
## ------------------------------------------------------------------------------------------
## Observations                 100                     100                     100          
## R2                          0.093                   0.910                   0.950         
## Adjusted R2                 0.083                   0.910                   0.950         
## Residual Std. Error    4.800 (df = 98)         1.500 (df = 97)         1.100 (df = 96)    
## F Statistic         10.000*** (df = 1; 98) 496.000*** (df = 2; 97) 603.000*** (df = 3; 96)
## ==========================================================================================
## Note:                                                          *p<0.1; **p<0.05; ***p<0.01

The first model would have us believe that \(F\) lowers \(W\) by more than 3 times the direct effect of \(-1\) we built into the data, because it also sweeps up the paths that run through occupation and ability. This model is definitely biased. It does have the right sign, so we may have appeased our theory, but we over-emphasized the role of discrimination in the presence of other influences such as occupational quality and ability.

The second model conditions on both \(F\) and \(O\), so that \(W\) seems to increase with \(F\) and \(O\). The sign of \(O\) is correct: higher-quality occupations determine higher wages. The sign of \(F\) is not the direction implied by the data; the data build in a negative direct effect of discrimination on wages. This bias comes from the netting of negative discrimination and positive ability through occupational quality onto wages.

A third model has the correct sign and size of influence of \(F \rightarrow D \rightarrow W\) discrimination on wages. This model also correctly represents, within sampling error, the size and sign of \(O\) and \(A\) on \(W\).

That was the conventional OLS approach. Here are the fully probabilistic models. The coefficients are carbon copies of the OLS models. These estimations will allow us to assess the relative plausibilities of each model as well as their informativeness.

m_1 <- quap( alist(
  W ~ dnorm( muW, sigma ),
  muW <- aW + bWF*F,
  aW ~ dnorm( 0, 1),
  c( bWF ) ~ dnorm( 0, 1),
  sigma ~ dexp( 1 )
), data = data )
precis( m_1 )
##       mean   sd 5.5% 94.5%
## aW     1.5 0.50  0.7   2.3
## bWF   -1.3 0.67 -2.3  -0.2
## sigma  4.7 0.33  4.2   5.2

Now for the second model.

m_2 <- quap( alist(
  W ~ dnorm( muW, sigma ),
  muW <- aW + bWF*F + bWO*O,
  aW ~ dnorm( 0, 1),
  c( bWF, bWO) ~ dnorm( 0, 1),
  sigma ~ dexp( 1 )
), data = data )
precis( m_2 )
##       mean    sd  5.5% 94.5%
## aW    0.41 0.205 0.079  0.74
## bWF   0.58 0.300 0.104  1.06
## bWO   1.87 0.061 1.769  1.96
## sigma 1.47 0.103 1.305  1.63
#

Here is the third model.

m_3 <- quap( 
  alist(
    W ~ dnorm( muW, sigma ),
    muW <- aW + bWF*F + bWO*O + bWA*A,
    aW ~ dnorm( 0, 1),
    c( bWF, bWO, bWA ) ~ dnorm( 0, 1),
    sigma ~ dexp( 1 )
), data = data )
precis( m_3 )
##        mean    sd  5.5% 94.5%
## aW     1.07 0.177  0.79  1.35
## bWF   -0.77 0.283 -1.22 -0.32
## bWO    1.07 0.104  0.91  1.24
## bWA    2.00 0.234  1.63  2.38
## sigma  1.11 0.078  0.98  1.23

With each of these models we then pull samples of each of the impact parameters (all unobserved data) to compare and contrast their locations and shapes.

library(ggridges)

without_O <- extract.samples( m_1 )$bWF #[ ,1 ]
with_O <- extract.samples( m_2 )$bWF #[ ,2 ]
with_O_A <- extract.samples( m_3 )$bWF #[ ,2 ]

plot_extract <- tibble( 
  sim =1:10000, 
  without_O, 
  with_O,
  with_O_A) %>% 
  pivot_longer( 
    -sim, 
    names_to = "model", 
    values_to = "bWF") 

plot_extract %>%   
  ggplot( aes( bWF, model) ) +
    geom_density_ridges() +
    scale_fill_viridis_d() +
    theme(legend.position = "bottom") +
    labs(title = "Predicted variation of slope sensitivity",
         subtitle = "Occupation and Ability impact Wage sensitivity to gender.")
Predicted impact on wages: occupation and ability impact wage sensitivity to gender.

Now for the variability estimated for each model. This parameter is analogous to the standard error of the OLS regression.

library(ggridges)
without_O <- extract.samples( m_1 )$sigma #[ ,1 ]
with_O <- extract.samples( m_2 )$sigma #[ ,2 ]
with_O_A <- extract.samples( m_3 )$sigma #[ ,2 ]

plot_extract <- tibble( 
  sim =1:10000, 
  without_O, 
  with_O,
  with_O_A) %>% 
  pivot_longer( 
    -sim, 
    names_to = "model", 
    values_to = "sigma") 

plot_extract %>%   
  ggplot( aes( sigma, model) ) +
    geom_density_ridges() +
    scale_fill_viridis_d() +
    theme(legend.position = "bottom")
Predicted variation of wages: occupation and ability impact wage sensitivity to gender.

These probabilistic models point to differences missed by the OLS approach. The \(\sigma\) credibility intervals not only get successively narrower in range, but also the size of \(\sigma\) gets appreciably smaller as we add ability, \(A\), to the mix of variables. We might be tempted to use the kitchen sink of all variables to improve the overall explanatory power of the model.

Some regression diagnostics are in order, even with our generative data. One more plot for another comparison: Pareto-smoothed importance sampling (PSIS) calculated pointwise will help us identify outlier differences among models. PSIS models the distribution of potentially outlying and influential observations pointwise with the generalized Pareto distribution and its shape parameter \(k=\xi\). Any point with \(k>0.5\) implies importance weights with infinite variance and thus contributes to a concentration of points in an ever-thickening tail of the distribution of uncertainty about the relationships we are modeling.

## R code 7.34 McElreath2020
#library( plotly )
options( digits=2, scipen=999999)
d <- data
set.seed(4284)
PSIS_m3 <- PSIS( m_3, pointwise=TRUE )
PSIS_m3 <- cbind( PSIS_m3, wage=d$W, occupation=d$O, ability=d$A, gender=d$F)
set.seed(4284)
#WAIC_m2.2 <- WAIC(m2.2,pointwise=TRUE)
p1 <- PSIS_m3 |>
  ggplot( aes( x=k, y=penalty, group=wage, color=gender )) +
  geom_point( shape=21 ) + 
  xlab("PSIS Pareto k") +
  ylab("PSIS penalty") + 
  geom_vline( xintercept = 0.03, linetype = "dashed")
p1
Highly influential observations and out-of-sample prediction. Male observations inhabit the NE quadrant with high penalty and Pareto k values. These observations are highly unpredictable.

The cautionary tale here is that we also need to consider the three potential biases: sign and size of the influence of a proposed variable on outcomes as well as the role and influence of outliers.

References

Devlin, Keith. 2010. The Unfinished Game: Pascal, Fermat, and the Seventeenth-Century Letter That Made the World Modern. Basic Books.
Knuth, Donald E. 1993. “Johann Faulhaber and Sums of Powers.” Mathematics of Computation 61 (203): 277–94.
Kullback, Solomon, and Richard A. Leibler. 1951. “On Information and Sufficiency.” The Annals of Mathematical Statistics 22 (1): 79–86.
Nettle, Daniel. 1998. “Explaining Global Patterns of Language Diversity.” Journal of Anthropological Archaeology 17 (4): 354–74.
Shannon, Claude Elwood. 1948. “A Mathematical Theory of Communication.” The Bell System Technical Journal 27 (3): 379–423.
Watanabe, S. 2009. Algebraic Geometry and Statistical Learning Theory. Cambridge Monographs on Applied and Computational Mathematics. Cambridge University Press. https://books.google.com/books?id=GnnUHasAtUQC.
Watanabe, Sumio. 2010. “Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory.” Journal of Machine Learning Research 11 (116): 3571–94. http://jmlr.org/papers/v11/watanabe10a.html.
Wildberger, NJ. 2021. “Algebraic Calculus One.” https://www.youtube.com/watch?v=8H2AcVIACRo&list=PLIljB45xT85CSlgGh36810KdTcOd21JFh&index=31.
Zellner, Arnold. 1971. An Introduction to Bayesian Inference in Econometrics. John Wiley; Sons, Inc.

  1. I first saw this decision frame in Zellner (1971). I used to use it in my managerial economics courses, especially for executives. The umbrella example is implicit in an analysis by Blaise Pascal in a letter to Pierre de Fermat translated in Devlin (2010), p. 171-179. I attended a workshop presented by Fred James, CERN, through the Fermi National Accelerator Laboratory in January 2000 with notes at https://conferences.fnal.gov/cl2k/fredjameslectures.pdf where the umbrella decision is analyzed.↩︎

  2. Much more interesting is the use of a perceptual manifold, and even more daunting are the psychological, sociological, cognitional, epistemological, metaphysical manifolds of information development, transmission, reception and processing. For now we will content ourselves with the relatively simple biological model.↩︎

  3. The anatomical elements of physical communication between two persons highlights the beginning of an understanding of the immense complexity of communication and the artifact, the information, being communicated. Information thus is not simply yes and/or no, it is the sequence of “yesses” and “noes.”↩︎

  4. Even the information-carrying characteristics of human languages eventually settle, at one time or another, or until some cataclysm changes the dynamics, into a stable set of tokens, rules, grammars, semantics (Nettle (1998)).↩︎

  5. Knuth (1993), p. 88, notes that Johann Faulhaber discovered the derivative. This would have been approximately in 1631 in Faulhaber’s book Academia Algebrae https://maa.org/press/periodicals/convergence/mathematical-treasures-johann-faulhabers-academia-algebrae, about 30 years before Newton and 50 years before Leibnitz.↩︎

  6. A sub-derivative \(D_k H(p)\) of order \(k\) relates to the usual calculus derivative \(d^kH(p)/dp^k=D^kH(p)\) through this expression: \[ D_k H(p) = \frac{D^k H(p)}{k!} \] where “sub” also refers to the subscript giving the order of the derivative. See Wildberger (2021) for details about the Lagrangian approach to an algebraic calculus on polynomials (polynumbers). The definition of \(\operatorname{Log}()\), truncated at the quadratic term, is \[ \operatorname{Log}(p)= (p-1) - \frac{1}{2}(p-1)^2 \] so that \(\operatorname{Log}(1-p) \approx -p - \frac{1}{2}p^2\). We can then express \(H(p)\) in this way \[ H(p)=-\left[p\left((p-1) - \frac{1}{2}(p-1)^2\right) + (1-p)\left(-p - \frac{1}{2}p^2\right)\right] \] since there are only two possibilities. ↩︎

  7. Also we should be sure to distinguish singular from regular learning models as in S. Watanabe (2009). The rendering of WAIC here follows very closely, sometimes exactly, McElreath’s (2020) excellent examples and exposition.↩︎

  8. Again we refer to S. Watanabe (2009) for the details around KL, especially when the Fisher Information Matrix itself is not invertible and therefore only semi-positive definite and thus the underlying observational model is locally singular. Singularity here implies that there is somewhere in the unobserved data space of so-called free parameters where inference using the Fisher Information Matrix is not valid, certainly not computable.↩︎

  9. We can try this sapply() calculation perhaps with a tidyverse version like map()?↩︎

  10. We use what I now consider to be a canonical model of causality introduced by Richard McElreath in a series of lectures all posted here: https://github.com/rmcelreath/causal_salad_2021. Any omissions, reversals, and other blemishes are of my own construction, not Richard’s.↩︎