So many languages

Daniel Nettle is a behavioral scientist who specializes in linguistic aspects of human development. His 1998 paper on language diversity, which we can access here, poses this question.

This paper, then, asks what, in general, determines the size of language communities found in a human population.

Language is a marker of regional integration or disintegration of societies. Powerful forces can unite, or divide, whole groups of people, often identifiable by the language of their region of origin. Food security is one of several fundamental drivers of movements of people from one region to another. Ecological risk, as measured by the mean length of the growing season in a region, correlates with food security. Notably for the time, Nettle published his paper together with the original data he used to conduct the analysis.

The data set data(nettle) contains measures of language diversity in 74 nations. The meaning of each column is given below.

  1. country: Name of the country
  2. num.lang: Number of recognized languages spoken
  3. area: Area in square kilometers
  4. k.pop: Population, in thousands
  5. num.stations: Number of weather stations that provided data for the next two columns
  6. mean.growing.season: Average length of growing season, in months
  7. sd.growing.season: Standard deviation of length of growing season, in months

Use these data to evaluate the hypothesis that language diversity is partly a product of food security. The notion is that, in productive ecologies, people don’t need large social networks to buffer them against the risk of food shortfalls. Cultural groups can therefore be smaller and more self-sufficient, leading to more languages per capita. Accordingly, we use the number of languages per capita as the outcome.

d$lang.per.cap <- d$num.lang / d$k.pop

Use the logarithm of this new variable as your regression outcome. We could instead model the language counts directly with a Poisson distribution; for now we use a Gaussian distribution on the log scale.
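As a quick sanity check of the transform (using two made-up rows, not the real nettle data), the log per-capita outcome can be computed directly:

```r
# Toy illustration of the outcome transform; the values below are
# made up and are not the real nettle data.
toy <- data.frame( num.lang = c(10, 100), k.pop = c(5000, 20000) )
toy$lang.per.cap <- toy$num.lang / toy$k.pop
toy$L <- log( toy$lang.per.cap )   # same as log(num.lang) - log(k.pop)
toy$L
```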

We will evaluate the effects of both mean.growing.season and sd.growing.season, as well as their two-way interaction.

Here are three parts to help organize the analysis.

A first model set

We fit three models to begin analyzing the hypothesis:

Language diversity, as measured by log(lang.per.cap), is positively associated with the average length of the growing season, mean.growing.season.

The models:

  1. We attempt to predict log_lang_per_capita (\(L\)) naively with a constant and no other predictors.

  2. We build on the naive model with only mean growing season (\(M\)) as a predictor.

  3. We expand on this model by including log(area) (\(A\)), which may either mediate the effect of growing season or independently influence language diversity.

The mediation model has this formulation.

\[ \begin{align} L & \sim \operatorname{Normal}(\mu_L, \sigma_L) \\ \mu_L & = \alpha_L + \beta_{LM} M + \beta_{LA} A\\ \alpha_L & \sim \operatorname{Normal}(median(L),2) \\ \beta_{LM} & \sim \operatorname{Normal}(0,2) \\ \beta_{LA} & \sim \operatorname{Normal}(0,2) \\ \sigma_L & \sim \operatorname{Exponential}(1) \\ M & \sim \operatorname{Normal}(\mu_M, \sigma_M) \\ \mu_M & = \alpha_M + \beta_{M} A\\ \alpha_M & \sim \operatorname{Normal}(median(M),2) \\ \beta_M & \sim \operatorname{Normal}(0,2) \\ \sigma_M & \sim \operatorname{Exponential}(1) \\ \end{align} \]

In this model we literally run one regression (\(\mu_M = \alpha_M + \beta_{M} A\)) to help with another regression (\(\mu_L = \alpha_L + \beta_{LM} M + \beta_{LA} A\)), where \(\mu_M, \sigma_M\) mediate movements in the main quantities of interest, \(\mu_L, \sigma_L\).

All of this deserves a comparison. We will use WAIC. A lower WAIC in one model indicates less information loss, measured as deviance, relative to the other models. WAIC, similar to the likelihood ratio, the F-test, and even \(R^2\), is a form of predictive statistical inference. Adding more covariates increases the complexity of the models and will likely decrease in-sample information loss.
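For reference, WAIC combines the log pointwise predictive density (the first sum below, computed over \(S\) posterior samples \(\theta_s\)) with a penalty term equal to the summed posterior variance of the pointwise log-likelihood:

\[ \operatorname{WAIC} = -2 \left( \sum_{i=1}^{N} \log \frac{1}{S} \sum_{s=1}^{S} p(y_i \mid \theta_s) \; - \; \sum_{i=1}^{N} \operatorname{var}_{\theta} \log p(y_i \mid \theta) \right) \]

The penalty term is sometimes called the effective number of parameters and is reported as pWAIC in the comparison tables below.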

A second model set would involve the standard deviation of the growing season. A third model set would divide the world into meaningful regions to examine the differences and similarities of language diversity globally.

We begin with the data.

library(rethinking)
data(nettle)
d <- nettle
summary( d )
##        country      num.lang           area             k.pop       
##  Algeria   : 1   Min.   :  1.00   Min.   :  12189   Min.   :   102  
##  Angola    : 1   1st Qu.: 17.25   1st Qu.: 167708   1st Qu.:  3829  
##  Australia : 1   Median : 40.00   Median : 434796   Median :  9487  
##  Bangladesh: 1   Mean   : 89.73   Mean   : 880698   Mean   : 33574  
##  Benin     : 1   3rd Qu.: 93.75   3rd Qu.:1080316   3rd Qu.: 24745  
##  Bolivia   : 1   Max.   :862.00   Max.   :8511965   Max.   :849638  
##  (Other)   :68                                                      
##   num.stations    mean.growing.season sd.growing.season
##  Min.   :  1.00   Min.   : 0.000      Min.   :0.0000   
##  1st Qu.: 10.00   1st Qu.: 5.348      1st Qu.:0.9375   
##  Median : 20.50   Median : 7.355      Median :1.6900   
##  Mean   : 37.91   Mean   : 7.041      Mean   :1.6992   
##  3rd Qu.: 44.75   3rd Qu.: 9.283      3rd Qu.:2.1075   
##  Max.   :272.00   Max.   :12.000      Max.   :5.8700   
## 

We note:

  • There are 74 countries represented.

  • The sample exhibits widely varying language diversity, area, and population.

  • We note the extreme gaps between the 75th percentiles and the maxima of num.lang, area, and k.pop. Scaling these variables will be important for regularizing the model.
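Given those long right tails, one way to rescale (a sketch only, illustrated here with three made-up area values rather than the real data) is to log-transform and then standardize:

```r
# Sketch: log then standardize a heavy-tailed predictor.
# The area values below are made up for illustration.
d_toy <- data.frame( area = c(12189, 434796, 8511965) )
d_toy$A_std <- as.numeric( scale( log(d_toy$area) ) )  # mean 0, sd 1
round( mean(d_toy$A_std), 10 )  # 0
```

The models below keep log(area) unstandardized; standardizing would mainly change the appropriate prior scales, not the substantive conclusions.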

Models

d$L <- log( d$num.lang / d$k.pop )
d$A <- log(d$area) 
d$M <- d$mean.growing.season
#
# m-models for the mean.growing.season
m0 <- quap(
  alist(
    L ~ dnorm(mu,sigma),
    mu <- aL + 0,
    aL ~ dnorm(-5,1.5),
    sigma ~ dexp(1)
  ) , data=d )
# Only M
m1 <- quap(
  alist(
    L ~ dnorm(mu,sigma),
    mu <- aL + bLM*M,
    aL ~ dnorm(-5,1.5),
    bLM ~ dnorm(0,0.5),
    sigma ~ dexp(1)
  ) , data=d )
#
# Independent M and A
# M->L<-A
m2 <- quap(
  alist(
    L ~ dnorm(mu,sigma),
    mu <- aL + bLM*M + bLA*A,
    aL ~ dnorm(-5,1.5),
    c(bLM,bLA) ~ dnorm(0,0.5),
    sigma ~ dexp(1)
  ) , data=d )
#
# A mediating M
# A->M->L
# 
m3 <- quap(
  alist(
    L ~ dnorm(mu,sigma),
    mu <- aL + bLM*M + bLA*A,
    aL ~ dnorm(-5,1.5),
    c(bLM,bLA) ~ dnorm(0,0.5),
    sigma ~ dexp(1),
    M ~ dnorm(muM, sigmaM),
    muM <- aM + bM*A,
    aM ~ dnorm(7.4,2),
    bM ~ dnorm(0,0.5),
    sigmaM ~ dexp(1)
  ) , data=d )
#
precis( m0, digits = 4, depth = 2)
##            mean        sd      5.5%     94.5%
## aL    -5.316868 0.1748574 -5.596324 -5.037412
## sigma  1.501725 0.1226248  1.305747  1.697703
precis( m1, digits = 4, depth = 2)
##             mean         sd        5.5%     94.5%
## aL    -6.5582826 0.38590283 -7.17502983 -5.941535
## bLM    0.1590494 0.05033912  0.07859774  0.239501
## sigma  1.3971365 0.11338212  1.21592998  1.578343
precis( m2, digits = 4, depth = 2)
##             mean         sd        5.5%       94.5%
## aL    -4.6211012 1.17312650 -6.49598391 -2.74621846
## bLM    0.1532642 0.04976670  0.07372742  0.23280101
## bLA   -0.1483655 0.08485312 -0.28397717 -0.01275384
## sigma  1.3779374 0.11179669  1.19926469  1.55661010
precis( m3, digits = 4, depth = 2)
##              mean         sd       5.5%       94.5%
## aL     -4.6185998 1.17313682 -6.4934991 -2.74370063
## bLM     0.1532266 0.04976838  0.0736871  0.23276607
## bLA    -0.1485246 0.08485416 -0.2841379 -0.01291123
## sigma   1.3779891 0.11180770  1.1992988  1.55667937
## aM      9.9346210 1.71495266  7.1937955 12.67544662
## bM     -0.2295071 0.13363964 -0.4430890 -0.01592513
## sigmaM  2.9699073 0.24155213  2.5838604  3.35595428
compare( m0, m1, m2, m3)
##        WAIC       SE      dWAIC       dSE    pWAIC      weight
## m2 266.6883 15.76641  0.0000000        NA 4.120060 0.417057791
## m3 267.0031 15.87093  0.3147907 0.2271039 4.297850 0.356320085
## m1 267.9250 15.59487  1.2366921 2.9128551 3.664950 0.224725296
## m0 277.4744 17.12809 10.7860833 7.3284663 3.208032 0.001896828
plot( coeftab( m0, m1, m2, m3 ) )

More covariates usually result in less deviance and more predictive accuracy. But how much more?

Model m0

Here we visualize the predictions alongside WAIC’s cousin, PSIS. Both use the Bayesian pointwise predictive likelihood. PSIS also provides cross-validation via the leave-one-out procedure. The horizontal axis measures the degree of unpredictability of each observation using the Pareto \(k\) shape parameter. The vertical axis measures the volatility of observations using the penalty component of the PSIS calculation.

set.seed(4284)
m <- m0
model_name <- format( m@formula[[2]])
PSIS_m <- PSIS( m, pointwise=TRUE )
## Some Pareto k values are high (>0.5). Set pointwise=TRUE to inspect individual points.
PSIS_m <- cbind( PSIS_m, 
                 country = d$country
                 )
p0_psis <- PSIS_m |> 
  ggplot( aes( x=k, y=penalty ) ) +
  geom_point( shape=21, color="blue" ) + 
  geom_vline( xintercept = 0.2, linetype = "dashed") + 
  geom_text(aes( label = ifelse((k > 0.2)&(penalty > 0.2), as.character(country),'')), hjust = 1, vjust = 1) +
  labs( title = "Bias",
        subtitle = model_name,
        x = ("Uncertainty: PSIS Pareto k"),
        y = ("Volatility: PSIS penalty"))
#p0_psis #ggplotly( p0_psis )
#
m <- m0
model_name <- format( m@formula[[2]] )
post <- extract.samples( m ) # pull 3 columns of sampled parms
M_seq <- seq( from=0, to=12, length.out=length(d$M)) # evenly spaced M values
mu_link <- function(M){
  # intercept-only model: mu does not depend on M
  mu <- post$aL
}
mu <- sapply( M_seq, mu_link )          # calculated line
mu_mean <- apply( mu, 2, mean )         # make averages
mu_PI <- apply( mu, 2, PI, prob=0.91)   # make probability intervals

#
#summary( mu_CI )
post_tbl <- tibble(
  M_seq = M_seq,
  lower = t(mu_PI)[,1],
  upper = t(mu_PI)[,2],
  mean = mu_mean,
  L = d$L,
  M = d$M,
  A = 13
)
# plot
p0_pred <- post_tbl |>
  ggplot( aes( x=M_seq, y=L ) )+
  geom_point() +
  geom_line( aes( x=M_seq, y=mean ), 
             color="red" ) +
  geom_ribbon( aes( ymin=lower, ymax=upper ), 
               alpha=0.1, fill = "blue",  
               color = "red", 
               linetype = "dotted" ) +
  labs( x = "Mean growing season",
        y = "Log languages per capita",
        title = "91% Predictive Intervals",
        subtitle = model_name
        )
# p0_pred

Commentary

(Note structural and statistical differences from the other models.)

Model: m1

# Run Model m1 PSIS here
# 
# 
# p1_psis
# Run Model m1 Predictive Probability Intervals here
# 
# 
# p1_pred

Commentary

(Note structural and statistical differences from the other models.)

Model: m2

# Run Model m2 PSIS here
# 
# 
# p2_psis
# Run Model m2 Predictive Probability Intervals here
# 
# 
# p2_pred

Commentary

(Note structural and statistical differences from the other models.)

Model: m3

set.seed(4284)
m <- m3
model_name <- format( m@formula[[2]])
PSIS_m <- PSIS( m, pointwise=TRUE )
## Some Pareto k values are high (>0.5). Set pointwise=TRUE to inspect individual points.
PSIS_m <- cbind( PSIS_m, 
                 country = d$country
                 )
p3_psis <- PSIS_m |> 
  ggplot( aes( x=k, y=penalty ) ) +
  geom_point( shape=21, color="blue" ) + 
  geom_vline( xintercept = 0.2, linetype = "dashed") + 
  geom_text(aes( label = ifelse((k > 0.2)&(penalty > 0.2), as.character(country),'')), hjust = 1, vjust = 1) +
  labs( title = "Bias",
        subtitle = model_name,
        x = ("Uncertainty: PSIS Pareto k"),
        y = ("Volatility: PSIS penalty"))
#p3_psis #ggplotly( p3_psis )
#
m <- m3
model_name <- format( m@formula[[2]] )
post <- extract.samples( m ) # pull 3 columns of sampled parms
M_seq <- seq( from=0, to=12, length.out=length(d$M)) # evenly spaced M values
mu_link <- function(M){
  # hold area fixed near its log median: log(434796) is roughly 13
  mu <- post$aL + post$bLM*M + post$bLA*13
}
mu <- sapply( M_seq, mu_link )          # calculated line
mu_mean <- apply( mu, 2, mean )         # make averages
mu_PI <- apply( mu, 2, PI, prob=0.91)   # make probability intervals

#
#summary( mu_CI )
post_tbl <- tibble(
  M_seq = M_seq,
  lower = t(mu_PI)[,1],
  upper = t(mu_PI)[,2],
  mean = mu_mean,
  L = d$L,
  M = d$M,
  A = 13
)
# plot
p3_pred <- post_tbl |>
  ggplot( aes( x=M_seq, y=L ) )+
  geom_point() +
  geom_line( aes( x=M_seq, y=mean ), 
             color="red" ) +
  geom_ribbon( aes( ymin=lower, ymax=upper ), 
               alpha=0.1, fill = "blue",  
               color = "red", 
               linetype = "dotted" ) +
  labs( x = "Mean growing season",
        y = "Log languages per capita",
        title = "91% Predictive Intervals",
        subtitle = model_name
        )
# p3_pred

Commentary

(Note structural and statistical differences from the other models.)

Overall assessment

Arrange plots in a grid using the patchwork package for ease of comparison.

library(patchwork)
# p0_psis | p0_pred

# for example, with models 0 and 3
(p0_psis / p3_psis) | (p0_pred / p3_pred)