Daniel Nettle is a behavioral scientist who specializes in linguistic aspects of human development. His 1998 paper on language diversity, which we can access here, poses this question.
This paper, then, asks the question of what, in general, determines the size of language communities found in a human population.
Language is a marker of the regional integration or disintegration of societies. Powerful forces can unite, or divide, whole groups of people, often identifiable by the language of their region of origin. Food security is one of several fundamental sources of movements of people from one region to another. Ecological risk, as measured by the length of the mean growing season in a region, correlates with food security. Notably for the time, Nettle published his paper along with the original data he used to conduct the analysis.
The dataset data(nettle)
contains data on language
diversity in 74 nations. A summary of the columns appears below.
Use these data to evaluate the hypothesis that language diversity is partly a product of food security. The notion is that, in productive ecologies, people don’t need large social networks to buffer them against the risk of food shortfalls. This means cultural groups can be smaller and more self-sufficient, leading to more languages per capita. In this way we use the number of languages per capita as the outcome:
d$lang.per.cap <- d$num.lang / d$k.pop
Use the logarithm of this new variable as the regression outcome. We could instead model the raw language counts with a Poisson distribution; for now we use a Gaussian model on the log scale.
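As a hedged aside (not pursued in this post), the count-based alternative could look roughly like the sketch below in quap, assuming d holds the nettle data loaded further down and using log population as an exposure offset. The names m_pois and log_pop, and the priors, are our own placeholders.
# Hedged sketch of a Poisson alternative (not fit in this post).
# Assumes d is the nettle data frame loaded later in the post.
d$log_pop <- log( d$k.pop )                  # exposure offset: population (thousands)
m_pois <- quap(
  alist(
    num.lang ~ dpois( lambda ),
    log(lambda) <- log_pop + a + bM*mean.growing.season,
    a ~ dnorm( 0 , 2 ),
    bM ~ dnorm( 0 , 0.5 )
  ) , data=d )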
We will evaluate the effects of both mean.growing.season
and sd.growing.season
, as well as their two-way
interaction.
The exercise has three parts.
First, we examine the hypothesis that language diversity, as
measured by log(lang.per.cap)
, is positively associated
with the average length of the growing season,
mean.growing.season
. Consider log(area)
in the
model(s) as a covariate (not an interaction).
Next we consider the hypothesis that language diversity is
negatively associated with the standard deviation of length of growing
season, sd.growing.season
. This hypothesis follows from
uncertainty in harvest favoring social insurance through larger social
networks and therefore fewer languages. Again, consider
log(area)
as a covariate (not an interaction). Interpret
your results.
Finally, we assess the hypothesis that
mean.growing.season
and sd.growing.season
interact to synergistically reduce language diversity. The notion is
that, in nations with longer average growing seasons, high variance
makes storage and redistribution even more important than it would be
otherwise. That way, people can cooperate to preserve and protect
harvest windfalls to be used during the droughts.
We fit four models to begin to analyze the first hypothesis: language diversity, as measured by
log(lang.per.cap)
, is positively associated with the average length of the growing season, mean.growing.season
.
The models:
We attempt to predict log languages per capita
(\(L\)) naively with a constant and no other
predictors (m0).
We build on the naive model with only mean growing season (\(M\)) as a predictor (m1).
We expand on this model by including log(area)
(\(A\)), either as an independent influence on
language diversity (m2) or as a variable that also drives the growing season in a mediation structure (m3).
The mediation model has this formulation.
\[ \begin{align} L & \sim \operatorname{Normal}(\mu_L, \sigma_L) \\ \mu_L & = \alpha_L + \beta_{LM} M + \beta_{LA} A\\ \alpha_L & \sim \operatorname{Normal}(median(L),2) \\ \beta_{LM} & \sim \operatorname{Normal}(0,2) \\ \beta_{LA} & \sim \operatorname{Normal}(0,2) \\ \sigma_L & \sim \operatorname{Exponential}(1) \\ M & \sim \operatorname{Normal}(\mu_M, \sigma_M) \\ \mu_M & = \alpha_M + \beta_{M} A\\ \alpha_M & \sim \operatorname{Normal}(median(M),2) \\ \beta_M & \sim \operatorname{Normal}(0,2) \\ \sigma_M & \sim \operatorname{Exponential}(1) \\ \end{align} \]
In this model we literally run one regression (\(\mu_M = \alpha_M + \beta_{M} A\)) to help with another regression (\(\mu_L = \alpha_L + \beta_{LM} M + \beta_{LA} A\)), where \(\mu_M, \sigma_M\) mediate movements in the main attraction, \(\mu_L, \sigma_L\). A counterfactual sketch of this mediation appears after the model fits below.
All of this will deserve a comparison. We will use WAIC. A lower WAIC in one model indicates a lower loss of information, measured on the deviance scale, relative to the other models. WAIC, like the likelihood ratio, the F-test, and even \(R^2\), is a form of predictive statistical inference. Adding more covariates increases the complexity of the models and will usually decrease in-sample information loss.
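For reference, the rethinking package reports WAIC on the deviance scale: the log pointwise predictive density, penalized by the pointwise posterior variance of the log-likelihood,
\[ \text{WAIC} = -2 \left( \underbrace{\sum_{i} \log \frac{1}{S} \sum_{s} p(y_i \mid \theta_s)}_{\text{lppd}} \; - \; \underbrace{\sum_{i} \operatorname{var}_{\theta} \log p(y_i \mid \theta)}_{p_{\text{WAIC}}} \right) \]
so a lower WAIC corresponds to better estimated out-of-sample predictive accuracy.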
A second model set would involve the standard deviation of the growing season. A third model set would divide the world into meaningful regions to examine the differences and similarities of language diversity globally.
We begin with the data.
library(rethinking)
library(ggplot2)   # used for the PSIS and prediction plots below
library(tibble)    # used for the prediction tables below
data(nettle)
d <- nettle
summary( d )
## country num.lang area k.pop
## Algeria : 1 Min. : 1.00 Min. : 12189 Min. : 102
## Angola : 1 1st Qu.: 17.25 1st Qu.: 167708 1st Qu.: 3829
## Australia : 1 Median : 40.00 Median : 434796 Median : 9487
## Bangladesh: 1 Mean : 89.73 Mean : 880698 Mean : 33574
## Benin : 1 3rd Qu.: 93.75 3rd Qu.:1080316 3rd Qu.: 24745
## Bolivia : 1 Max. :862.00 Max. :8511965 Max. :849638
## (Other) :68
## num.stations mean.growing.season sd.growing.season
## Min. : 1.00 Min. : 0.000 Min. :0.0000
## 1st Qu.: 10.00 1st Qu.: 5.348 1st Qu.:0.9375
## Median : 20.50 Median : 7.355 Median :1.6900
## Mean : 37.91 Mean : 7.041 Mean :1.6992
## 3rd Qu.: 44.75 3rd Qu.: 9.283 3rd Qu.:2.1075
## Max. :272.00 Max. :12.000 Max. :5.8700
##
We note:
There are 74 countries represented.
The sample exhibits widely varying language diversity, area, and population.
We note the large gaps between the 75th percentiles and the maxima for several variables. Scaling will be important for regularizing the models.
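The models below keep mean.growing.season and log(area) on their raw scales. As a hedged aside, one common alternative would be to standardize the predictors first, for example with rethinking's standardize(); the *_std names are our placeholders and are not used later.
# Hedged aside (not used in the models below): standardized predictors
d$M_std <- standardize( d$mean.growing.season )
d$S_std <- standardize( d$sd.growing.season )
d$A_std <- standardize( log(d$area) )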
d$L <- log( d$num.lang / d$k.pop )
d$A <- log(d$area)
d$M <- d$mean.growing.season
#
# m-models for the mean.growing.season
m0 <- quap(
alist(
L ~ dnorm(mu,sigma),
mu <- aL + 0,
aL ~ dnorm(5,1.5),
sigma ~ dexp(1)
) , data=d )
# Only M
m1 <- quap(
alist(
L ~ dnorm(mu,sigma),
mu <- aL + bLM*M,
aL ~ dnorm(-5,1.5),
bLM ~ dnorm(0,0.5),
sigma ~ dexp(1)
) , data=d )
#
# Independent M and A
# M->L<-A
m2 <- quap(
alist(
L ~ dnorm(mu,sigma),
mu <- aL + bLM*M + bLA*A,
aL ~ dnorm(-5,1.5),
c(bLM,bLA) ~ dnorm(0,0.5),
sigma ~ dexp(1)
) , data=d )
#
# A mediating M
# A -> M -> L
#
m3 <- quap(
alist(
L ~ dnorm(mu,sigma),
mu <- aL + bLM*M + bLA*A,
aL ~ dnorm(-5,1.5),
c(bLM,bLA) ~ dnorm(0,0.5),
sigma ~ dexp(1),
M ~ dnorm(muM, sigmaM),
muM <- aM + bM*A,
aM ~ dnorm(7.4,2),
bM ~ dnorm(0,0.5),
sigmaM ~ dexp(1)
) , data=d )
#
precis( m0, digits = 4, depth = 2)
## mean sd 5.5% 94.5%
## aL -5.316868 0.1748574 -5.596324 -5.037412
## sigma 1.501725 0.1226248 1.305747 1.697703
precis( m1, digits = 4, depth = 2)
## mean sd 5.5% 94.5%
## aL -6.5582826 0.38590283 -7.17502983 -5.941535
## bLM 0.1590494 0.05033912 0.07859774 0.239501
## sigma 1.3971365 0.11338212 1.21592998 1.578343
precis( m2, digits = 4, depth = 2)
## mean sd 5.5% 94.5%
## aL -4.6211012 1.17312650 -6.49598391 -2.74621846
## bLM 0.1532642 0.04976670 0.07372742 0.23280101
## bLA -0.1483655 0.08485312 -0.28397717 -0.01275384
## sigma 1.3779374 0.11179669 1.19926469 1.55661010
precis( m3, digits = 4, depth = 2)
## mean sd 5.5% 94.5%
## aL -4.6185998 1.17313682 -6.4934991 -2.74370063
## bLM 0.1532266 0.04976838 0.0736871 0.23276607
## bLA -0.1485246 0.08485416 -0.2841379 -0.01291123
## sigma 1.3779891 0.11180770 1.1992988 1.55667937
## aM 9.9346210 1.71495266 7.1937955 12.67544662
## bM -0.2295071 0.13363964 -0.4430890 -0.01592513
## sigmaM 2.9699073 0.24155213 2.5838604 3.35595428
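Because m3 also models M as a function of A, it supports a counterfactual simulation through the path A → M → L, in the style of rethinking's sim() with the vars argument. The following is a hedged sketch; the grid A_seq and the base-graphics plotting choices are ours, and it is not part of the comparison below.
# Hedged sketch: counterfactual simulation through the mediation path A -> M -> L
A_seq <- seq( from=9.5 , to=16 , length.out=30 )   # roughly the observed range of log(area)
s <- sim( m3 , data=data.frame(A=A_seq) , vars=c("M","L") )   # simulate M, then L
plot( A_seq , colMeans(s$L) , type="l" ,
      xlab="log(area) (counterfactual)" ,
      ylab="simulated log languages per capita" )
shade( apply( s$L , 2 , PI ) , A_seq )
With bM and bLA both estimated slightly negative above, the simulated effect of increasing log(area) on L combines the direct path and the path through the growing season.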
compare( m0, m1, m2, m3)
## WAIC SE dWAIC dSE pWAIC weight
## m2 266.6883 15.76641 0.0000000 NA 4.120060 0.417057791
## m3 267.0031 15.87093 0.3147907 0.2271039 4.297850 0.356320085
## m1 267.9250 15.59487 1.2366921 2.9128551 3.664950 0.224725296
## m0 277.4744 17.12809 10.7860833 7.3284663 3.208032 0.001896828
plot( coeftab( m0, m1, m2, m3 ) )
More covariates usually result in lower deviance and better in-sample prediction. But how much better?
The naive model, m0, is the least predictive by a wide margin.
The two more complex models, m2 and m3, are plausibly tied for first place: they have the lowest WAIC and carry most of the Akaike weight.
The mid-complexity model, m1, trails by about one WAIC unit, well within the standard error of the difference, so the three models that include mean.growing.season are hard to distinguish from an information-loss point of view.
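As a rough back-of-the-envelope check using the table above (a hedged sketch; nothing new is estimated):
# Rough 99% interval for the WAIC gap between m0 and m2,
# using dWAIC and dSE from the compare() table above (z ~ 2.6)
10.79 + c(-1,1) * 2.6 * 7.33
# the interval spans zero, so even this gap is not overwhelming relative to its SE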
Here we visualize the point-level diagnostics with WAIC’s cousin, PSIS. Both estimate out-of-sample predictive accuracy from the posterior; PSIS additionally approximates leave-one-out cross-validation. In the plots below, the horizontal axis flags influential, hard-to-predict observations via the Pareto shape parameter \(k\), and the vertical axis measures each observation’s contribution to the effective number of parameters via the penalty component of the PSIS calculation.
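As a cross-check (output not shown here), the same comparison table can be produced with PSIS in place of WAIC via compare()'s func argument:
# cross-check of the model comparison with PSIS instead of WAIC (output not shown)
compare( m0, m1, m2, m3, func=PSIS )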
set.seed(4284)
m <- m0
model_name <- format( m@formula[[2]])
PSIS_m <- PSIS( m, pointwise=TRUE )
## Some Pareto k values are high (>0.5). Set pointwise=TRUE to inspect individual points.
PSIS_m <- cbind( PSIS_m,
country = d$country
)
p0_psis <- PSIS_m |>
ggplot( aes( x=k, y=penalty ) ) +
geom_point( shape=21, color="blue" ) +
geom_vline( xintercept = 0.2, linetype = "dashed") +
geom_text(aes( label = ifelse((k > 0.2)&(penalty > 0.2), as.character(country),'')), hjust = 1, vjust = 1) +
labs( title = "Bias",
subtitle = model_name,
x = ("Uncertainty: PSIS Pareto k"),
y = ("Volatility: PSIS penalty"))
#p0_psis #ggplotly( p0_psis )
#
m <- m0
model_name <- format( m@formula[[2]] )
post <- extract.samples( m ) # posterior samples of the parameters
M_seq <- seq( from=0, to=12, length.out=length(d$M)) # evenly spaced grid over M
mu_link <- function(M){
# m0 is intercept-only; the slope terms (A held near its median, 13) stay commented out
mu <- post$aL #+ post$bLM*M + post$bLA*13
}
mu <- sapply( M_seq, mu_link ) # calculated line
mu_mean <- apply( mu, 2, mean ) # make averages
mu_PI <- apply( mu, 2, PI, prob=0.91) # make probability intervals
#
#summary( mu_PI )
post_tbl <- tibble(
M_seq = M_seq,
lower = t(mu_PI)[,1],
upper = t(mu_PI)[,2],
mean = mu_mean,
L = d$L,
M = d$M,
A = 13
)
# plot
p0_pred <- post_tbl |>
ggplot( aes( x=M_seq, y=L ) )+
geom_point() +
geom_line( aes( x=M_seq, y=mean ),
color="red" ) +
geom_ribbon( aes( ymin=lower, ymax=upper ),
alpha=0.1, fill = "blue",
color = "red",
linetype = "dotted" ) +
labs( x = "Mean growing season",
y = "Number of languages",
title = "91% Predictive Intervals",
subtitle = model_name
)
# p0_pred
Commentary
(Note structural and statistical differences from the other models.)
# Run Model m1 PSIS here
#
#
# p1_psis
# Run Model m1 Predictive Probability Intervals here
#
#
# p1_pred
Commentary
(Note structural and statistical differences from the other models.)
# Run Model m2 PSIS here
#
#
# p2_psis
# Run Model m2 Predictive Probability Intervals here
#
#
# p2_pred
Commentary
(Note structural and statistical differences from the other models.)
set.seed(4284)
m <- m3
model_name <- format( m@formula[[2]])
PSIS_m <- PSIS( m, pointwise=TRUE )
## Some Pareto k values are high (>0.5). Set pointwise=TRUE to inspect individual points.
PSIS_m <- cbind( PSIS_m,
country = d$country
)
p3_psis <- PSIS_m |>
ggplot( aes( x=k, y=penalty ) ) +
geom_point( shape=21, color="blue" ) +
geom_vline( xintercept = 0.2, linetype = "dashed") +
geom_text(aes( label = ifelse((k > 0.2)&(penalty > 0.2), as.character(country),'')), hjust = 1, vjust = 1) +
labs( title = "Bias",
subtitle = model_name,
x = ("Uncertainty: PSIS Pareto k"),
y = ("Volatility: PSIS penalty"))
#p3_psis #ggplotly( p3_psis )
#
m <- m3
model_name <- format( m@formula[[2]] )
post <- extract.samples( m ) # posterior samples of the parameters
M_seq <- seq( from=0, to=12, length.out=length(d$M)) # evenly spaced grid over M
mu_link <- function(M){
# hold log(area) at A = 13, approximately its median
mu <- post$aL + post$bLM*M + post$bLA*13
}
mu <- sapply( M_seq, mu_link ) # calculated line
mu_mean <- apply( mu, 2, mean ) # make averages
mu_PI <- apply( mu, 2, PI, prob=0.91) # make probability intervals
#
#summary( mu_PI )
post_tbl <- tibble(
M_seq = M_seq,
lower = t(mu_PI)[,1],
upper = t(mu_PI)[,2],
mean = mu_mean,
L = d$L,
M = d$M,
A = 13
)
# plot
p3_pred <- post_tbl |>
ggplot( aes( x=M_seq, y=L ) )+
geom_point() +
geom_line( aes( x=M_seq, y=mean ),
color="red" ) +
geom_ribbon( aes( ymin=lower, ymax=upper ),
alpha=0.1, fill = "blue",
color = "red",
linetype = "dotted" ) +
labs( x = "Mean growing season",
y = "Number of languages",
title = "91% Predictive Intervals",
subtitle = model_name
)
# p3_pred
Commentary
(Note structural and statistical differences from the other models.)
Arrange plots in a grid using the patchwork package for ease of comparison.
library(patchwork)
# p0_psis | p0_pred
# for example with models 0 and 3
(p0_psis / p3_psis) | (p0_pred / p3_pred)
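For the later parts of the exercise, here is a hedged sketch of how the sd.growing.season model and the interaction model might be specified in the same style. The names m_sd, m_int, and S, along with the priors, are our placeholders; these models are not fit or compared here.
# Hedged sketch (not fit here): models for the later parts of the exercise
d$S <- d$sd.growing.season
# sd.growing.season only, with log(area) as a covariate
m_sd <- quap(
  alist(
    L ~ dnorm( mu , sigma ),
    mu <- aL + bLS*S + bLA*A,
    aL ~ dnorm( -5 , 1.5 ),
    c(bLS,bLA) ~ dnorm( 0 , 0.5 ),
    sigma ~ dexp(1)
  ) , data=d )
# two-way interaction of mean and sd of growing season
m_int <- quap(
  alist(
    L ~ dnorm( mu , sigma ),
    mu <- aL + bLM*M + bLS*S + bLMS*M*S,
    aL ~ dnorm( -5 , 1.5 ),
    c(bLM,bLS,bLMS) ~ dnorm( 0 , 0.5 ),
    sigma ~ dexp(1)
  ) , data=d )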