Math, Grit and Multilevel Models (part 2)

In the previous post we explored the data visually and dealt with missing data. Here we turn to inference.

Frequentist inference

We first attempt to replicate the results of Sule and colleagues by building a classical regression model, in which math scores after the intervention (mathscore2) are explained by a number of variables.

We assume math scores are normally distributed at every level of the predictors. In the formula below, \(g_{i}\) is the grit variable, which tells us whether participant \(i\) is in the treatment group or not. Next, \(X_{i}\) is the matrix with the observed values of the predictor variables and \(\beta\) is the vector with their coefficients. These are the adjustment variables, which adjust for differences in the distribution of properties over the control and intervention groups. \(\alpha\) then is the estimated effect of the treatment. For the moment we treat grit as if it were assigned at the individual level. All variables are standardized, so the estimate of the treatment effect is standardized as well. Finally, \(\sigma\) denotes the standard deviation of the errors that arise because the model will not match the data points perfectly.

\[y_{i} \sim N(g_{i}\alpha + X_{i}\beta,\ \sigma),\quad \text{for } i = 1, 2, \ldots, n \]

So what variables should we adjust for? What is in \(X_{i}\)?

The authors write that they “control for gender, the Raven score, class size, and baseline beliefs and test scores in the estimation.” (p. 13)

We run a classical regression model with the specified adjustment variables.

model1 <- lm(mathscore2 ~  male + raven + csize + belief_survey1 + mathscore1 + grit, data=data2)
summary(model1)
## 
## Call:
## lm(formula = mathscore2 ~ male + raven + csize + belief_survey1 + 
##     mathscore1 + grit, data = data2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.75378 -0.49601  0.03382  0.54931  2.22676 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.388611   0.132002   2.944   0.0033 ** 
## male1             0.005715   0.046441   0.123   0.9021    
## raven             0.389370   0.026448  14.722  < 2e-16 ***
## csize            -0.013965   0.003568  -3.914 9.60e-05 ***
## belief_survey1    0.007380   0.023611   0.313   0.7547    
## mathscore1        0.277750   0.025870  10.736  < 2e-16 ***
## gritintervention  0.339352   0.050364   6.738 2.51e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7922 on 1173 degrees of freedom
##   (319 observations deleted due to missingness)
## Multiple R-squared:  0.3659, Adjusted R-squared:  0.3627 
## F-statistic: 112.8 on 6 and 1173 DF,  p-value: < 2.2e-16

We get an estimate of the effect of the intervention (grit) that is close to the number in the paper (0.339 in our model vs 0.311 in the paper). We will refer to model1 as the authors' model.

A natural question to ask is what happens if we leave out the adjustment variables.

model2 <- lm(mathscore2 ~ mathscore1 + grit, data=data2)
summary(model2)
## 
## Call:
## lm(formula = mathscore2 ~ mathscore1 + grit, data = data2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2245 -0.5705  0.0657  0.5790  2.4128 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -0.05896    0.03298  -1.788 0.074081 .  
## mathscore1        0.46595    0.02507  18.583  < 2e-16 ***
## gritintervention  0.17631    0.05105   3.454 0.000572 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8763 on 1209 degrees of freedom
##   (287 observations deleted due to missingness)
## Multiple R-squared:  0.2271, Adjusted R-squared:  0.2258 
## F-statistic: 177.6 on 2 and 1209 DF,  p-value: < 2.2e-16

We notice that the effect shrinks to 0.176. This is not to say that the effect of grit is overblown, but it does show that the choice of adjustment variables makes a difference. There are many variables one could adjust for. So why pick the ones that were picked in the paper?
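To see the sensitivity at a glance, one can pull the intervention estimate out of each fitted model and line the specifications up side by side. A minimal sketch, assuming the models fitted in this post are in the workspace:

```r
# Extract the standardized grit estimate and its standard error from a
# fitted lm; "gritintervention" is the coefficient name R generates above.
grit_effect <- function(fit) {
  summary(fit)$coefficients["gritintervention", c("Estimate", "Std. Error")]
}

# Compare the authors' specification with the minimal one.
sapply(list(authors = model1, minimal = model2), grit_effect)
```

The same helper can be reused for every specification that follows, which makes the instability of the estimate easy to tabulate.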

For example, adding a variable with responses to survey questions on grit from before the intervention (grit_survey1) seems relevant when estimating the effect of a grit intervention.

model3 <- lm(mathscore2 ~  male + raven + csize + belief_survey1 + mathscore1 + grit + grit_survey1, data=data2)
summary(model3)
## 
## Call:
## lm(formula = mathscore2 ~ male + raven + csize + belief_survey1 + 
##     mathscore1 + grit + grit_survey1, data = data2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.46442 -0.47280  0.03502  0.55390  1.97613 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.465399   0.127349   3.655 0.000269 ***
## male1             0.093703   0.046037   2.035 0.042043 *  
## raven             0.311069   0.027044  11.503  < 2e-16 ***
## csize            -0.017353   0.003455  -5.023 5.90e-07 ***
## belief_survey1   -0.005190   0.023093  -0.225 0.822231    
## mathscore1        0.175285   0.026999   6.492 1.26e-10 ***
## gritintervention  0.348122   0.049000   7.105 2.12e-12 ***
## grit_survey1      0.268547   0.027547   9.749  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7595 on 1139 degrees of freedom
##   (352 observations deleted due to missingness)
## Multiple R-squared:  0.4099, Adjusted R-squared:  0.4063 
## F-statistic:   113 on 7 and 1139 DF,  p-value: < 2.2e-16

We notice that the estimated effect of the intervention now comes out at 0.348, while the baseline grit survey variable itself gets a coefficient of 0.269.

If we extend the list of adjustment variables with age and wealth, the estimate moves to 0.321.

model4 <- lm(mathscore2 ~  male + raven + csize + belief_survey1 + mathscore1 + grit + grit_survey1 + age + wealth -1, data=data2)
summary(model4)
## 
## Call:
## lm(formula = mathscore2 ~ male + raven + csize + belief_survey1 + 
##     mathscore1 + grit + grit_survey1 + age + wealth - 1, data = data2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.44528 -0.47096  0.01795  0.52989  1.87278 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## male0             0.609473   0.484953   1.257   0.2092    
## male1             0.739472   0.491155   1.506   0.1325    
## raven             0.305226   0.031312   9.748  < 2e-16 ***
## csize            -0.020066   0.003903  -5.142 3.35e-07 ***
## belief_survey1   -0.005337   0.025263  -0.211   0.8327    
## mathscore1        0.177044   0.029890   5.923 4.51e-09 ***
## gritintervention  0.320626   0.054362   5.898 5.23e-09 ***
## grit_survey1      0.263701   0.031839   8.282 4.42e-16 ***
## age              -0.007721   0.049663  -0.155   0.8765    
## wealth2           0.087266   0.085567   1.020   0.3081    
## wealth3           0.067552   0.081169   0.832   0.4055    
## wealth4           0.066026   0.094371   0.700   0.4843    
## wealth5          -0.276895   0.154563  -1.791   0.0736 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7358 on 887 degrees of freedom
##   (599 observations deleted due to missingness)
## Multiple R-squared:  0.4091, Adjusted R-squared:  0.4004 
## F-statistic: 47.24 on 13 and 887 DF,  p-value: < 2.2e-16

In an experiment, the causal links between background variables and the outcome are cut by randomization. In this experiment, however, randomization took place at the group level, which may have left causal links at the individual level intact. Here we merely note the instability of the effect size under different choices of adjustment variables and proceed with the model of the authors.

Effect of missing data

Next, we want to know what the effect of the imputation is on the estimate of grit in model1.
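The imputed data sets in data2_imp were created with the mice package in the previous post. As a reminder, a minimal sketch could look like the following; the number of imputations m = 5 is an assumption inferred from the 20 post-warmup chains (5 data sets × 4 chains) reported by brm_multiple further down, and the seed is hypothetical.

```r
library(mice)

# Create multiply imputed versions of data2. m = 5 and the seed are
# assumptions; the previous post holds the actual call.
data2_imp <- mice(data2, m = 5, seed = 1, printFlag = FALSE)
```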

fit_imp_lm <- with(data2_imp, lm(mathscore2 ~  male + raven + csize + belief_survey1 + mathscore1 +grit))

# Pool and summarize regression results
lm_pooled <- pool(fit_imp_lm)
summary(lm_pooled, conf.int = TRUE, conf.level = 0.95)
##               term     estimate   std.error  statistic        df      p.value
## 1      (Intercept)  0.289362847 0.124973480  2.3153940 151.76531 2.193095e-02
## 2            male1 -0.007879627 0.046268305 -0.1703029 140.12669 8.650176e-01
## 3            raven  0.373986917 0.034718433 10.7719989  13.88407 4.002833e-08
## 4            csize -0.010747931 0.003408281 -3.1534760 116.81258 2.051113e-03
## 5   belief_survey1  0.011308371 0.023105899  0.4894149 180.63505 6.251418e-01
## 6       mathscore1  0.281858191 0.026053697 10.8183567 125.65493 0.000000e+00
## 7 gritintervention  0.251121313 0.050437855  4.9788262  89.66478 3.078342e-06
##         2.5 %       97.5 %
## 1  0.04245044  0.536275257
## 2 -0.09935383  0.083594577
## 3  0.29946491  0.448508921
## 4 -0.01749797 -0.003997897
## 5 -0.03428382  0.056900558
## 6  0.23029732  0.333419065
## 7  0.15091261  0.351330020

We note that the effect of grit decreases from 0.339 to 0.251, while the standard error barely moved.

We repeat this procedure for the estimate of grit on the scores on a math test 2.5 years later.

# Model of mathscore3 with missing data left out
model5 <- lm(mathscore3 ~  male + raven + csize + belief_survey1 + mathscore1 +grit, data=data2)
summary(model5)
## 
## Call:
## lm(formula = mathscore3 ~ male + raven + csize + belief_survey1 + 
##     mathscore1 + grit, data = data2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.40143 -0.56044 -0.06897  0.53427  2.56689 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.377996   0.184727   2.046  0.04112 *  
## male1             0.056511   0.062857   0.899  0.36896    
## raven             0.290866   0.035700   8.148 1.82e-15 ***
## csize            -0.012171   0.004978  -2.445  0.01475 *  
## belief_survey1    0.055629   0.032630   1.705  0.08869 .  
## mathscore1        0.358461   0.035002  10.241  < 2e-16 ***
## gritintervention  0.212708   0.069642   3.054  0.00235 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8131 on 671 degrees of freedom
##   (821 observations deleted due to missingness)
## Multiple R-squared:  0.3377, Adjusted R-squared:  0.3318 
## F-statistic: 57.03 on 6 and 671 DF,  p-value: < 2.2e-16
fit_imp_lm <- with(data2_imp, lm(mathscore3 ~  male + raven + csize + belief_survey1 + mathscore1 +grit))

# Model of mathscore3 with missing data imputed
lm_pooled <- pool(fit_imp_lm)
summary(lm_pooled)
##               term    estimate   std.error  statistic        df      p.value
## 1      (Intercept)  0.45694905 0.146080034  3.1280732  32.97576 3.665613e-03
## 2            male1  0.04869399 0.050143628  0.9710902  64.74918 3.351171e-01
## 3            raven  0.27036177 0.027723220  9.7521778  76.38398 4.662937e-15
## 4            csize -0.01327866 0.003569623 -3.7199057  89.06549 3.482151e-04
## 5   belief_survey1  0.02046722 0.025358982  0.8070995  64.57416 4.225721e-01
## 6       mathscore1  0.31669721 0.030732058 10.3051094  28.78585 3.613332e-11
## 7 gritintervention  0.13426666 0.049164673  2.7309580 260.45967 6.746394e-03

The model that uses imputed values estimates the effect to be 0.134, compared to 0.213 for the same model where missing data were simply left out.

Multilevel Bayesian inference

We now extend the model to include explanations at the group level. To this end we add \(\alpha_{j[i]}\), the intercept for the school \(j\) that student \(i\) attends; it can be read as a set of indicator variables, one for each of the \(J\) schools.

\[y_{i} \sim N(\alpha_{j[i]} + X_{i}\beta,\ \sigma_{y}),\quad \text{for } i = 1, 2, \ldots, n \]

The model becomes multilevel because we set up a model for these intercepts at the group level. That is, we model the school intercepts as normally distributed, where \(u_{j}\) indicates whether school \(j\) was in the treatment group and \(\gamma_{1}\) is the treatment effect. The random assignment took place at the level of the school, so it makes sense to insert the grit variable at this level.

\[\alpha_{j} \sim N(\gamma_{0} + \gamma_{1}u_{j},\ \sigma_{\alpha}),\quad \text{for } j = 1, 2, \ldots, J \]

brm_school1 <- 
  brm_multiple(data = data2_imp, 
      family = gaussian,
      mathscore2 ~  1 + male + raven + csize + belief_survey1 + mathscore1 + grit + (1 | schoolid),
      iter = 2000, warmup = 1000, chains = 4, cores = 4,  
      seed = 4387510, 
      file = "fits/brm_school1")

print(brm_school1)
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: mathscore2 ~ 1 + male + raven + csize + belief_survey1 + mathscore1 + grit + (1 | schoolid) 
##    Data: data2_imp (Number of observations: 1499) 
## Samples: 20 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 20000
## 
## Group-Level Effects: 
## ~schoolid (Number of levels: 16) 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     0.18      0.05     0.10     0.31 1.02     1260     9741
## 
## Population-Level Effects: 
##                  Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept            0.11      0.20    -0.29     0.49 1.02      954    14022
## male1                0.02      0.04    -0.06     0.11 1.02      867    13295
## raven                0.38      0.03     0.32     0.43 1.10      118      464
## csize               -0.01      0.01    -0.02     0.00 1.02      807    14678
## belief_survey1       0.01      0.02    -0.03     0.06 1.16       81      218
## mathscore1           0.26      0.03     0.21     0.31 1.05      229     2165
## gritintervention     0.24      0.11     0.02     0.46 1.01     2582    10806
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     0.81      0.02     0.78     0.85 1.21       65      409
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

We notice that the effect of the grit intervention is now estimated to be 0.24, which is a little smaller than in the individual-level regression (where a Bayesian version, not shown here, gave the same estimate as the classical one). Furthermore, the standard deviation of math scores among schools is 0.18, while the variation between students is captured by a standard deviation of 0.81. Hence, the school intercepts will be pulled appreciably towards the overall mean.
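These two standard deviations imply an intraclass correlation: the share of unexplained variance that sits between schools rather than between students. A quick back-of-the-envelope calculation with the posterior means reported above:

```r
# Posterior means from print(brm_school1) above.
sd_school  <- 0.18  # sd(Intercept) for schoolid
sd_student <- 0.81  # residual sigma

# Intraclass correlation: between-school variance as a share of the total.
icc <- sd_school^2 / (sd_school^2 + sd_student^2)
round(icc, 3)
## 0.047
```

So only about 5% of the residual variation sits between schools, which is why partial pooling shrinks the school estimates so strongly.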

Now what's the fun of having only one group level? There is also variation at a level between that of the student and the school, captured in the data set by the variable classid. So let's add it in.

\[y_{i} \sim N(\eta_{k[i]} + X_{i}\beta,\ \sigma_{y}),\quad \text{for } i = 1, 2, \ldots, n \]

\[\eta_{k} \sim N(\alpha_{j[k]},\ \sigma_{\eta}),\quad \text{for } k = 1, 2, \ldots, K \]

\[\alpha_{j} \sim N(\gamma_{0} + \gamma_{1}u_{j},\ \sigma_{\alpha}),\quad \text{for } j = 1, 2, \ldots, J \]

brm_school_class5 <- 
  brm_multiple(data = data2_imp, 
      family = gaussian,
      mathscore2 ~  1 + male + raven + csize + belief_survey1 + mathscore1 + grit + (1 | schoolid/classid),
      prior = c(prior(normal(0, 10), class = Intercept),
                prior(normal(0, 10), class = b),
                prior(cauchy(0, 10), class = sd),
                prior(cauchy(0, 10), class = sigma)),
      iter = 2000, warmup = 1000, chains = 4, cores = 4,  
      seed = 4387510,
      control = list(adapt_delta = .95),
      file = "fits/brm_school_class5")

print(brm_school_class5)
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: mathscore2 ~ 1 + male + raven + csize + belief_survey1 + mathscore1 + grit + (1 | schoolid/classid) 
##    Data: data2_imp (Number of observations: 1499) 
## Samples: 20 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 20000
## 
## Group-Level Effects: 
## ~schoolid (Number of levels: 16) 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     0.12      0.07     0.01     0.27 1.01     2694     4826
## 
## ~schoolid:classid (Number of levels: 42) 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     0.21      0.04     0.13     0.29 1.04      320     1891
## 
## Population-Level Effects: 
##                  Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept            0.23      0.24    -0.25     0.69 1.01     7008    13191
## male1                0.03      0.04    -0.05     0.11 1.02      826    12972
## raven                0.35      0.03     0.30     0.41 1.12      105      439
## csize               -0.01      0.01    -0.02     0.00 1.01     6271    12389
## belief_survey1       0.02      0.02    -0.03     0.06 1.14       88      251
## mathscore1           0.25      0.03     0.20     0.30 1.07      178      619
## gritintervention     0.22      0.11     0.00     0.44 1.01     6795    11280
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     0.80      0.02     0.76     0.83 1.25       57      378
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

In this final model, which accounts for variation at the class level as well, the estimate of grit is smaller still at 0.22. As we account for more of the variation that was not balanced by the randomization at the school level, the effect keeps shrinking. On the other hand, it does not disappear either, so there is reason to believe that the intervention was effective.
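To put a number on "does not disappear", one could compute the posterior probability that the effect is positive, for instance with the hypothesis() function from brms. A sketch, assuming brm_school_class5 is in the workspace:

```r
# Posterior probability that the intervention effect is positive.
# "gritintervention" is the population-level coefficient name shown above.
hypothesis(brm_school_class5, "gritintervention > 0")
```

The posterior probability reported in the Post.Prob column summarizes the evidence more directly than the 95% interval, whose lower bound sits right at zero here.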

Edi Terlaak

I like to tell stories about statistics.