11 Байесовский регрессионный анализ

library(tidyverse)

11.1 Основы регрессионного анализа

Когда мы используем регрессионный анализ, мы пытаемся оценить два параметра:

  • свободный член (intercept) – значение \(y\) при \(x = 0\);
  • угловой коэффициент (slope) – изменение \(y\) при изменении \(x\) на одну единицу.

\[y_i = \beta_0 + \beta_1\times x_i + \epsilon_i\]

Причем, иногда мы можем один или другой параметр считать равным нулю.

При этом, вне зависимости от статистической школы, у регрессии есть свои ограничения на применение:

  • линейность связи между \(x\) и \(y\);
  • нормальность распределение остатков \(\epsilon_i\);
  • гомоскидастичность — равномерность распределения остатков на всем протяжении \(x\);
  • независимость переменных;
  • независимость наблюдений друг от друга.

11.2 brms

Для анализа возьмем датасет, который я составил из UD-корпусов и попробуем смоделировать связь между количеством слов в тексте и количеством уникальных слов (закон Хердана-Хипса).

ud <- read_csv("https://raw.githubusercontent.com/agricolamz/udpipe_count_n_words_and_tokens/master/filtered_dataset.csv")
glimpse(ud)
Rows: 20,705
Columns: 5
$ doc_id      <chr> "KR1d0052_001", "KR1d0052_002", "KR1d0052_003", "KR1d0052_…
$ n_words     <dbl> 3516, 2131, 4927, 4884, 4245, 5027, 3406, 2202, 2673, 2300…
$ n_tokens    <dbl> 842, 546, 869, 883, 737, 1085, 494, 443, 573, 578, 660, 87…
$ language    <chr> "Classical_Chinese", "Classical_Chinese", "Classical_Chine…
$ corpus_code <chr> "Kyoto", "Kyoto", "Kyoto", "Kyoto", "Kyoto", "Kyoto", "Kyo…

Для начала, нарушим кучу ограничений на применение регрессии и смоделируем модель для вот таких вот данных, взяв только тексты меньше 1500 слов:

ud %>%
  filter(n_words < 1500) ->
  ud

ud %>% 
  ggplot(aes(n_words, n_tokens))+
  geom_point()

11.2.1 Модель только со свободным членом

library(brms)
parallel::detectCores()
[1] 8
n_cores <- 7 # parallel::detectCores() - 1

get_prior(n_tokens ~ 1, 
          data = ud)

Вот модель с встроенными априорными распределениями:

fit_intercept <- brm(n_tokens ~ 1, 
                     data = ud,
                     cores = n_cores, refresh = 0, silent = TRUE)
Running /usr/lib/R/bin/R CMD SHLIB foo.c
gcc -std=gnu99 -std=gnu11 -I"/usr/share/R/include" -DNDEBUG   -I"/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/Rcpp/include/"  -I"/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/RcppEigen/include/"  -I"/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/RcppEigen/include/unsupported"  -I"/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/BH/include" -I"/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/StanHeaders/include/src/"  -I"/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/StanHeaders/include/"  -I"/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/RcppParallel/include/"  -I"/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/rstan/include" -DEIGEN_NO_DEBUG  -DBOOST_DISABLE_ASSERTS  -DBOOST_PENDING_INTEGER_LOG2_HPP  -DSTAN_THREADS  -DBOOST_NO_AUTO_PTR  -include '/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp'  -D_REENTRANT -DRCPP_PARALLEL_USE_TBB=1      -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-i2PIHO/r-base-4.1.2=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c foo.c -o foo.o
In file included from /home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/RcppEigen/include/Eigen/Core:88,
                 from /home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/RcppEigen/include/Eigen/Dense:1,
                 from /home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13,
                 from <command-line>:
/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/RcppEigen/include/Eigen/src/Core/util/Macros.h:628:1: error: unknown type name ‘namespace’
  628 | namespace Eigen {
      | ^~~~~~~~~
/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/RcppEigen/include/Eigen/src/Core/util/Macros.h:628:17: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘{’ token
  628 | namespace Eigen {
      |                 ^
In file included from /home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/RcppEigen/include/Eigen/Dense:1,
                 from /home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13,
                 from <command-line>:
/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/RcppEigen/include/Eigen/Core:96:10: fatal error: complex: No such file or directory
   96 | #include <complex>
      |          ^~~~~~~~~
compilation terminated.
make: *** [/usr/lib/R/etc/Makeconf:168: foo.o] Error 1

При желании встроенные априорные расспеределения можно не использовать и вставлять в аргумент prior априорные распределения по вашему желанию.

fit_intercept
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: n_tokens ~ 1 
   Data: ud (Number of observations: 20282) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept   135.65      0.85   134.00   137.30 1.00     2734     2548

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma   121.75      0.60   120.58   122.93 1.00     3767     2604

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
plot(fit_intercept)

Взято остюда:

11.2.1.1 R-hat

R-hat convergence diagnostic compares the between- and within-chain estimates for model parameters and other univariate quantities of interest. If chains have not mixed well (ie, the between- and within-chain estimates don’t agree), R-hat is larger than 1. We recommend running at least four chains by default and only using the sample if R-hat is less than 1.01. Stan reports R-hat which is the maximum of rank normalized split-R-hat and rank normalized folded-split-R-hat, which works for thick tailed distributions and is sensitive also to differences in scale. For more details on this diagnostic, see https://arxiv.org/abs/1903.08008.

Recommendations:

  • Look at Bulk- and Tail-ESS for further information.
  • Look at the rank plot to see how the chains differ from each other.
  • Look at the local and quantile efficiency plots.
  • You might try setting a higher value for the iter argument. By default iter is 2000

11.2.1.2 Bulk ESS

Roughly speaking, the effective sample size (ESS) of a quantity of interest captures how many independent draws contain the same amount of information as the dependent sample obtained by the MCMC algorithm. Clearly, the higher the ESS the better. Stan uses R-hat adjustment to use the between-chain information in computing the ESS. For example, in case of multimodal distributions with well-separated modes, this leads to an ESS estimate that is close to the number of distinct modes that are found.

Bulk-ESS refers to the effective sample size based on the rank normalized draws. This does not directly compute the ESS relevant for computing the mean of the parameter, but instead computes a quantity that is well defined even if the chains do not have finite mean or variance. Overall bulk-ESS estimates the sampling efficiency for the location of the distribution (e.g. mean and median).

Often quite smaller ESS would be sufficient for the desired estimation accuracy, but the estimation of ESS and convergence diagnostics themselves require higher ESS. We recommend requiring that the bulk-ESS is greater than 100 times the number of chains. For example, when running four chains, this corresponds to having a rank-normalized effective sample size of at least 400.

Recommendations:

  • You might try setting a higher value for the iter argument. By default iter is 2000
  • Look at the rank plot to see how the chains differ from each other.
  • Look at the local and quantile efficiency plots.
  • Look at change in bulk-ESS when the number of iterations increase. If R-hat is less than 1.01 and bulk-ESS grows linearly with the number of iterations and eventually exceeds the recommended limit, the mixing is sufficient but MCMC has high autocorrelation requiring a large number of iterations

11.2.1.3 Tail ESS

Tail-ESS computes the minimum of the effective sample sizes (ESS) of the 5% and 95% quantiles. Tail-ESS can help diagnosing problems due to different scales of the chains and slow mixing in the tails. See also general information about ESS above in description of bulk-ESS.

Recommendations:

  • You might try setting a higher value for the iter argument. By default iter is 2000
  • Look at the rank plot to see how the chains differ from each other.
  • Look at the local and quantile efficiency plots.
  • Look at change in tail-ESS when the number of iterations increase. If R-hat is less than 1.01 and tail-ESS grows linearly with the number of iterations and eventually exceeds the recommended limit, the mixing is sufficient but MCMC has high autocorrelation requiring a large number of iterations

11.2.1.4 Вернемся к нашим баранам

Давайте посмотрим на наши данные:

Оригинал мема. Вот еще один.

11.2.2 Модель только с угловым коэффициентом

get_prior(n_tokens ~ n_words+0,
          data = ud)
fit_slope <- brm(n_tokens ~ n_words+0, 
                 data = ud,
                 cores = n_cores, refresh = 0, silent = TRUE)
Running /usr/lib/R/bin/R CMD SHLIB foo.c
gcc -std=gnu99 -std=gnu11 -I"/usr/share/R/include" -DNDEBUG   -I"/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/Rcpp/include/"  -I"/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/RcppEigen/include/"  -I"/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/RcppEigen/include/unsupported"  -I"/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/BH/include" -I"/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/StanHeaders/include/src/"  -I"/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/StanHeaders/include/"  -I"/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/RcppParallel/include/"  -I"/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/rstan/include" -DEIGEN_NO_DEBUG  -DBOOST_DISABLE_ASSERTS  -DBOOST_PENDING_INTEGER_LOG2_HPP  -DSTAN_THREADS  -DBOOST_NO_AUTO_PTR  -include '/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp'  -D_REENTRANT -DRCPP_PARALLEL_USE_TBB=1      -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-i2PIHO/r-base-4.1.2=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c foo.c -o foo.o
In file included from /home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/RcppEigen/include/Eigen/Core:88,
                 from /home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/RcppEigen/include/Eigen/Dense:1,
                 from /home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13,
                 from <command-line>:
/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/RcppEigen/include/Eigen/src/Core/util/Macros.h:628:1: error: unknown type name ‘namespace’
  628 | namespace Eigen {
      | ^~~~~~~~~
/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/RcppEigen/include/Eigen/src/Core/util/Macros.h:628:17: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘{’ token
  628 | namespace Eigen {
      |                 ^
In file included from /home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/RcppEigen/include/Eigen/Dense:1,
                 from /home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13,
                 from <command-line>:
/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/RcppEigen/include/Eigen/Core:96:10: fatal error: complex: No such file or directory
   96 | #include <complex>
      |          ^~~~~~~~~
compilation terminated.
make: *** [/usr/lib/R/etc/Makeconf:168: foo.o] Error 1
fit_slope
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: n_tokens ~ n_words + 0 
   Data: ud (Number of observations: 20282) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects: 
        Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
n_words     0.53      0.00     0.53     0.53 1.00     4360     3010

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma    42.79      0.21    42.40    43.20 1.01     1180     1357

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
plot(fit_slope)

Давайте совместим предсказания модели и наши наблюдения.

library(tidybayes)

ud %>% 
  add_epred_draws(fit_slope, ndraws = 50) %>%  
  ggplot(aes(n_words, n_tokens))+
  geom_point(alpha = 0.01)+
  stat_lineribbon(aes(y = .epred), color = "red") 

11.2.3 Модель с угловым коэффициентом и свободным членом

get_prior(n_tokens ~ n_words,
          data = ud)
fit_slope_intercept <- brm(n_tokens ~ n_words,
                           data = ud,
                           cores = n_cores, refresh = 0, silent = TRUE)
Running /usr/lib/R/bin/R CMD SHLIB foo.c
gcc -std=gnu99 -std=gnu11 -I"/usr/share/R/include" -DNDEBUG   -I"/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/Rcpp/include/"  -I"/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/RcppEigen/include/"  -I"/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/RcppEigen/include/unsupported"  -I"/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/BH/include" -I"/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/StanHeaders/include/src/"  -I"/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/StanHeaders/include/"  -I"/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/RcppParallel/include/"  -I"/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/rstan/include" -DEIGEN_NO_DEBUG  -DBOOST_DISABLE_ASSERTS  -DBOOST_PENDING_INTEGER_LOG2_HPP  -DSTAN_THREADS  -DBOOST_NO_AUTO_PTR  -include '/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp'  -D_REENTRANT -DRCPP_PARALLEL_USE_TBB=1      -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-i2PIHO/r-base-4.1.2=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c foo.c -o foo.o
In file included from /home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/RcppEigen/include/Eigen/Core:88,
                 from /home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/RcppEigen/include/Eigen/Dense:1,
                 from /home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13,
                 from <command-line>:
/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/RcppEigen/include/Eigen/src/Core/util/Macros.h:628:1: error: unknown type name ‘namespace’
  628 | namespace Eigen {
      | ^~~~~~~~~
/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/RcppEigen/include/Eigen/src/Core/util/Macros.h:628:17: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘{’ token
  628 | namespace Eigen {
      |                 ^
In file included from /home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/RcppEigen/include/Eigen/Dense:1,
                 from /home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13,
                 from <command-line>:
/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/RcppEigen/include/Eigen/Core:96:10: fatal error: complex: No such file or directory
   96 | #include <complex>
      |          ^~~~~~~~~
compilation terminated.
make: *** [/usr/lib/R/etc/Makeconf:168: foo.o] Error 1
fit_slope_intercept
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: n_tokens ~ n_words 
   Data: ud (Number of observations: 20282) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    24.31      0.37    23.60    25.03 1.00     2919     2740
n_words       0.48      0.00     0.48     0.48 1.00     4859     2569

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma    39.05      0.19    38.67    39.44 1.00     2290     2065

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
plot(fit_slope_intercept)

ud %>% 
  add_epred_draws(fit_slope_intercept, ndraws = 50) %>%  
  ggplot(aes(n_words, n_tokens))+
  geom_point(alpha = 0.01)+
  stat_lineribbon(aes(y = .epred), color = "red") 

11.2.4 Модель со смешанными эффектами

В данных есть группировка по языкам, которую мы все это время игнорировали. Давайте сделаем модель со смешанными эффектами:

get_prior(n_tokens ~ n_words+(1|language),
          data = ud)
fit_mixed <- brm(n_tokens ~ n_words + (1|language),
                 data = ud,
                 cores = n_cores, refresh = 0, silent = TRUE)
Running /usr/lib/R/bin/R CMD SHLIB foo.c
gcc -std=gnu99 -std=gnu11 -I"/usr/share/R/include" -DNDEBUG   -I"/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/Rcpp/include/"  -I"/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/RcppEigen/include/"  -I"/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/RcppEigen/include/unsupported"  -I"/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/BH/include" -I"/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/StanHeaders/include/src/"  -I"/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/StanHeaders/include/"  -I"/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/RcppParallel/include/"  -I"/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/rstan/include" -DEIGEN_NO_DEBUG  -DBOOST_DISABLE_ASSERTS  -DBOOST_PENDING_INTEGER_LOG2_HPP  -DSTAN_THREADS  -DBOOST_NO_AUTO_PTR  -include '/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp'  -D_REENTRANT -DRCPP_PARALLEL_USE_TBB=1      -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-i2PIHO/r-base-4.1.2=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c foo.c -o foo.o
In file included from /home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/RcppEigen/include/Eigen/Core:88,
                 from /home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/RcppEigen/include/Eigen/Dense:1,
                 from /home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13,
                 from <command-line>:
/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/RcppEigen/include/Eigen/src/Core/util/Macros.h:628:1: error: unknown type name ‘namespace’
  628 | namespace Eigen {
      | ^~~~~~~~~
/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/RcppEigen/include/Eigen/src/Core/util/Macros.h:628:17: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘{’ token
  628 | namespace Eigen {
      |                 ^
In file included from /home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/RcppEigen/include/Eigen/Dense:1,
                 from /home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13,
                 from <command-line>:
/home/agricolamz/R/x86_64-pc-linux-gnu-library/4.1/RcppEigen/include/Eigen/Core:96:10: fatal error: complex: No such file or directory
   96 | #include <complex>
      |          ^~~~~~~~~
compilation terminated.
make: *** [/usr/lib/R/etc/Makeconf:168: foo.o] Error 1
fit_mixed
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: n_tokens ~ n_words + (1 | language) 
   Data: ud (Number of observations: 20282) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Group-Level Effects: 
~language (Number of levels: 9) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)    64.03     18.87    38.13   111.85 1.01      558     1054

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     5.91     20.69   -36.51    46.11 1.00      642      824
n_words       0.49      0.00     0.49     0.49 1.00     4075     2333

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma    29.28      0.14    29.01    29.55 1.00     2316     1988

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
plot(fit_mixed)

ud %>% 
  add_epred_draws(fit_mixed, ndraws = 50) %>%  
  ggplot(aes(n_words, n_tokens))+
  geom_point(alpha = 0.01)+
  stat_lineribbon(aes(y = .epred), color = "red") +
  facet_wrap(~language)

То, что получилось учитывает общий эффект всех языков: посмотрите на каталанский. Если построить модель по каждому языку, то получится совсем другая картина:

ud %>% 
  ggplot(aes(n_words, n_tokens))+
  geom_smooth(method = "lm") +
  geom_point(alpha = 0.3)+
  facet_wrap(~language)

В работе (Coretta 2016) собраны данные длительности исландских гласных. Используя байесовскую регрессию, смоделируйте длительность гласного (vowel.dur) в зависимости от аспирированности (aspiration) учитывая эффект носителя. Визуализируйте результаты модели

Ссылки на литературу

Coretta, Stefano. 2016. “Vowel Duration and Aspiration Effects in Icelandic.” University of York.