0. Введение

Некоторые думают, что линейная регрессия решит все их проблемы (по крайней мере те из них, которые связаны с предсказанием какой-то числовой переменной). Это так. Но нужно быть осторожным — у регрессии есть свои ограничения на применение.

0.1 Библиотеки

library(tidyverse)
library(lme4)
library(lmerTest)
library(lingtypology) # only for linguistic mapping

0.2 Lexical Decision Task data

Dataset and description from Rling package by Natalia Levshina. This data set contains 100 randomly selected words from the English Lexicon Project data (Balota et al. 2007), their lengths, mean reaction times and corpus frequencies.

ldt <- read_csv("https://goo.gl/ToxfU6")
ldt

0.3 UPSID

In this dataset we have number of consonants and vowels in 402 languages collected from UPSID database (http://www.lapsyd.ddl.ish-lyon.cnrs.fr/lapsyd/). There is an variable of the area based on Glottolog (http://glottolog.org/). In this part we will try to make models that predict number of vowels by number of consonants.

upsid <- read_csv("https://raw.githubusercontent.com/agricolamz/2019_data_analysis_for_linguists/master/data/upsid.csv")

## Parsed with column specification:
## cols(
##   language = col_character(),
##   area = col_character(),
##   consonants = col_double(),
##   vowels = col_double()
## )

upsid

map.feature(upsid$language, 
            features = upsid$area,
            label = upsid$language,
            label.hide = TRUE)

1. Нелинейность взаимосвязи

Давайте посмотрим на простой график:

ldt %>% 
  ggplot(aes(Mean_RT, Freq))+
  geom_point()+
  theme_bw()

Регрессия на таких данных будет супер неиформативна:

ldt %>% 
  ggplot(aes(Mean_RT, Freq))+
  geom_point()+
  geom_smooth(method = "lm")+
  theme_bw()

m1 <- summary(lm(Mean_RT~Freq, data = ldt))
m1

## 
## Call:
## lm(formula = Mean_RT ~ Freq, data = ldt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -224.93  -85.42  -30.52   81.90  632.66 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 826.998242  15.229783  54.301  < 2e-16 ***
## Freq         -0.005595   0.001486  -3.765 0.000284 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 143.9 on 98 degrees of freedom
## Multiple R-squared:  0.1264, Adjusted R-squared:  0.1174 
## F-statistic: 14.17 on 1 and 98 DF,  p-value: 0.0002843

1.1 Логарифмирование

ldt %>% 
  ggplot(aes(Mean_RT, log(Freq)))+
  geom_point()+
  geom_smooth(method = "lm")+
  theme_bw()

## Warning: Removed 5 rows containing non-finite values (stat_smooth).

ldt %>% 
  ggplot(aes(Mean_RT, log(Freq+1)))+
  geom_point()+
  geom_smooth(method = "lm")+
  theme_bw()

m2 <- summary(lm(Mean_RT~log(Freq+1), data = ldt))
m2

## 
## Call:
## lm(formula = Mean_RT ~ log(Freq + 1), data = ldt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -242.36  -76.66  -17.49   48.64  630.49 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1001.60      29.79  33.627  < 2e-16 ***
## log(Freq + 1)   -34.03       4.76  -7.149 1.58e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 124.8 on 98 degrees of freedom
## Multiple R-squared:  0.3428, Adjusted R-squared:  0.3361 
## F-statistic: 51.11 on 1 and 98 DF,  p-value: 1.576e-10

m1$adj.r.squared

## [1] 0.1174405

m2$adj.r.squared

## [1] 0.336078

Отлогорифмировать можно и другую переменную.

ldt %>% 
  ggplot(aes(log(Mean_RT), log(Freq  + 1)))+
  geom_point()+
  geom_smooth(method = "lm")+
  theme_bw()

m3 <- summary(lm(log(Mean_RT)~log(Freq+1), data = ldt))
m1$adj.r.squared

## [1] 0.1174405

m2$adj.r.squared

## [1] 0.336078

m3$adj.r.squared

## [1] 0.3838649

Как интерпретировать полученную регрессию с двумя отлогорифмированными значениями?

В обычной линейной регресии мы узнаем отношения между \(x\) и \(y\): \[y_i = \beta_0+\beta_1\times x_i\]

Как изменится \(y_j\), если мы увеличем \(x_i + 1 = x_j\)? \[y_j = \beta_0+\beta_1\times x_j\]

\[y_j - y_i = \beta_0+\beta_1\times x_j - (\beta_0+\beta_1\times x_i) = \beta_1(x_j - x_i)\]

Т. е. \(y\) увеличится на \(\beta_1\) , если \(x\) увеличится на 1. Что же будет с логарифмированными переменными? Как изменится \(y_j\), если мы увеличем \(x_i + 1 = x_j\)?

\[\log(y_j) - \log(y_i) = \beta_1\times (\log(x_j) - \log(x_i))\]

\[\log\left(\frac{y_j}{y_i}\right) = \beta_1\times \log\left(\frac{x_j}{x_i}\right) = \log\left(\frac{x_j}{x_i}\right) ^ {\beta_1}\]

\[\frac{y_j}{y_i}= \left(\frac{x_j}{x_i}\right) ^ {\beta_1}\]

Т. е. \(y\) увеличится на \(\beta_1\) процентов, если \(x\) увеличится на 1 процент.

Логарифмирование — не единственный вид траснформации:

трансформация Тьюки

shiny::runGitHub("agricolamz/tukey_transform")

трансформация Бокса — Кокса
…

2. Нормальность распределение остатков

Линейная регрессия предполагает нормальность распределения остатков. Когда связь не линейна, то остатки тоже будут распределены не нормально.

qqnorm(m1$residuals)
qqline(m1$residuals)

qqnorm(m2$residuals)
qqline(m2$residuals)

qqnorm(m3$residuals)
qqline(m3$residuals)

3. Гетероскидастичность

Распределение остатков непостоянно (т. е. не гомоскидастичны):

ldt %>% 
  ggplot(aes(Mean_RT, Freq))+
  geom_point()+
  theme_bw()

Тоже решается преобазованием данных.

4. Мультиколлинеарность

Линейная связь между некоторыми предикторами в модели.

корреляционная матрица
VIF (Variance inflation factor), car::vif()
VIF = 1 (Not correlated)
1 < VIF < 5 (Moderately correlated)
VIF >=5 (Highly correlated)

5. Независимость наблюдений

Наблюдения должны быть независимы.

upsid %>% 
  ggplot(aes(consonants, vowels))+
  geom_point()+
  labs(x = "number of consonants",
       y = "number of vowels",
       caption = "data from LAPSyD")+
  theme_bw()

Обведем наблюдения по каждому спикеру:

upsid %>% 
  ggplot(aes(consonants, vowels, color = area))+
  geom_point()+
  labs(x = "number of consonants",
       y = "number of vowels",
       caption = "data from LAPSyD")+
  theme_bw()+
  stat_ellipse()

Построим простую регрессию и добавим ее на график:

fit1 <- lm(vowels~consonants, data = upsid)
summary(fit1)

## 
## Call:
## lm(formula = vowels ~ consonants, data = upsid)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.148 -2.897 -1.208  2.441 33.203 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.96920    0.56512  10.563  < 2e-16 ***
## consonants   0.11259    0.02317   4.859 1.63e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.67 on 449 degrees of freedom
## Multiple R-squared:  0.04996,    Adjusted R-squared:  0.04785 
## F-statistic: 23.61 on 1 and 449 DF,  p-value: 1.63e-06

upsid %>% 
  ggplot(aes(consonants, vowels))+
  geom_point()+
  labs(x = "number of consonants",
       y = "number of vowels",
       caption = "data from LAPSyD")+
  theme_bw()+
  geom_line(data = fortify(fit1), aes(x = consonants, y = .fitted), color = "blue")

fit2 <- lmer(vowels ~ consonants + (1|area), data = upsid)
summary(fit2)

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: vowels ~ consonants + (1 | area)
##    Data: upsid
## 
## REML criterion at convergence: 2649.4
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.7378 -0.6176 -0.2127  0.4484  7.2066 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  area     (Intercept)  3.336   1.826   
##  Residual             20.053   4.478   
## Number of obs: 451, groups:  area, 6
## 
## Fixed effects:
##              Estimate Std. Error        df t value Pr(>|t|)    
## (Intercept)   6.07766    0.94133   9.15211   6.456 0.000109 ***
## consonants    0.08212    0.02458 446.67796   3.341 0.000905 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##            (Intr)
## consonants -0.556

upsid %>% 
  ggplot(aes(consonants, vowels))+
  geom_point()+
  labs(x = "number of consonants",
       y = "number of vowels",
       caption = "data from LAPSyD")+
  theme_bw() +
  geom_line(data = fortify(fit2), aes(x = consonants, y = .fitted, color = area))

Если мы предполагаем скоррелированность свободных эффектов:

fit3 <- lmer(vowels ~ consonants + (1+consonants|area), data = upsid)
summary(fit3)

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: vowels ~ consonants + (1 + consonants | area)
##    Data: upsid
## 
## REML criterion at convergence: 2640
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.3289 -0.5774 -0.2207  0.4203  7.4031 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  area     (Intercept)  9.69166 3.1131        
##           consonants   0.01725 0.1313   -0.82
##  Residual             19.33374 4.3970        
## Number of obs: 451, groups:  area, 6
## 
## Fixed effects:
##             Estimate Std. Error      df t value Pr(>|t|)   
## (Intercept)  6.64998    1.45365 4.31016   4.575  0.00857 **
## consonants   0.05186    0.06396 4.01690   0.811  0.46274   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##            (Intr)
## consonants -0.848

upsid %>% 
  ggplot(aes(consonants, vowels))+
  geom_point()+
  labs(x = "number of consonants",
       y = "number of vowels",
       caption = "data from LAPSyD")+
  theme_bw()+
  geom_line(data = fortify(fit3), aes(x = consonants, y = .fitted, color = area))

fit4 <- lmer(vowels ~ consonants + (0+consonants|area), data = upsid)
summary(fit4)

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: vowels ~ consonants + (0 + consonants | area)
##    Data: upsid
## 
## REML criterion at convergence: 2650.9
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.8910 -0.5705 -0.2539  0.4255  7.2020 
## 
## Random effects:
##  Groups   Name       Variance Std.Dev.
##  area     consonants  0.00773 0.08792 
##  Residual            20.10674 4.48405 
## Number of obs: 451, groups:  area, 6
## 
## Fixed effects:
##              Estimate Std. Error        df t value Pr(>|t|)    
## (Intercept)   6.93417    0.61286 437.37534  11.314   <2e-16 ***
## consonants    0.03417    0.04630   8.68312   0.738     0.48    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##            (Intr)
## consonants -0.578

upsid %>% 
  ggplot(aes(consonants, vowels))+
  geom_point()+
  labs(x = "number of consonants",
       y = "number of vowels",
       caption = "data from LAPSyD")+
  theme_bw()+
  geom_line(data = fortify(fit4), aes(x = consonants, y = .fitted, color = area))

Если мы не предполагаем скоррелированность свободных эффектов:

fit5 <- lmer(vowels ~ consonants + (1|area) + (0+consonants|area), data = upsid)
summary(fit5)

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: vowels ~ consonants + (1 | area) + (0 + consonants | area)
##    Data: upsid
## 
## REML criterion at convergence: 2643.3
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.1630 -0.6011 -0.2275  0.4219  7.3599 
## 
## Random effects:
##  Groups   Name        Variance  Std.Dev.
##  area     (Intercept)  4.736252 2.17629 
##  area.1   consonants   0.008715 0.09335 
##  Residual             19.441470 4.40925 
## Number of obs: 451, groups:  area, 6
## 
## Fixed effects:
##             Estimate Std. Error      df t value Pr(>|t|)   
## (Intercept)  6.70367    1.10142 5.13790   6.086  0.00157 **
## consonants   0.04515    0.04942 4.20204   0.914  0.41027   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##            (Intr)
## consonants -0.347

upsid %>% 
  ggplot(aes(consonants, vowels))+
  geom_point()+
  labs(x = "number of consonants",
       y = "number of vowels",
       caption = "data from LAPSyD")+
  theme_bw()+
  geom_line(data = fortify(fit5), aes(x = consonants, y = .fitted, color = area))

fit0 <- lmer(vowels ~ 1 + (1|area) + (0+consonants|area), data = upsid)
summary(fit0)

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: vowels ~ 1 + (1 | area) + (0 + consonants | area)
##    Data: upsid
## 
## REML criterion at convergence: 2640
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.1165 -0.6101 -0.2245  0.4076  7.3753 
## 
## Random effects:
##  Groups   Name        Variance  Std.Dev.
##  area     (Intercept)  4.577349 2.13947 
##  area.1   consonants   0.009654 0.09825 
##  Residual             19.420625 4.40688 
## Number of obs: 451, groups:  area, 6
## 
## Fixed effects:
##             Estimate Std. Error    df t value Pr(>|t|)   
## (Intercept)    7.049      1.025 4.080   6.879  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(fit5, fit4, fit3, fit2, fit1, fit0)

## refitting model(s) with ML (instead of REML)

Ограничения на использования линейной регрессии и смешанные регрессионные модели

Г. Мороз