Некоторые думают, что линейная регрессия решит все их проблемы (по крайней мере те из них, которые связаны с предсказанием какой-то числовой переменной). Это так. Но нужно быть осторожным — у регрессии есть свои ограничения на применение.
library(tidyverse)
library(lme4)
library(lmerTest)
library(lingtypology) # only for linguistic mapping
Dataset and description from Rling package by Natalia Levshina. This data set contains 100 randomly selected words from the English Lexicon Project data (Balota et al. 2007), their lengths, mean reaction times and corpus frequencies.
ldt <- read_csv("https://goo.gl/ToxfU6")
ldt
In this dataset we have number of consonants and vowels in 402 languages collected from UPSID database (http://www.lapsyd.ddl.ish-lyon.cnrs.fr/lapsyd/). There is an variable of the area based on Glottolog (http://glottolog.org/). In this part we will try to make models that predict number of vowels by number of consonants.
upsid <- read_csv("https://raw.githubusercontent.com/agricolamz/2019_data_analysis_for_linguists/master/data/upsid.csv")
## Parsed with column specification:
## cols(
## language = col_character(),
## area = col_character(),
## consonants = col_double(),
## vowels = col_double()
## )
upsid
map.feature(upsid$language,
features = upsid$area,
label = upsid$language,
label.hide = TRUE)
Давайте посмотрим на простой график:
ldt %>%
ggplot(aes(Mean_RT, Freq))+
geom_point()+
theme_bw()
Регрессия на таких данных будет супер неиформативна:
ldt %>%
ggplot(aes(Mean_RT, Freq))+
geom_point()+
geom_smooth(method = "lm")+
theme_bw()
m1 <- summary(lm(Mean_RT~Freq, data = ldt))
m1
##
## Call:
## lm(formula = Mean_RT ~ Freq, data = ldt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -224.93 -85.42 -30.52 81.90 632.66
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 826.998242 15.229783 54.301 < 2e-16 ***
## Freq -0.005595 0.001486 -3.765 0.000284 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 143.9 on 98 degrees of freedom
## Multiple R-squared: 0.1264, Adjusted R-squared: 0.1174
## F-statistic: 14.17 on 1 and 98 DF, p-value: 0.0002843
ldt %>%
ggplot(aes(Mean_RT, log(Freq)))+
geom_point()+
geom_smooth(method = "lm")+
theme_bw()
## Warning: Removed 5 rows containing non-finite values (stat_smooth).
ldt %>%
ggplot(aes(Mean_RT, log(Freq+1)))+
geom_point()+
geom_smooth(method = "lm")+
theme_bw()
m2 <- summary(lm(Mean_RT~log(Freq+1), data = ldt))
m2
##
## Call:
## lm(formula = Mean_RT ~ log(Freq + 1), data = ldt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -242.36 -76.66 -17.49 48.64 630.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1001.60 29.79 33.627 < 2e-16 ***
## log(Freq + 1) -34.03 4.76 -7.149 1.58e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 124.8 on 98 degrees of freedom
## Multiple R-squared: 0.3428, Adjusted R-squared: 0.3361
## F-statistic: 51.11 on 1 and 98 DF, p-value: 1.576e-10
m1$adj.r.squared
## [1] 0.1174405
m2$adj.r.squared
## [1] 0.336078
Отлогорифмировать можно и другую переменную.
ldt %>%
ggplot(aes(log(Mean_RT), log(Freq + 1)))+
geom_point()+
geom_smooth(method = "lm")+
theme_bw()
m3 <- summary(lm(log(Mean_RT)~log(Freq+1), data = ldt))
m1$adj.r.squared
## [1] 0.1174405
m2$adj.r.squared
## [1] 0.336078
m3$adj.r.squared
## [1] 0.3838649
Как интерпретировать полученную регрессию с двумя отлогорифмированными значениями?
В обычной линейной регресии мы узнаем отношения между \(x\) и \(y\): \[y_i = \beta_0+\beta_1\times x_i\]
Как изменится \(y_j\), если мы увеличем \(x_i + 1 = x_j\)? \[y_j = \beta_0+\beta_1\times x_j\]
\[y_j - y_i = \beta_0+\beta_1\times x_j - (\beta_0+\beta_1\times x_i) = \beta_1(x_j - x_i)\]
Т. е. \(y\) увеличится на \(\beta_1\) , если \(x\) увеличится на 1. Что же будет с логарифмированными переменными? Как изменится \(y_j\), если мы увеличем \(x_i + 1 = x_j\)?
\[\log(y_j) - \log(y_i) = \beta_1\times (\log(x_j) - \log(x_i))\]
\[\log\left(\frac{y_j}{y_i}\right) = \beta_1\times \log\left(\frac{x_j}{x_i}\right) = \log\left(\frac{x_j}{x_i}\right) ^ {\beta_1}\]
\[\frac{y_j}{y_i}= \left(\frac{x_j}{x_i}\right) ^ {\beta_1}\]
Т. е. \(y\) увеличится на \(\beta_1\) процентов, если \(x\) увеличится на 1 процент.
Логарифмирование — не единственный вид траснформации:
shiny::runGitHub("agricolamz/tukey_transform")
Линейная регрессия предполагает нормальность распределения остатков. Когда связь не линейна, то остатки тоже будут распределены не нормально.
qqnorm(m1$residuals)
qqline(m1$residuals)
qqnorm(m2$residuals)
qqline(m2$residuals)
qqnorm(m3$residuals)
qqline(m3$residuals)
Распределение остатков непостоянно (т. е. не гомоскидастичны):
ldt %>%
ggplot(aes(Mean_RT, Freq))+
geom_point()+
theme_bw()
Тоже решается преобазованием данных.
Линейная связь между некоторыми предикторами в модели.
car::vif()
Наблюдения должны быть независимы.
upsid %>%
ggplot(aes(consonants, vowels))+
geom_point()+
labs(x = "number of consonants",
y = "number of vowels",
caption = "data from LAPSyD")+
theme_bw()
Обведем наблюдения по каждому спикеру:
upsid %>%
ggplot(aes(consonants, vowels, color = area))+
geom_point()+
labs(x = "number of consonants",
y = "number of vowels",
caption = "data from LAPSyD")+
theme_bw()+
stat_ellipse()
Построим простую регрессию и добавим ее на график:
fit1 <- lm(vowels~consonants, data = upsid)
summary(fit1)
##
## Call:
## lm(formula = vowels ~ consonants, data = upsid)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.148 -2.897 -1.208 2.441 33.203
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.96920 0.56512 10.563 < 2e-16 ***
## consonants 0.11259 0.02317 4.859 1.63e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.67 on 449 degrees of freedom
## Multiple R-squared: 0.04996, Adjusted R-squared: 0.04785
## F-statistic: 23.61 on 1 and 449 DF, p-value: 1.63e-06
upsid %>%
ggplot(aes(consonants, vowels))+
geom_point()+
labs(x = "number of consonants",
y = "number of vowels",
caption = "data from LAPSyD")+
theme_bw()+
geom_line(data = fortify(fit1), aes(x = consonants, y = .fitted), color = "blue")
fit2 <- lmer(vowels ~ consonants + (1|area), data = upsid)
summary(fit2)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: vowels ~ consonants + (1 | area)
## Data: upsid
##
## REML criterion at convergence: 2649.4
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.7378 -0.6176 -0.2127 0.4484 7.2066
##
## Random effects:
## Groups Name Variance Std.Dev.
## area (Intercept) 3.336 1.826
## Residual 20.053 4.478
## Number of obs: 451, groups: area, 6
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 6.07766 0.94133 9.15211 6.456 0.000109 ***
## consonants 0.08212 0.02458 446.67796 3.341 0.000905 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## consonants -0.556
upsid %>%
ggplot(aes(consonants, vowels))+
geom_point()+
labs(x = "number of consonants",
y = "number of vowels",
caption = "data from LAPSyD")+
theme_bw() +
geom_line(data = fortify(fit2), aes(x = consonants, y = .fitted, color = area))
Если мы предполагаем скоррелированность свободных эффектов:
fit3 <- lmer(vowels ~ consonants + (1+consonants|area), data = upsid)
summary(fit3)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: vowels ~ consonants + (1 + consonants | area)
## Data: upsid
##
## REML criterion at convergence: 2640
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.3289 -0.5774 -0.2207 0.4203 7.4031
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## area (Intercept) 9.69166 3.1131
## consonants 0.01725 0.1313 -0.82
## Residual 19.33374 4.3970
## Number of obs: 451, groups: area, 6
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 6.64998 1.45365 4.31016 4.575 0.00857 **
## consonants 0.05186 0.06396 4.01690 0.811 0.46274
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## consonants -0.848
upsid %>%
ggplot(aes(consonants, vowels))+
geom_point()+
labs(x = "number of consonants",
y = "number of vowels",
caption = "data from LAPSyD")+
theme_bw()+
geom_line(data = fortify(fit3), aes(x = consonants, y = .fitted, color = area))
fit4 <- lmer(vowels ~ consonants + (0+consonants|area), data = upsid)
summary(fit4)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: vowels ~ consonants + (0 + consonants | area)
## Data: upsid
##
## REML criterion at convergence: 2650.9
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.8910 -0.5705 -0.2539 0.4255 7.2020
##
## Random effects:
## Groups Name Variance Std.Dev.
## area consonants 0.00773 0.08792
## Residual 20.10674 4.48405
## Number of obs: 451, groups: area, 6
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 6.93417 0.61286 437.37534 11.314 <2e-16 ***
## consonants 0.03417 0.04630 8.68312 0.738 0.48
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## consonants -0.578
upsid %>%
ggplot(aes(consonants, vowels))+
geom_point()+
labs(x = "number of consonants",
y = "number of vowels",
caption = "data from LAPSyD")+
theme_bw()+
geom_line(data = fortify(fit4), aes(x = consonants, y = .fitted, color = area))
Если мы не предполагаем скоррелированность свободных эффектов:
fit5 <- lmer(vowels ~ consonants + (1|area) + (0+consonants|area), data = upsid)
summary(fit5)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: vowels ~ consonants + (1 | area) + (0 + consonants | area)
## Data: upsid
##
## REML criterion at convergence: 2643.3
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.1630 -0.6011 -0.2275 0.4219 7.3599
##
## Random effects:
## Groups Name Variance Std.Dev.
## area (Intercept) 4.736252 2.17629
## area.1 consonants 0.008715 0.09335
## Residual 19.441470 4.40925
## Number of obs: 451, groups: area, 6
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 6.70367 1.10142 5.13790 6.086 0.00157 **
## consonants 0.04515 0.04942 4.20204 0.914 0.41027
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## consonants -0.347
upsid %>%
ggplot(aes(consonants, vowels))+
geom_point()+
labs(x = "number of consonants",
y = "number of vowels",
caption = "data from LAPSyD")+
theme_bw()+
geom_line(data = fortify(fit5), aes(x = consonants, y = .fitted, color = area))
fit0 <- lmer(vowels ~ 1 + (1|area) + (0+consonants|area), data = upsid)
summary(fit0)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: vowels ~ 1 + (1 | area) + (0 + consonants | area)
## Data: upsid
##
## REML criterion at convergence: 2640
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.1165 -0.6101 -0.2245 0.4076 7.3753
##
## Random effects:
## Groups Name Variance Std.Dev.
## area (Intercept) 4.577349 2.13947
## area.1 consonants 0.009654 0.09825
## Residual 19.420625 4.40688
## Number of obs: 451, groups: area, 6
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 7.049 1.025 4.080 6.879 0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(fit5, fit4, fit3, fit2, fit1, fit0)
## refitting model(s) with ML (instead of REML)