Предполагается, что Вы
Мы будем использовать полученные вчера (но несколько мною измененные) данные.
Вот наши данные на графике:
library(ggpol)
abaza %>%
distinct(word, utterance, duration, vowel) %>%
group_by(vowel, utterance) %>%
mutate(mean = mean(duration)) %>%
ggplot(aes(vowel, duration, color = vowel, fill = vowel))+
geom_boxjitter(errorbar.draw = TRUE, show.legend = FALSE)+
geom_point(aes(y = mean), color = "black", show.legend = FALSE)+
facet_wrap(~utterance)+
labs(title = "Абазинские гласные после зубных (один спикер)",
x = "",
y = "длительность (мс)")
Statistics are used much like a drunk uses a lamppost: for support, not illumination. A.E. Housman (commonly attributed to Andrew Lang)
\[t\ статистика = \frac{\bar{x}-\bar{y}}{\sqrt{\frac{\sigma^2_x}{n^2_x}+\frac{\sigma^2_y}{n^2_y}}}\]
\(\sigma\) — стандартное отклонение
Welch Two Sample t-test
data: duration by vowel
t = 15.449, df = 142.85, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
37.56369 48.58632
sample estimates:
mean in group а mean in group ы
135.40635 92.33134
Welch Two Sample t-test
data: duration by vowel
t = 15.048, df = 207.84, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
48.98043 63.74885
sample estimates:
mean in group а mean in group ы
157.2021 100.8375
Welch Two Sample t-test
data: duration by vowel
t = 13.141, df = 213.96, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
27.82926 37.65134
sample estimates:
mean in group а mean in group ы
129.00885 96.26856
Welch Two Sample t-test
data: duration by vowel
t = 12.964, df = 199.45, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
34.90737 47.43138
sample estimates:
mean in group а mean in group ы
122.34353 81.17416
library(ggpol)
abaza %>%
distinct(word, utterance, duration, vowel) %>%
group_by(vowel, utterance) %>%
mutate(mean = mean(duration)) %>%
ggplot(aes(vowel, duration, color = vowel, fill = vowel))+
geom_boxjitter(errorbar.draw = TRUE, show.legend = FALSE)+
geom_point(aes(y = mean), color = "black", show.legend = FALSE)+
facet_wrap(~utterance)+
labs(title = "Абазинские гласные после зубных (один спикер)",
x = "",
y = "длительность (мс)")
Бывают еще…
считается как соотношение внутри- и межгрупповой дисперсии.
Df Sum Sq Mean Sq F value Pr(>F)
utterance 3 18398 6133 13.44 2.47e-08 ***
Residuals 363 165693 456
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Df Sum Sq Mean Sq F value Pr(>F)
utterance 3 98526 32842 52.33 <2e-16 ***
Residuals 540 338891 628
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(ggpol)
abaza %>%
distinct(word, utterance, duration, vowel) %>%
group_by(vowel, utterance) %>%
mutate(mean = mean(duration)) %>%
ggplot(aes(utterance, duration, color = utterance, fill = utterance))+
geom_boxjitter(errorbar.draw = TRUE, show.legend = FALSE)+
geom_point(aes(y = mean), color = "black", show.legend = FALSE)+
facet_wrap(~vowel)+
labs(title = "Абазинские гласные после зубных (один спикер)",
x = "",
y = "длительность (мс)")
Может быть тип гласного все же не определяет длительность? Может быть чем более открытый гласный, тем он длинее? Давайте попробуем предсказать длительность на основании F1.
\[y_i = \beta_0 + \beta_1\times x_i + \epsilon_i\]
abaza %>%
filter(!is.na(f1),
utterance == 1) %>%
group_by(word, duration, vowel) %>%
summarise(f1 = mean(f1)) ->
abaza_mean_f1
abaza_mean_f1 %>%
ggplot(aes(f1, duration, label = vowel))+
geom_text()+
geom_smooth(se = FALSE, method = "lm")+
labs(title = "Абазинские гласные после зубных (один спикер)",
y = "длительность (мс)",
x = "F1 (Hz)")
Call:
lm(formula = duration ~ f1, data = abaza_mean_f1)
Residuals:
Min 1Q Median 3Q Max
-38.325 -13.360 -3.458 13.425 48.000
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33.15647 21.02879 1.577 0.13227
f1 0.10462 0.02761 3.789 0.00134 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 23.68 on 18 degrees of freedom
Multiple R-squared: 0.4437, Adjusted R-squared: 0.4128
F-statistic: 14.36 on 1 and 18 DF, p-value: 0.001343
\[\mbox{duration}_i = \beta_0 + \beta_1\times \mbox{F1}_i + \epsilon_i = \] \[\mbox{duration}_i = 41.30346 + 0.09379 \times \mbox{F1}_i + \epsilon_i \]
\[y_i = \beta_0 + \beta_1\times x_i + \epsilon_i\]
Однако x имеет два значения: 0, если гласный a, и 1, если гласный ы
abaza %>%
filter(utterance == 1) %>%
distinct(word, utterance, duration, vowel) %>%
group_by(vowel, utterance) %>%
mutate(mean = mean(duration)) %>%
ggplot(aes(vowel, duration, color = vowel, fill = vowel))+
geom_boxjitter(errorbar.draw = TRUE, show.legend = FALSE)+
geom_point(aes(y = mean), color = "black", show.legend = FALSE)+
labs(title = "Абазинские гласные после зубных (один спикер)",
x = "",
y = "длительность (мс)")
Call:
lm(formula = duration ~ vowel, data = abaza[abaza$utterance ==
1, ])
Residuals:
Min 1Q Median 3Q Max
-33.934 -13.674 -2.703 11.405 39.799
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 135.406 1.624 83.35 <2e-16 ***
vowelы -43.075 2.577 -16.71 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 19.08 on 227 degrees of freedom
Multiple R-squared: 0.5517, Adjusted R-squared: 0.5498
F-statistic: 279.4 on 1 and 227 DF, p-value: < 2.2e-16
\[\mbox{duration}_i = \beta_0 + \beta_1\times \mbox{F1}_i + \epsilon_i = \] \[\mbox{duration}_i = 135.406 - 43.075 \times \mbox{F1}_i + \epsilon_i \]
\[y_i = \beta_0 + \beta_1\times x_i + ... \beta_j\times z_i + \epsilon_i\]
abaza %>%
filter(!is.na(f1),
utterance == 1) %>%
group_by(word, duration, vowel) %>%
summarise(f1 = mean(f1)) ->
abaza_mean_f1
summary(lm(duration ~ f1+vowel, data = abaza_mean_f1))
Call:
lm(formula = duration ~ f1 + vowel, data = abaza_mean_f1)
Residuals:
Min 1Q Median 3Q Max
-21.729 -12.376 -7.074 11.372 47.513
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 249.73541 66.40356 3.761 0.00156 **
f1 -0.12614 0.07191 -1.754 0.09741 .
vowelы -92.96734 27.58193 -3.371 0.00363 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 18.87 on 17 degrees of freedom
Multiple R-squared: 0.6666, Adjusted R-squared: 0.6273
F-statistic: 16.99 on 2 and 17 DF, p-value: 8.824e-05
\[\mbox{duration} = \beta_0 + \beta_1\times \mbox{f1} + \beta_2\times \mbox{vowel_ы} + \epsilon_i = \] \[\mbox{duration} = 249.73541 + -0.12614 \times \mbox{f1} - 92.96734\times \mbox{vowel_ы} + \epsilon_i \]
Two Sample t-test
data: duration by vowel
t = 16.715, df = 227, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
37.99716 48.15285
sample estimates:
mean in group а mean in group ы
135.40635 92.33134
Df Sum Sq Mean Sq F value Pr(>F)
vowel 1 101750 101750 279.4 <2e-16 ***
Residuals 227 82667 364
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Одинаковые p-value.
Two Sample t-test
data: duration by vowel
t = 16.715, df = 227, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
37.99716 48.15285
sample estimates:
mean in group а mean in group ы
135.40635 92.33134
Call:
lm(formula = duration ~ vowel, data = abaza[abaza$utterance ==
1, ])
Residuals:
Min 1Q Median 3Q Max
-33.934 -13.674 -2.703 11.405 39.799
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 135.406 1.624 83.35 <2e-16 ***
vowelы -43.075 2.577 -16.71 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 19.08 on 227 degrees of freedom
Multiple R-squared: 0.5517, Adjusted R-squared: 0.5498
F-statistic: 279.4 on 1 and 227 DF, p-value: < 2.2e-16
Одинаковые t-статистики и p-value.
Df Sum Sq Mean Sq F value Pr(>F)
vowel 1 101750 101750 279.4 <2e-16 ***
Residuals 227 82667 364
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Call:
lm(formula = duration ~ vowel, data = abaza[abaza$utterance ==
1, ])
Residuals:
Min 1Q Median 3Q Max
-33.934 -13.674 -2.703 11.405 39.799
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 135.406 1.624 83.35 <2e-16 ***
vowelы -43.075 2.577 -16.71 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 19.08 on 227 degrees of freedom
Multiple R-squared: 0.5517, Adjusted R-squared: 0.5498
F-statistic: 279.4 on 1 and 227 DF, p-value: < 2.2e-16
Одинаковые F-статистики и p-value.
… а еще бывает Байесовская статистика…