6 Text manipulations

I will still use the tidyverse a lot:

library(tidyverse)

6.1 read_lines()

If you want to read a text into R, you can use the read_lines() function:

tajemnica_baskerville <- read_lines("https://raw.githubusercontent.com/agricolamz/2020.02_Naumburg_R/master/data/tajemnica_baskerville.txt")

As a result you will get a character vector. It is easy to convert it to a dataframe:

tibble(text = tajemnica_baskerville)
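
If you also want to keep track of the original line numbers, which is handy for filtering or joining later, here is a small sketch (the line column is my own addition):

tibble(text = tajemnica_baskerville) %>% 
  mutate(line = row_number()) # add a running line number per row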

6.2 gutenbergr

The gutenbergr package is an API wrapper for Project Gutenberg, a long-running digital library of over 60,000 free eBooks.

library(gutenbergr)

The most important part of this package is the gutenberg_metadata dataset, a catalogue of everything in the Gutenberg library:

str(gutenberg_metadata)
## Classes 'tbl_df', 'tbl' and 'data.frame':    51997 obs. of  8 variables:
##  $ gutenberg_id       : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ title              : chr  NA "The Declaration of Independence of the United States of America" "The United States Bill of Rights\r\nThe Ten Original Amendments to the Constitution of the United States" "John F. Kennedy's Inaugural Address" ...
##  $ author             : chr  NA "Jefferson, Thomas" "United States" "Kennedy, John F. (John Fitzgerald)" ...
##  $ gutenberg_author_id: int  NA 1638 1 1666 3 1 4 NA 3 3 ...
##  $ language           : chr  "en" "en" "en" "en" ...
##  $ gutenberg_bookshelf: chr  NA "United States Law/American Revolutionary War/Politics" "American Revolutionary War/Politics/United States Law" NA ...
##  $ rights             : chr  "Public domain in the USA." "Public domain in the USA." "Public domain in the USA." "Public domain in the USA." ...
##  $ has_text           : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  - attr(*, "date_updated")= Date, format: "2016-05-05"

How many languages are represented in the Gutenberg library?

gutenberg_metadata %>% 
  count(language, sort = TRUE)
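
The count above shows how many texts there are per language; to get the bare number of languages, one option is:

gutenberg_metadata %>% 
  distinct(language) %>% # one row per language code
  nrow()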

How many authors are available?

gutenberg_metadata %>% 
  count(author, sort = TRUE)

How many Polish texts are available, and who are their authors?

gutenberg_metadata %>% 
  filter(language == "pl") %>% 
  count(author, sort = TRUE)

Whose texts are the most numerous in the German part of the Gutenberg library?


Let’s have a look at Mickiewicz’s texts in the Polish part of the Gutenberg library:

gutenberg_metadata %>% 
  filter(author == "Mickiewicz, Adam",
         language == "pl")
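
If you only need the ids of those texts for downloading, one option (a small sketch) is pull():

gutenberg_metadata %>% 
  filter(author == "Mickiewicz, Adam",
         language == "pl") %>% 
  pull(gutenberg_id) # extract the id column as a plain vector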

Let’s download Adam Mickiewicz’s sonnets:

text <- gutenberg_download(27081)
text

It is possible to pass multiple ids at once. Let’s also download some poems by A. Mickiewicz (1798–1855), J. Kochanowski (1530–1584), Z. Krasiński (1812–1859), and A. Oppman (1867–1931):

texts <- gutenberg_download(c(27081, 27871, 28009, 27208))
texts
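
The downloaded table contains only the gutenberg_id and text columns, so it can be convenient to join titles and authors back on from the catalogue; a sketch:

texts %>% 
  left_join(gutenberg_metadata %>% 
              select(gutenberg_id, title, author),
            by = "gutenberg_id")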

Be aware:

  • texts can include material from the physical book: an introduction or afterword written by other people, publication details, etc.;
  • texts can be stored with the wrong encoding;
  • texts can be stored with normalised orthography (e.g. Kochanowski; compare rows 99 and 100 in the output of the code below);
  • there are a lot of empty lines (see the sketch after this list);
  • and probably a lot of other problems.

texts %>% 
  filter(gutenberg_id == 27208)
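
For instance, the empty lines can be dropped with a simple filter (a minimal sketch):

texts %>% 
  filter(text != "") # keep only non-empty lines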

I annotated those texts:

texts <- read_csv("https://raw.githubusercontent.com/agricolamz/2020.02_Naumburg_R/master/data/mickiewicz_kochanowski_krasinski_oppman.csv")

Now it is possible to remove some unimportant lines:

texts %>% 
  filter(title != "remove") ->
  texts
texts

Calculate how many rows per author we have in our dataset. Who has the most?


6.3 tidytext

The tidytext package (Silge and Robinson 2017) (the book is available online) allows you to work with texts in the tidy data framework, which makes it easier to manipulate, summarise, and visualise the characteristics of texts and to integrate natural language processing tools (sentiment analysis, the tf-idf metric, n-gram analysis, topic modelling, etc.).

library(tidytext)

texts %>% 
  unnest_tokens(output = "word", input = text)

Note that unnest_tokens() also lowercases the tokens and strips punctuation by default. Now we can count the most frequent words per author and visualise the top ten:

texts %>% 
  unnest_tokens(output = "word", input = text) %>% 
  group_by(author) %>% 
  count(word, sort = TRUE) %>% 
  top_n(10) %>% 
  ggplot(aes(word, n))+
  geom_col()+
  coord_flip()+
  facet_wrap(~author, scales = "free")

As you can see, the sorting is off: factor levels are ordered across the whole dataset, not within each facet. Sorting within facets is possible with the reorder_within() function:

texts %>% 
  unnest_tokens(output = "word", input = text) %>% 
  group_by(author) %>% 
  count(word, sort = TRUE) %>% 
  top_n(10) %>% 
  ggplot(aes(reorder_within(x = word, by = n, within = author), n))+
  geom_col()+
  coord_flip()+
  facet_wrap(~author, scales = "free")

reorder_within() works by appending the grouping variable to each label behind the scenes; in order to remove this suffix from the axis you also need to add a scale_x_reordered() layer to your ggplot:

texts %>% 
  unnest_tokens(output = "word", input = text) %>% 
  group_by(author) %>% 
  count(word, sort = TRUE) %>% 
  top_n(10) %>% 
  ggplot(aes(reorder_within(word, n, author), n))+
  geom_col()+
  scale_x_reordered()+
  coord_flip()+
  facet_wrap(~author, scales = "free")

Often in text analysis it is useful to remove stopwords. Stopwords are frequent words that mostly carry grammatical rather than lexical information. I will use a Polish stopword list from this repository, but you can use any other list or modify the existing one.

stopwords <- read_lines("https://raw.githubusercontent.com/stopwords-iso/stopwords-pl/master/stopwords-pl.txt")
stopwords
##   [1] "a"            "aby"          "ach"          "acz"          "aczkolwiek"  
##   [6] "aj"           "albo"         "ale"          "ależ"         "ani"         
##  [11] "aż"           "bardziej"     "bardzo"       "bez"          "bo"          
##  [16] "bowiem"       "by"           "byli"         "bym"          "bynajmniej"  
##  [21] "być"          "był"          "była"         "było"         "były"        
##  [26] "będzie"       "będą"         "cali"         "cała"         "cały"        
##  [31] "chce"         "choć"         "ci"           "ciebie"       "cię"         
##  [36] "co"           "cokolwiek"    "coraz"        "coś"          "czasami"     
##  [41] "czasem"       "czemu"        "czy"          "czyli"        "często"      
##  [46] "daleko"       "dla"          "dlaczego"     "dlatego"      "do"          
##  [51] "dobrze"       "dokąd"        "dość"         "dr"           "dużo"        
##  [56] "dwa"          "dwaj"         "dwie"         "dwoje"        "dzisiaj"     
##  [61] "dziś"         "gdy"          "gdyby"        "gdyż"         "gdzie"       
##  [66] "gdziekolwiek" "gdzieś"       "go"           "godz"         "hab"         
##  [71] "i"            "ich"          "ii"           "iii"          "ile"         
##  [76] "im"           "inna"         "inne"         "inny"         "innych"      
##  [81] "inż"          "iv"           "ix"           "iż"           "ja"          
##  [86] "jak"          "jakaś"        "jakby"        "jaki"         "jakichś"     
##  [91] "jakie"        "jakiś"        "jakiż"        "jakkolwiek"   "jako"        
##  [96] "jakoś"        "je"           "jeden"        "jedna"        "jednak"      
## [101] "jednakże"     "jedno"        "jednym"       "jedynie"      "jego"        
## [106] "jej"          "jemu"         "jest"         "jestem"       "jeszcze"     
## [111] "jeśli"        "jeżeli"       "już"          "ją"           "każdy"       
## [116] "kiedy"        "kierunku"     "kilka"        "kilku"        "kimś"        
## [121] "kto"          "ktokolwiek"   "ktoś"         "która"        "które"       
## [126] "którego"      "której"       "który"        "których"      "którym"      
## [131] "którzy"       "ku"           "lat"          "lecz"         "lub"         
## [136] "ma"           "mają"         "mam"          "mamy"         "mało"        
## [141] "mgr"          "mi"           "miał"         "mimo"         "między"      
## [146] "mnie"         "mną"          "mogą"         "moi"          "moim"        
## [151] "moja"         "moje"         "może"         "możliwe"      "można"       
## [156] "mu"           "musi"         "my"           "mój"          "na"          
## [161] "nad"          "nam"          "nami"         "nas"          "nasi"        
## [166] "nasz"         "nasza"        "nasze"        "naszego"      "naszych"     
## [171] "natomiast"    "natychmiast"  "nawet"        "nic"          "nich"        
## [176] "nie"          "niech"        "niego"        "niej"         "niemu"       
## [181] "nigdy"        "nim"          "nimi"         "nią"          "niż"         
## [186] "no"           "nowe"         "np"           "nr"           "o"           
## [191] "o.o."         "obok"         "od"           "ok"           "około"       
## [196] "on"           "ona"          "one"          "oni"          "ono"         
## [201] "oraz"         "oto"          "owszem"       "pan"          "pana"        
## [206] "pani"         "pl"           "po"           "pod"          "podczas"     
## [211] "pomimo"       "ponad"        "ponieważ"     "powinien"     "powinna"     
## [216] "powinni"      "powinno"      "poza"         "prawie"       "prof"        
## [221] "przecież"     "przed"        "przede"       "przedtem"     "przez"       
## [226] "przy"         "raz"          "razie"        "roku"         "również"     
## [231] "sam"          "sama"         "się"          "skąd"         "sobie"       
## [236] "sobą"         "sposób"       "swoje"        "są"           "ta"          
## [241] "tak"          "taka"         "taki"         "takich"       "takie"       
## [246] "także"        "tam"          "te"           "tego"         "tej"         
## [251] "tel"          "temu"         "ten"          "teraz"        "też"         
## [256] "to"           "tobie"        "tobą"         "toteż"        "trzeba"      
## [261] "tu"           "tutaj"        "twoi"         "twoim"        "twoja"       
## [266] "twoje"        "twym"         "twój"         "ty"           "tych"        
## [271] "tylko"        "tym"          "tys"          "tzw"          "tę"          
## [276] "u"            "ul"           "vi"           "vii"          "viii"        
## [281] "vol"          "w"            "wam"          "wami"         "was"         
## [286] "wasi"         "wasz"         "wasza"        "wasze"        "we"          
## [291] "według"       "wie"          "wiele"        "wielu"        "więc"        
## [296] "więcej"       "wszyscy"      "wszystkich"   "wszystkie"    "wszystkim"   
## [301] "wszystko"     "wtedy"        "www"          "wy"           "właśnie"     
## [306] "wśród"        "xi"           "xii"          "xiii"         "xiv"         
## [311] "xv"           "z"            "za"           "zapewne"      "zawsze"      
## [316] "zaś"          "ze"           "zeznowu"      "znowu"        "znów"        
## [321] "został"       "zł"           "żaden"        "żadna"        "żadne"       
## [326] "żadnych"      "że"           "żeby"

So now we are ready to remove stopwords using the anti_join() function:

texts %>% 
  unnest_tokens(output = "word", input = text) %>% 
  anti_join(tibble(word = stopwords)) %>% # here is the stopwords removal
  group_by(author) %>% 
  count(word, sort = TRUE) %>% 
  top_n(10) %>% 
  ggplot(aes(reorder_within(word, n, author), n))+
  geom_col()+
  scale_x_reordered()+
  coord_flip()+
  facet_wrap(~author, scales = "free")

It is also possible to analyse bigrams:

texts %>% 
  unnest_tokens(output = "bigrams", input = text, token = "ngrams", n = 2) %>% 
  # split each bigram into two separate columns
  separate(bigrams, into = c("word_1", "word_2"), sep = " ") %>%
  # filter out those that have stopwords
  anti_join(tibble(word_1 = stopwords)) %>% 
  anti_join(tibble(word_2 = stopwords)) %>% 
  # merge separate columns into one
  mutate(bigrams = str_c(word_1, word_2, sep = " ")) %>% 
  group_by(author) %>% 
  count(bigrams) %>% 
  top_n(4) %>% 
  ggplot(aes(reorder_within(bigrams, n, author), n))+
  geom_col()+
  scale_x_reordered()+
  coord_flip()+
  facet_wrap(~author, scales = "free")
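
One caveat: lines with fewer than two tokens yield NA bigrams in unnest_tokens(), so depending on the corpus it may be worth dropping them first (an optional extra step, a sketch):

texts %>% 
  unnest_tokens(output = "bigrams", input = text, token = "ngrams", n = 2) %>% 
  filter(!is.na(bigrams)) # lines shorter than the n-gram window produce NA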

Since our corpus for each author is really small, we can’t see much (e.g. for Mickiewicz there are no repeated bigrams at all). With longer texts (e.g. long novels), the most frequent bigrams become more informative. Let’s analyse “Tajemnica Baskerville’ów”:

tajemnica <- gutenberg_download(34079)

tajemnica %>% 
  unnest_tokens(output = "bigrams", input = text, token = "ngrams", n = 2) %>% 
  # split each bigram into two separate columns
  separate(bigrams, into = c("word_1", "word_2"), sep = " ") %>%
  # filter out those that have stopwords
  anti_join(tibble(word_1 = stopwords)) %>% 
  anti_join(tibble(word_2 = stopwords)) %>% 
  # merge separate columns into one
  mutate(bigrams = str_c(word_1, word_2, sep = " ")) %>% 
  count(bigrams, sort = TRUE) %>% 
  top_n(20) %>% 
  ggplot(aes(fct_reorder(bigrams, n), n))+
  geom_col()+
  coord_flip()

Analyse “Pan Tadeusz Czyli Ostatni Zajazd na Litwie” by A. Mickiewicz. What is the most frequent bigram in this text (after removing stopwords)?


6.4 udpipe

The udpipe package gives you lemmatisation, morphological analysis, and syntactic parsing for multiple languages. A tutorial and a list of available languages can be found here.

Models are large, so they take a while to download:

library(udpipe)
udpipe_download_model(language = "polish-pdb")
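
udpipe_download_model() returns a dataframe whose file_model column holds the path to the downloaded file, so you can avoid hard-coding the file name (a small sketch; pl_model is my own name):

pl_model <- udpipe_download_model(language = "polish-pdb")
udpipe_load_model(pl_model$file_model) # file_model stores the path to the .udpipe file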

Texts for the udpipe analyser should be stored in a dataframe with variables named text and doc_id:

texts %>% 
  mutate(doc_id = str_c(author, "_", title)) ->
  texts_for_udpipe
texts_for_udpipe

After you have downloaded a model and created a correctly structured dataframe, it is possible to analyse our texts:

texts_parsed <- udpipe(x = texts_for_udpipe, 
                       object = udpipe_load_model("polish-pdb-ud-2.4-190531.udpipe"))
texts_parsed
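
The result has one row per token; among its columns are token, lemma, and upos (universal part of speech). A quick look (a sketch):

texts_parsed %>% 
  select(doc_id, token, lemma, upos) %>% 
  head()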

What is the most frequent part of speech (the upos variable) in our corpus according to the model?


6.5 stylo

The stylo package (Eder, Rybicki, and Kestemont 2016) is a package for computational stylistics, authorship attribution, etc.

First of all, it is possible to use raw word frequencies as features for clustering:

library(stylo)

texts %>% 
  mutate(doc_id = str_c(author, title, sep = "_")) %>% 
  unnest_tokens(output = "word", input = text) %>% 
  # count words per document
  count(doc_id, word) %>% 
  # relative frequencies within each document
  group_by(doc_id) %>% 
  mutate(ratio = n/sum(n)) %>% 
  ungroup() %>% 
  # drop raw counts so that each word ends up in a single row
  select(-n) %>% 
  # one column per document, one row per word; absent words get 0
  pivot_wider(names_from = doc_id, values_from = ratio, values_fill = list(ratio = 0)) %>% 
  select(-word) ->
  for_stylo
for_stylo

Then you can run a cluster analysis:

stylo(parsed.corpus = for_stylo, gui = FALSE, analysis.type = "CA")

You can also run it with the argument gui = TRUE; then you will see a graphical user interface.

It is also possible to use the lemmatised representation obtained via udpipe:

texts_parsed %>% 
  # count lemmas per document
  count(doc_id, lemma) %>% 
  group_by(doc_id) %>% 
  mutate(ratio = n/sum(n)) %>% 
  ungroup() %>% 
  select(-n) %>% 
  # one column per document, one row per lemma; absent lemmas get 0
  pivot_wider(names_from = doc_id, values_from = ratio, values_fill = list(ratio = 0)) %>% 
  select(-lemma) ->
  for_stylo_2

stylo(parsed.corpus = for_stylo_2, gui = FALSE, analysis.type = "CA")

The stylo package also provides an authorship verification classifier: the imposters() function, an implementation of the General Imposters method. First we relabel some texts as test samples:

texts %>% 
  # relabel selected texts as test samples
  mutate(author = case_when(
    title %in% c("Sonety I.", "Sonety VI.", "Sonety XX.",
                 "Sonety krymskie XV.", "Sonety krymskie III.") ~ "test_mickiewicz",
    title == "THREN XIX." ~ "test_kochanowski",
    title == "Niewymarzona, a cudowna." ~ "test_krasinski",
    title %in% c("OWCZAREK.", "EMIGRANT.") ~ "test_oppman",
    TRUE ~ author)) %>% 
  unnest_tokens(output = "word", input = text) %>% 
  count(author, word, sort = TRUE) %>% 
  # keep only words that occur more than once
  filter(n > 1) %>% 
  # relative frequencies per author
  group_by(author) %>% 
  mutate(ratio = n/sum(n)) %>% 
  select(-n) %>% 
  # one row per author, one column per word
  pivot_wider(names_from = word, values_from = ratio, values_fill = list(ratio = 0)) %>% 
  as.data.frame() -> # the stylo package doesn't work with tibbles
  for_imposter
for_imposter
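
The calls below pick the test rows by position (rows 5–8), which depends on the sort order of the dataframe, so it is worth checking which author each row corresponds to before indexing:

for_imposter$author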

Let’s test Krasiński:

imposters(reference.set = for_imposter[-c(5:8), -c(1:2)],
# for some reason this function demands minimum 2 rows in the dataframe (contra docs)
          test = for_imposter[c(5, 5), -c(1:2)],
          classes.reference.set = for_imposter[-c(5:8), 1])
##   Krasiński  Mickiewicz Kochanowski      Oppman 
##        0.00        0.00        0.35        0.92

Let’s test Kochanowski:

imposters(reference.set = for_imposter[-c(5:8), -c(1:2)],
# for some reason this function demands minimum 2 rows in the dataframe (contra docs)
          test = for_imposter[c(6, 6), -c(1:2)],
          classes.reference.set = for_imposter[-c(5:8), 1])
##   Krasiński  Mickiewicz Kochanowski      Oppman 
##        0.01        0.01        0.35        1.00

Let’s test Mickiewicz:

imposters(reference.set = for_imposter[-c(5:8), -c(1:2)],
# for some reason this function demands minimum 2 rows in the dataframe (contra docs) 
          test = for_imposter[c(7, 7), -c(1:2)],
          classes.reference.set = for_imposter[-c(5:8), 1])
##   Krasiński  Mickiewicz Kochanowski      Oppman 
##        0.05        0.00        0.26        1.00

Let’s test Oppman:

imposters(reference.set = for_imposter[-c(5:8), -c(1:2)],
# for some reason this function demands minimum 2 rows in the dataframe (contra docs)
          test = for_imposter[c(8, 8), -c(1:2)],
          classes.reference.set = for_imposter[-c(5:8), 1])
##   Krasiński  Mickiewicz Kochanowski      Oppman 
##        0.01        0.02        0.34        1.00

As you can see, this algorithm did not work on our verse dataset, but in the docs (?imposters) you can find an example analysing the novel “The Cuckoo’s Calling” by the mysterious Robert Galbraith, who turned out to be J. K. Rowling.

References

Eder, Maciej, Jan Rybicki, and Mike Kestemont. 2016. “Stylometry with R: A Package for Computational Text Analysis.” R Journal 8 (1): 107–21. https://journal.r-project.org/archive/2016/RJ-2016-007/index.html.

Silge, Julia, and David Robinson. 2017. Text Mining with R: A Tidy Approach. O’Reilly Media.