purrr-map03

map

tidyverse

Published

October 24, 2022

Exercise

Import the basic program (Grundsatzprogramm) of the AfD party, in its most recent version. Tokenize it by sentence. Then remove all numbers. Then count the number of words per sentence and report common descriptive statistics.

Solution

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Text from PDF files can be read with the pdftools package:

library(pdftools)

Using poppler version 22.02.0

d_path <- "~/Literatur/_Div/Politik/afd-grundsatzprogramm-2022.pdf"

d <- tibble(text = pdf_text(d_path))
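`pdf_text()` returns a character vector with one element per page, so `d` has one row per page. A minimal sketch of that structure, using two made-up strings in place of the real PDF (the `pages` vector and the `page` column are illustrative assumptions, not part of the original solution):

```r
library(tibble)

# Hypothetical stand-in for the output of pdf_text(): one string per page
pages <- c("Programm für Deutschland.", "Inhalt 2 Präambel 6")

# Keep the page number next to the text so sentences can be traced back later
d_toy <- tibble(page = seq_along(pages), text = pages)
d_toy
```

Carrying a page column along is optional; it can help when locating odd "sentences" in the source document.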

Then we create a tidy version and tokenize by sentence:

library(tidytext)
d2 <-
  d %>% 
  unnest_sentences(output = word, input = text)

head(d2)

# A tibble: 6 × 1
  word                                                                          
  <chr>                                                                         
1 programm für deutschland.                                                     
2 das grundsatzprogramm der alternative für deutschland.                        
3 2   programm für deutschland | inhalt         präambel                       …
4 familien stärken        43             und parteiferne rechnungshöfe         …
5 3   programm für deutschland | inhalt         7 | kultur, sprache und identit…
6 förder- und                         10.10.3 deutsche literatur im inland digi…
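To see what `unnest_sentences()` does in isolation, here is a small sketch on made-up input (the tibble `toy` and the output column name `sentence` are assumptions for illustration). The function lowercases the text by default, which is why the output above is all lowercase.

```r
library(tibble)
library(tidytext)

# Two hypothetical "pages" of text
toy <- tibble(text = c("Das ist ein Satz. Hier ist noch einer!",
                       "Seite zwei beginnt hier."))

# One row per sentence; text is lowercased by default (to_lower = TRUE)
toy_sentences <- unnest_sentences(toy, output = sentence, input = text)
toy_sentences
```

On real PDF text the splitter has no notion of layout, so table-of-contents pages can end up as very long "sentences", as rows 3 to 6 above suggest.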

Then we remove the numbers:

d3 <- 
  d2 %>% 
  mutate(word = str_remove_all(word, pattern = "[:digit:]+"))

Let's check whether it worked:

d2$word[10]

[1] "weniger subventionen    88      13.7 fischerei, forst und jagd: im einklang mit der natur     88      13.8 flächenkonkurrenz:           nicht zu lasten der land- und forstwirtschaft            88"

d3$word[10]

[1] "weniger subventionen          . fischerei, forst und jagd: im einklang mit der natur           . flächenkonkurrenz:           nicht zu lasten der land- und forstwirtschaft            "

OK: the digits are gone, although the periods from section numbers such as 13.7 remain.
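The effect of the pattern can be reproduced on a shortened string. The sketch below also shows one possible alternative, which is an assumption and not part of the original solution: a pattern that removes dotted section numbers as a whole, followed by `str_squish()` to collapse the leftover whitespace.

```r
library(stringr)

x <- "weniger subventionen 88 13.7 fischerei"

# Pattern from the solution: removes digits only, the period of "13.7" stays
digits_only <- str_remove_all(x, "[:digit:]+")

# Alternative (assumption): also remove dotted section numbers such as "13.7",
# then collapse the leftover whitespace
numbers_removed <- str_squish(str_remove_all(x, "\\d+(\\.\\d+)*"))

digits_only
numbers_removed
```

Stray periods are not matched by `\\w+`, so they do not change a word count.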

Then we count the words per sentence:

d4 <- 
  d3 %>% 
  # reframe() may return one row per sentence; summarise() is deprecated for this since dplyr 1.1.0
  reframe(word_count_per_sentence = str_count(word, "\\w+"))

head(d4)

# A tibble: 6 × 1
  word_count_per_sentence
                    <int>
1                       3
2                       6
3                     196
4                      40
5                     254
6                      15
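`str_count()` with the pattern `\\w+` counts runs of word characters, so hyphens and punctuation separate or end words and an empty string yields 0. A small sketch on three strings (the first is taken from the output above, the vector `s` itself is illustrative), together with an equivalent `purrr::map_int()` call:

```r
library(stringr)
library(purrr)

s <- c("programm für deutschland.", "land- und forstwirtschaft", "")

# Vectorised: one count per element
counts <- str_count(s, "\\w+")

# Same result with an explicit map over the sentences
counts_map <- map_int(s, function(sentence) str_count(sentence, "\\w+"))

counts
```

Because `str_count()` is already vectorised, the `map_int()` variant is mainly useful when the per-sentence step becomes more complex than a single function call.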

Visualization:

d4 %>% 
  ggplot(aes(x = word_count_per_sentence)) +
  geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
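The message suggests choosing the bin width explicitly. A sketch with `binwidth = 5` (an arbitrary choice) on the six counts printed by `head(d4)`; with the real data one would pass `d4` instead of the stand-in data frame `toy_counts`:

```r
library(ggplot2)

# First six counts from head(d4), used here only as stand-in data
toy_counts <- data.frame(word_count_per_sentence = c(3, 6, 196, 40, 254, 15))

p <- ggplot(toy_counts, aes(x = word_count_per_sentence)) +
  geom_histogram(binwidth = 5) +  # bins of 5 words; an arbitrary choice
  labs(x = "Words per sentence", y = "Number of sentences")
p
```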

library(easystats)

# Attaching packages: easystats 0.6.0 (red = needs update)
✔ bayestestR  0.13.1   ✔ correlation 0.8.4 
✔ datawizard  0.9.0    ✔ effectsize  0.8.6 
✔ insight     0.19.6   ✔ modelbased  0.8.6 
✔ performance 0.10.8   ✔ parameters  0.21.3
✔ report      0.5.7    ✖ see         0.8.0 

Restart the R-Session and update packages in red with `easystats::easystats_update()`.

describe_distribution(d4)

Variable                |  Mean |    SD | IQR |          Range | Skewness | Kurtosis |    n | n_Missing
-------------------------------------------------------------------------------------------------------
word_count_per_sentence | 21.86 | 17.24 |  19 | [0.00, 254.00] |     3.84 |    37.52 | 1208 |         0
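Several of these statistics can also be computed by mapping over a list of summary functions with purrr. The sketch below uses only the six counts shown by `head(d4)`, so its numbers will differ from the table above; the names `counts`, `stat_funs` and `stats` are illustrative.

```r
library(purrr)

# First six counts from head(d4); not the full data set
counts <- c(3, 6, 196, 40, 254, 15)

stat_funs <- list(mean = mean, sd = sd, iqr = IQR, min = min, max = max)

# Apply every function to the same vector; names carry over to the result
stats <- map_dbl(stat_funs, function(f) f(counts))
round(stats, 2)
```

With the full data, `counts` would be `d4$word_count_per_sentence`.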

Categories:

R
map
tidyverse