purrr-map02

map

tidyverse

Published

October 24, 2022

Exercise

Determine the most frequent words in the basic program (Grundsatzprogramm) of the AfD party, using the most recent version.

Solution

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Text from PDF files can be read in with the pdftools package:

library(pdftools)

Using poppler version 22.02.0

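# Path to the local PDF (adjust to your own file location)
d_path <- "~/Literatur/_Div/Politik/afd-grundsatzprogramm-2022.pdf"

# pdf_text() returns one character string per page, so d has one row per page
d <- tibble(text = pdf_text(d_path))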

Next, we create a tidy version of the text and tokenize it into words:

library(tidytext)
d2 <-
  d %>% 
  unnest_tokens(output = word, input = text)

head(d2)

# A tibble: 6 × 1
  word             
  <chr>            
1 programm         
2 für              
3 deutschland      
4 das              
5 grundsatzprogramm
6 der
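
To make the tokenization step easier to follow, here is a minimal sketch on a toy tibble instead of the PDF. The object names `toy` and `toy_words` and the example sentences are made up for illustration; only `unnest_tokens()` with its default word tokenizer comes from the solution above. With the defaults, words are lowercased, punctuation is dropped, and other columns (here `page`) are carried along.

```r
library(tibble)
library(tidytext)

# Toy stand-in for the page-wise PDF text (hypothetical content)
toy <- tibble(
  page = 1:2,
  text = c("Das Programm. Das Ziel!", "In Deutschland")
)

# One row per word; lowercased, punctuation removed, `page` kept
toy_words <- unnest_tokens(toy, output = word, input = text)

toy_words
# word column: das, programm, das, ziel, in, deutschland
```

Keeping a page column like this is optional, but it would allow counting words per page later on.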

Then we count the words:

d2 %>% 
  count(word, sort = TRUE) %>% 
  head(20)

# A tibble: 20 × 2
   word            n
   <chr>       <int>
 1 die          1151
 2 und          1147
 3 der           870
 4 zu            435
 5 für           392
 6 in            392
 7 den           271
 8 von           257
 9 ist           251
10 das           225
11 werden        214
12 eine          211
13 nicht         196
14 ein           191
15 deutschland   190
16 sind          187
17 wir           176
18 afd           171
19 des           169
20 sich          158
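
The top of this list consists almost entirely of function words such as "die" and "und". One possible follow-up, not part of the original solution, is to remove stop words before counting. The sketch below shows the idea on toy data; `toy_tokens`, `stop_de`, and the three hand-picked stop words are assumptions for illustration, not a complete German stop word list.

```r
library(dplyr)
library(tibble)

# Toy token table standing in for d2 (hypothetical content)
toy_tokens <- tibble(
  word = c("die", "afd", "und", "deutschland", "die", "afd", "der", "afd")
)

# Hand-picked illustrative stop words, not a complete list
stop_de <- tibble(word = c("die", "und", "der"))

toy_counts <- toy_tokens %>%
  anti_join(stop_de, by = "word") %>%  # drop rows whose word is a stop word
  count(word, sort = TRUE)             # count the remaining words

toy_counts
# afd appears 3 times, deutschland once
```

For the real data, the same `anti_join()` could be applied to `d2` with a fuller list, for example from `tidytext::get_stopwords(language = "de")`, assuming the stopwords package is installed.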

Categories:

R
map
tidyverse