germeval01-textfeatures

2023
textmining
datawrangling
germeval
string
Published

November 16, 2023

Task

Extract typical text features from a text.

Use the package textfeatures.

Use the GermEval-2018 data.

The data are licensed under CC-BY-4.0. Author: Wiegand, Michael (Spoken Language Systems, Saarland University (2010-2018), Leibniz Institute for the German Language (since 2019)).

The data are also available via the R package pradadata.

library(tidyverse)
library(easystats)
data("germeval_train", package = "pradadata")

Use this small text data set before working with the larger germeval data set:

Data

Test string:

text <- c("Abbau, Abbruch ist jetzt", 
          "Test   🧑‍🎓 😄 heute!!", 
          "Abbruch #morgen #perfekt", 
          "Abmachung... LORE IPSUM", 
          "boese ja", "böse nein", 
          "hallo ?! www.google.de", 
          "gut schlecht I am you are he she it is")

n_emo <- c(2, 0, 2, 1, 1, 1, 0, 2)

test_text <-
  data.frame(id = seq_along(text),
             text = text,
             n_emo = n_emo)

test_text
  id                     text n_emo
1  1 Abbau, Abbruch ist jetzt     2
2  2   Test   🧑‍🎓 😄 heute!!     0
3  3 Abbruch #morgen #perfekt     2
 [ reached 'max' / getOption("max.print") -- omitted 5 rows ]

Hints:











Solution

The textfeatures package is currently not on CRAN, but it can be obtained from GitHub (or from the CRAN archive).
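A minimal sketch of a GitHub installation follows. It assumes that the package lives in the repository `mkearney/textfeatures` and that the `remotes` package is installed; neither detail is stated in this post, so check both before running.

```r
# Install textfeatures only if it is not already available.
# Assumption: the GitHub repository is "mkearney/textfeatures".
if (!requireNamespace("textfeatures", quietly = TRUE)) {
  # install.packages("remotes")  # run once if remotes is missing
  remotes::install_github("mkearney/textfeatures")
}
```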

library(tidyverse)
library(tictoc)
library(textfeatures)

Test 1

Here is a test from the package author:

x <- c(
  "this is A!\t sEntence https://github.com about #rstats @github",
  "and another sentence here", "THe following list:\n- one\n- two\n- three\nOkay!?!"
)

## get text features
textfeatures::textfeatures(x, verbose = FALSE)
# A tibble: 3 × 36
  n_urls n_uq_urls n_hashtags n_uq_hashtags n_mentions n_uq_mentions n_chars
   <dbl>     <dbl>      <dbl>         <dbl>      <dbl>         <dbl>   <dbl>
1  1.15      1.15       1.15          1.15       1.15          1.15    0.243
2 -0.577    -0.577     -0.577        -0.577     -0.577        -0.577  -1.10 
3 -0.577    -0.577     -0.577        -0.577     -0.577        -0.577   0.856
# ℹ 29 more variables: n_uq_chars <dbl>, n_commas <dbl>, n_digits <dbl>,
#   n_exclaims <dbl>, n_extraspaces <dbl>, n_lowers <dbl>, n_lowersp <dbl>,
#   n_periods <dbl>, n_words <dbl>, n_uq_words <dbl>, n_caps <dbl>,
#   n_nonasciis <dbl>, n_puncts <dbl>, n_capsp <dbl>, n_charsperword <dbl>,
#   sent_afinn <dbl>, sent_bing <dbl>, sent_syuzhet <dbl>, sent_vader <dbl>,
#   n_polite <dbl>, n_first_person <dbl>, n_first_personp <dbl>,
#   n_second_person <dbl>, n_second_personp <dbl>, n_third_person <dbl>, …
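The values in this output are not raw counts. They are consistent with z-standardized columns: the first string contains exactly one URL and the other two contain none, and standardizing the counts 1, 0, 0 reproduces the printed `n_urls` column. The following is a minimal sketch of that check in base R. The raw counts are read off the input by hand and are not taken from the package's internals, which may apply further transformations.

```r
# Raw URL counts per string, counted by hand from `x`
# (an assumption for illustration, not package output)
n_urls_raw <- c(1, 0, 0)

# z-standardize: subtract the mean, divide by the sample standard deviation
n_urls_z <- as.numeric(scale(n_urls_raw))

round(n_urls_z, 3)
# 1.155 -0.577 -0.577
```

If unscaled counts are preferred, the package appears to offer a `normalize` argument that can be set to `FALSE`; treat this as an assumption and check the package help page.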

Test 2

textfeatures::textfeatures(test_text$text,
                           sentiment = FALSE,
                           word_dims = FALSE)
↪ Counting features in text...
↪ Parts of speech...
↪ Word dimensions started
↪ Normalizing data
✔ Job's done!
# A tibble: 8 × 29
  n_urls n_uq_urls n_hashtags n_uq_hashtags n_mentions n_uq_mentions n_chars
   <dbl>     <dbl>      <dbl>         <dbl>      <dbl>         <dbl>   <dbl>
1      0         0     -0.354        -0.354          0             0  0.532 
2      0         0     -0.354        -0.354          0             0  0.0800
3      0         0      2.47          2.47           0             0  0.589 
4      0         0     -0.354        -0.354          0             0  0.532 
5      0         0     -0.354        -0.354          0             0 -1.86  
6      0         0     -0.354        -0.354          0             0 -1.25  
7      0         0     -0.354        -0.354          0             0  0.471 
8      0         0     -0.354        -0.354          0             0  0.910 
# ℹ 22 more variables: n_uq_chars <dbl>, n_commas <dbl>, n_digits <dbl>,
#   n_exclaims <dbl>, n_extraspaces <dbl>, n_lowers <dbl>, n_lowersp <dbl>,
#   n_periods <dbl>, n_words <dbl>, n_uq_words <dbl>, n_caps <dbl>,
#   n_nonasciis <dbl>, n_puncts <dbl>, n_capsp <dbl>, n_charsperword <dbl>,
#   n_first_person <dbl>, n_first_personp <dbl>, n_second_person <dbl>,
#   n_second_personp <dbl>, n_third_person <dbl>, n_tobe <dbl>,
#   n_prepositions <dbl>
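For intuition about what a few of these columns measure, some simple counts can be approximated by hand in base R. The sketch below uses four of the test strings (those without emoji) and a hypothetical helper `count_matches()`. It is not how textfeatures computes its features, and the package's definitions may differ in detail.

```r
# Four of the test strings from above (the ones without emoji)
text <- c("Abbau, Abbruch ist jetzt",
          "Abbruch #morgen #perfekt",
          "Abmachung... LORE IPSUM",
          "hallo ?! www.google.de")

# Hypothetical helper: number of regex matches per string
count_matches <- function(x, pattern) {
  hits <- gregexpr(pattern, x, perl = TRUE)
  # gregexpr() returns -1 when nothing matches, so count positive positions only
  vapply(hits, function(h) sum(h > 0), integer(1))
}

features <- data.frame(
  n_hashtags = count_matches(text, "#\\w+"),
  n_commas   = count_matches(text, ","),
  n_exclaims = count_matches(text, "!"),
  n_periods  = count_matches(text, "\\."),
  n_caps     = count_matches(text, "[A-Z]")
)

features
```

These are raw counts, so they will not match the standardized values printed by the package, but the ordering of the rows within a column should be comparable.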
