germeval03-sent-wordvec-xgb

textmining
datawrangling
germeval
prediction
tidymodels
sentiment
string
xgb
Published: December 1, 2023

Task

Build a predictive model for text data. Use sentiments and text features as part of your feature engineering. In addition, use German word vectors for feature engineering.

Use XGBoost (boosted trees) as the learning algorithm.

Use the GermEval 2018 data.

The data are licensed under CC-BY-4.0. Author: Wiegand, Michael (Spoken Language Systems, Saarland University (2010-2018); Leibniz Institute for the German Language (since 2019)).

The data can also be obtained via the R package pradadata.

library(tidyverse)
data("germeval_train", package = "pradadata")
data("germeval_test", package = "pradadata")

The outcome variable is c1. The (only) predictor is text.
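A quick, optional look at the class balance of c1 before modeling (not part of the required pipeline):

germeval_train |> 
  count(c1)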

Notes:

  • Otherwise, follow the general hints of the Datenwerk.
  • Use tidymodels.
  • Use the sentiws lexicon (a minimal sketch of the idea follows this list).
  • ❗ Make sure to remove the variable c2, or at least not to use it.
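
To make the sentiws idea concrete outside the recipe, here is a minimal, hypothetical sketch that scores each text by summing the sentiment values of its tokens. The column names word and value of sentiws are assumptions; check them with head(sentiws) first.

library(tidytext)   # unnest_tokens()

data("sentiws", package = "pradadata")

senti_per_text <-
  germeval_train |> 
  select(id, text) |> 
  unnest_tokens(word, text) |>           # one token per row
  inner_join(sentiws, by = "word") |>    # assumes sentiws has a `word` column
  group_by(id) |> 
  summarise(senti_sum = sum(value))      # assumes a numeric `value` column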

Solution

library(tictoc)      # timing
library(tidymodels)
library(syuzhet)     # sentiment analysis
library(beepr)       # sound notification when done
library(lobstr)      # object size

data("sentiws", package = "pradadata")

d_train <-
  germeval_train |> 
  select(id, c1, text)

A template for a tidymodels pipeline can be found here.
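
Independent of that template, the basic shape of such a pipeline looks roughly as follows (a schematic sketch only; the actual recipe is built below):

# Schematic tidymodels pipeline: recipe -> model -> workflow -> fit/predict
rec_minimal <- recipe(c1 ~ text, data = d_train)      # preprocessing / feature engineering
mod_minimal <- boost_tree(mode = "classification")    # learner
wf_minimal  <-
  workflow() |> 
  add_recipe(rec_minimal) |> 
  add_model(mod_minimal)
# fit(wf_minimal, data = d_train) would not work as-is: `text` still has to be
# turned into numeric features first, which is what the actual recipe below does.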

Learner/Model

mod <-
  boost_tree(mode = "classification",
             learn_rate = .01, 
             tree_depth = 5
             )
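
No engine is set explicitly; for boost_tree(), xgboost is the default engine. If one wants to make that explicit, an equivalent spec would be:

boost_tree(mode = "classification",
           learn_rate = .01,
           tree_depth = 5) |> 
  set_engine("xgboost")   # xgboost is already the default engine for boost_tree()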

Recipe

Path to the word vectors:

path_wordvec <- "/Users/sebastiansaueruser/datasets/word-embeddings/wikipedia2vec/part-0.arrow"
source("https://raw.githubusercontent.com/sebastiansauer/Datenwerk2/main/funs/def_recipe_wordvec_senti.R")

rec <- def_recipe_wordvec_senti(data_train = d_train,
                                path_wordvec = path_wordvec)
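
def_recipe_wordvec_senti() encapsulates the feature engineering. Judging from the 13 steps printed further below, it does roughly the following. This is a hedged reconstruction, not the actual definition: the textrecipes steps, the role of the mutate steps, and the way the embeddings are read (here via arrow, into the hypothetical object word_embeddings_de) are assumptions.

library(textrecipes)
library(arrow)

# Assumption: the .arrow file holds the embeddings in Feather/IPC format,
# with the tokens in the first column and one numeric column per dimension.
word_embeddings_de <- read_feather(path_wordvec)

rec_sketch <-
  recipe(c1 ~ ., data = d_train) |> 
  update_role(id, new_role = "id") |> 
  step_text_normalization(text) |> 
  step_mutate(text_copy = text) |>           # keep a raw copy for the text features (assumption)
  # ... further step_mutate() calls, e.g. sentiment/emoji counts via sentiws (assumption) ...
  step_textfeature(text_copy) |> 
  step_tokenize(text) |> 
  step_stopwords(text, language = "de") |> 
  step_word_embeddings(text, embeddings = word_embeddings_de) |> 
  step_zv(all_predictors()) |> 
  step_normalize(all_numeric_predictors()) |> 
  step_impute_mean(all_numeric_predictors())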

Workflow

source("https://raw.githubusercontent.com/sebastiansauer/Datenwerk2/main/funs/def_df.R")
wf <- def_wf()
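
def_wf() is not shown here; presumably it just bundles the recipe and the model defined above into a workflow, roughly:

# Presumed equivalent of def_wf() (an assumption; the actual definition is in the sourced file):
wf_sketch <-
  workflow() |> 
  add_recipe(rec) |> 
  add_model(mod)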

wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: boost_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
13 Recipe Steps

• step_text_normalization()
• step_mutate()
• step_mutate()
• step_mutate()
• step_mutate()
• step_textfeature()
• step_tokenize()
• step_stopwords()
• step_word_embeddings()
• step_zv()
• ...
• and 3 more steps.

── Model ───────────────────────────────────────────────────────────────────────
Boosted Tree Model Specification (classification)

Main Arguments:
  tree_depth = 5
  learn_rate = 0.01

Computational engine: xgboost 

Check

tic()
rec_prepped <- prep(rec)
toc()
67.325 sec elapsed
rec_prepped
obj_size(rec_prepped)
3.17 GB

Big!
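
Most of this weight presumably sits in the word-embeddings step, which keeps the full embedding table. A quick way to check (an optional diagnostic, not part of the original solution) is to measure each prepped step separately:

# Size of each trained step in the prepped recipe:
map(rec_prepped$steps, lobstr::obj_size)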

tidy(rec_prepped)
# A tibble: 13 × 6
   number operation type               trained skip  id                      
    <int> <chr>     <chr>              <lgl>   <lgl> <chr>                   
 1      1 step      text_normalization TRUE    FALSE text_normalization_QTRCS
 2      2 step      mutate             TRUE    FALSE mutate_z4zTn            
 3      3 step      mutate             TRUE    FALSE mutate_bjCuT            
 4      4 step      mutate             TRUE    FALSE mutate_OVxpj            
 5      5 step      mutate             TRUE    FALSE mutate_TRK3c            
 6      6 step      textfeature        TRUE    FALSE textfeature_6BkkC       
 7      7 step      tokenize           TRUE    FALSE tokenize_csz3N          
 8      8 step      stopwords          TRUE    FALSE stopwords_HU9cX         
 9      9 step      word_embeddings    TRUE    FALSE word_embeddings_2ZNxu   
10     10 step      zv                 TRUE    FALSE zv_FNUiA                
11     11 step      normalize          TRUE    FALSE normalize_bOlig         
12     12 step      impute_mean        TRUE    FALSE impute_mean_kRaUZ       
13     13 step      mutate             TRUE    FALSE mutate_PpudL            
d_rec_baked <- bake(rec_prepped, new_data = NULL)

head(d_rec_baked)
# A tibble: 6 × 121
     id c1      emo_count schimpf_count emoji_count textfeature_text_copy_n_wo…¹
  <dbl> <fct>       <dbl>         <dbl>       <dbl>                        <dbl>
1     1 OTHER       0.575        -0.450      -0.353                      -0.495 
2     2 OTHER      -1.11         -0.450      -0.353                      -0.0874
3     3 OTHER       0.186        -0.450       0.774                      -0.903 
4     4 OTHER       0.202        -0.450      -0.353                      -0.0874
5     5 OFFENSE     0.168        -0.450      -0.353                      -0.393 
6     6 OTHER      -1.12         -0.450      -0.353                       2.46  
# ℹ abbreviated name: ¹​textfeature_text_copy_n_words
# ℹ 115 more variables: textfeature_text_copy_n_uq_words <dbl>,
#   textfeature_text_copy_n_charS <dbl>,
#   textfeature_text_copy_n_uq_charS <dbl>,
#   textfeature_text_copy_n_digits <dbl>,
#   textfeature_text_copy_n_hashtags <dbl>,
#   textfeature_text_copy_n_uq_hashtags <dbl>, …
sum(is.na(d_rec_baked))
[1] 0
obj_size(d_rec_baked)
4.85 MB

Fit

tic()
fit_wordvec_senti_xgb <-
  fit(wf,
      data = d_train)
toc()
35.314 sec elapsed
beep()
fit_wordvec_senti_xgb
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: boost_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
13 Recipe Steps

• step_text_normalization()
• step_mutate()
• step_mutate()
• step_mutate()
• step_mutate()
• step_textfeature()
• step_tokenize()
• step_stopwords()
• step_word_embeddings()
• step_zv()
• ...
• and 3 more steps.

── Model ───────────────────────────────────────────────────────────────────────
##### xgb.Booster
raw: 42.4 Kb 
call:
  xgboost::xgb.train(params = list(eta = 0.01, max_depth = 5, gamma = 0, 
    colsample_bytree = 1, colsample_bynode = 1, min_child_weight = 1, 
    subsample = 1), data = x$data, nrounds = 15, watchlist = x$watchlist, 
    verbose = 0, nthread = 1, objective = "binary:logistic")
params (as set within xgb.train):
  eta = "0.01", max_depth = "5", gamma = "0", colsample_bytree = "1", colsample_bynode = "1", min_child_weight = "1", subsample = "1", nthread = "1", objective = "binary:logistic", validate_parameters = "TRUE"
xgb.attributes:
  niter
callbacks:
  cb.evaluation.log()
# of features: 119 
niter: 15
nfeatures : 119 
evaluation_log:
    iter training_logloss
       1        0.6904064
       2        0.6877236
---                      
      14        0.6590144
      15        0.6568817

Object size:

lobstr::obj_size(fit_wordvec_senti_xgb)
3.17 GB

Big!

As we have seen, the recipe is huge.

library(butcher)
out <- butcher(fit_wordvec_senti_xgb)
lobstr::obj_size(out)
3.16 GB
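
Butchering barely helps here: the weight apparently does not come from closures or stored training data in the model fit, but presumably from the embedding table kept inside the prepped recipe. To see which components are heaviest, one could weigh them (shown as a suggestion; output omitted):

# List the heaviest components of the fitted workflow:
weigh(fit_wordvec_senti_xgb) |> 
  slice_head(n = 10)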

Test set performance

Predictions on the test set:

tic()
preds <-
  predict(fit_wordvec_senti_xgb, new_data = germeval_test)
toc()
22.669 sec elapsed
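
predict() without further arguments returns the hard class labels. If class probabilities were needed as well (e.g., for ROC-AUC), a second call with type = "prob" would provide them (not required for the metrics below):

preds_prob <-
  predict(fit_wordvec_senti_xgb, new_data = germeval_test, type = "prob")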

And add the predictions to the test set, so that truth and estimate can be compared:

d_test <-
  germeval_test |> 
  bind_cols(preds) |> 
  mutate(c1 = as.factor(c1))
my_metrics <- metric_set(accuracy, f_meas)
my_metrics(d_test,
           truth = c1,
           estimate = .pred_class)
# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.689
2 f_meas   binary         0.400
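
Note that f_meas treats the first factor level of c1 as the "event" by default. For a fuller picture of where the errors occur, one could also inspect the confusion matrix:

conf_mat(d_test,
         truth = c1,
         estimate = .pred_class)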