germeval-textfeatures01

tidymodels
textmining
prediction
sentiment
germeval
string
Published

November 16, 2023

Exercise

Extract common text features, with the help of the R package of the same name, as part of the feature engineering in a tidymodels classification model.

Then model the dependent variable with a simple linear model.

Use this data set:

The dependent variable (DV) is c1.

Hints:











Solution

Setup

Data

We do not need c2 here:
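Dropping the unused column can be sketched as follows. The rows are made up for illustration, and the column layout (id, text, c1, c2) is an assumption about the data set rather than something the post states.

```r
library(tibble)
library(dplyr)

# Toy stand-in for the raw training data (hypothetical rows;
# the columns id, text, c1 and c2 are assumed, not taken from the post).
d_train_raw <- tibble(
  id   = 1:4,
  text = c("Good morning!", "What nonsense", "Thanks, very kind.", "You idiot!!!"),
  c1   = c("OTHER", "OTHER", "OTHER", "OFFENSE"),
  c2   = c("OTHER", "OTHER", "OTHER", "INSULT")
)

# Drop c2 and keep every other column
d_train <- d_train_raw |>
  select(-c2)

names(d_train)
```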

Recipe

Define the recipe:

step_mutate automatically adds a role in the recipe for the newly created (mutated) variable, so the variable is included as a predictor.
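A recipe of this kind might look like the sketch below. It uses `step_textfeature()` from the textrecipes package with its default settings. The toy data, the object names `d_train` and `rec`, and the choice to give `id` the role "id" are assumptions made for illustration.

```r
library(recipes)
library(textrecipes)
library(tibble)

# Hypothetical toy training data with the assumed columns id, c1 and text
d_train <- tibble(
  id   = 1:4,
  c1   = factor(c("OTHER", "OTHER", "OFFENSE", "OTHER")),
  text = c("Good morning!", "Thanks a lot.", "You idiot!!!", "See you at 10")
)

rec <- recipe(c1 ~ ., data = d_train) |>
  update_role(id, new_role = "id") |>  # keep id in the data, but not as a predictor
  step_textfeature(text)               # derive count-based features from the text column

# One row per step; the recipe is not trained yet
tidy(rec)
```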

Let's take a look:

# A tibble: 1 × 6
  number operation type        trained skip  id               
   <int> <chr>     <chr>       <lgl>   <lgl> <chr>            
1      1 step      textfeature FALSE   FALSE textfeature_OUeIy

Prep and bake:

6.321 sec elapsed
# A tibble: 6 × 29
     id c1      textfeature_text_n_words textfeature_text_n_uq_words
  <int> <fct>                      <int>                       <int>
1     1 OTHER                         15                          15
2     2 OTHER                         19                          19
3     3 OTHER                         11                          10
4     4 OTHER                         19                          18
5     5 OFFENSE                       16                          16
6     6 OTHER                         44                          39
# ℹ 25 more variables: textfeature_text_n_charS <int>,
#   textfeature_text_n_uq_charS <int>, textfeature_text_n_digits <int>,
#   textfeature_text_n_hashtags <int>, textfeature_text_n_uq_hashtags <int>,
#   textfeature_text_n_mentions <int>, textfeature_text_n_uq_mentions <int>,
#   textfeature_text_n_commas <int>, textfeature_text_n_periods <int>,
#   textfeature_text_n_exclaims <int>, textfeature_text_n_extraspaces <int>,
#   textfeature_text_n_caps <int>, textfeature_text_n_lowers <int>, …

step_textfeature extracted the following columns/features:

 [1] "id"                              "c1"                             
 [3] "textfeature_text_n_words"        "textfeature_text_n_uq_words"    
 [5] "textfeature_text_n_charS"        "textfeature_text_n_uq_charS"    
 [7] "textfeature_text_n_digits"       "textfeature_text_n_hashtags"    
 [9] "textfeature_text_n_uq_hashtags"  "textfeature_text_n_mentions"    
[11] "textfeature_text_n_uq_mentions"  "textfeature_text_n_commas"      
[13] "textfeature_text_n_periods"      "textfeature_text_n_exclaims"    
[15] "textfeature_text_n_extraspaces"  "textfeature_text_n_caps"        
[17] "textfeature_text_n_lowers"       "textfeature_text_n_urls"        
[19] "textfeature_text_n_uq_urls"      "textfeature_text_n_nonasciis"   
[21] "textfeature_text_n_puncts"       "textfeature_text_politeness"    
[23] "textfeature_text_first_person"   "textfeature_text_first_personp" 
[25] "textfeature_text_second_person"  "textfeature_text_second_personp"
[27] "textfeature_text_third_person"   "textfeature_text_to_be"         
[29] "textfeature_text_prepositions"  
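The prep-and-bake step that produces such a feature table can be sketched like this. The toy data and object names are again assumptions. `new_data = NULL` returns the processed training data.

```r
library(recipes)
library(textrecipes)
library(tibble)

# Hypothetical toy training data with the assumed columns id, c1 and text
d_train <- tibble(
  id   = 1:4,
  c1   = factor(c("OTHER", "OTHER", "OFFENSE", "OTHER")),
  text = c("Good morning!", "Thanks a lot.", "You idiot!!!", "See you at 10")
)

rec <- recipe(c1 ~ ., data = d_train) |>
  update_role(id, new_role = "id") |>
  step_textfeature(text)

rec_prepped <- prep(rec)                       # estimate the recipe on the training data
d_baked <- bake(rec_prepped, new_data = NULL)  # return the processed training data

names(d_baked)
```

With the defaults, the original text column is replaced by the derived `textfeature_text_*` columns.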

Model

Workflow

Fit

5.78 sec elapsed
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step

• step_textfeature()

── Model ───────────────────────────────────────────────────────────────────────

Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)

Coefficients:
                    (Intercept)         textfeature_text_n_words  
                        1.50724                          0.05024  
    textfeature_text_n_uq_words         textfeature_text_n_charS  
                       -0.05456                         -1.08311  
    textfeature_text_n_uq_charS        textfeature_text_n_digits  
                       -0.01885                          1.12019  
    textfeature_text_n_hashtags   textfeature_text_n_uq_hashtags  
                        0.43821                         -0.30226  
    textfeature_text_n_mentions   textfeature_text_n_uq_mentions  
                       -0.08228                          0.14038  
      textfeature_text_n_commas       textfeature_text_n_periods  
                        1.23295                          1.07770  
    textfeature_text_n_exclaims   textfeature_text_n_extraspaces  
                        0.79465                         -0.20735  
        textfeature_text_n_caps        textfeature_text_n_lowers  
                        1.04501                          1.08349  
        textfeature_text_n_urls       textfeature_text_n_uq_urls  
                             NA                               NA  
   textfeature_text_n_nonasciis        textfeature_text_n_puncts  
                             NA                          1.09470  
    textfeature_text_politeness    textfeature_text_first_person  
                             NA                               NA  
 textfeature_text_first_personp   textfeature_text_second_person  
                             NA                               NA  
textfeature_text_second_personp    textfeature_text_third_person  
                             NA                               NA  
         textfeature_text_to_be    textfeature_text_prepositions  
                             NA                               NA  

Degrees of Freedom: 5008 Total (i.e. Null);  4992 Residual
Null Deviance:      6402 
Residual Deviance: 6100     AIC: 6134
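The model, workflow, and fit steps can be sketched as below. The printed workflow shows `logistic_reg()` fitted via `stats::glm`, which is the default engine, so no engine is set explicitly. The toy data and the object names (`mod`, `wf`, `fit_wf`) are assumptions for illustration.

```r
library(tidymodels)
library(textrecipes)

# Hypothetical toy training data with the assumed columns id, c1 and text
d_train <- tibble(
  id   = 1:8,
  c1   = factor(c("OTHER", "OTHER", "OFFENSE", "OTHER",
                  "OFFENSE", "OTHER", "OFFENSE", "OTHER")),
  text = c(
    "Good morning to everyone out there",
    "Thanks a lot, that was really helpful.",
    "You are such an idiot!!!",
    "See you at 10 tomorrow, @anna",
    "Shut up, nobody wants to hear this!",
    "Interesting article about #politics today",
    "What a stupid, useless bunch of liars",
    "I think we should talk about this calmly."
  )
)

rec <- recipe(c1 ~ ., data = d_train) |>
  update_role(id, new_role = "id") |>
  step_textfeature(text)

mod <- logistic_reg()  # default engine is "glm"

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(mod)

# On data this small, glm() will likely warn about separation or
# rank deficiency; the real data set is far larger.
fit_wf <- fit(wf, data = d_train)
fit_wf
```

The NA coefficients in the output above indicate that glm could not estimate those features, for example because they are constant or perfectly collinear with other features.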

Test set performance

Predictions on the test set:

2.28 sec elapsed

Then add the predictions to the test set so that the truth and the estimate can be compared:

# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary        0.673 
2 kap      binary        0.0800
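Binding predictions to the test set and scoring them can be sketched as follows. To keep the example self-contained, `preds` is a hand-made stand-in for what `predict()` on the fitted workflow would return, namely a tibble with a `.pred_class` column. The toy rows and object names are assumptions.

```r
library(dplyr)
library(tibble)
library(yardstick)

lvls <- c("OFFENSE", "OTHER")

# Hypothetical toy test set; c1 is stored as character here
d_test <- tibble(
  id = 1:6,
  c1 = c("OTHER", "OTHER", "OFFENSE", "OFFENSE", "OTHER", "OTHER")
)

# Stand-in for the output of predict(fit_wf, new_data = d_test)
preds <- tibble(
  .pred_class = factor(
    c("OTHER", "OTHER", "OTHER", "OFFENSE", "OTHER", "OFFENSE"),
    levels = lvls
  )
)

d_test_pred <- d_test |>
  bind_cols(preds) |>
  mutate(c1 = factor(c1, levels = lvls))  # truth must be a factor with the same levels

my_metrics <- metric_set(accuracy, kap)
res <- my_metrics(d_test_pred, truth = c1, estimate = .pred_class)
res
```

Reporting kappa next to accuracy helps with imbalanced classes, because kappa corrects for the agreement expected by chance.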

Baseline

A simple reference model is to always predict the most frequent category:

# A tibble: 2 × 2
  c1          n
  <chr>   <int>
1 OFFENSE  1688
2 OTHER    3321
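The baseline accuracy implied by these counts can be computed directly. The sketch below re-enters the two counts from the table above by hand; the object names are made up for illustration.

```r
library(tibble)
library(dplyr)

# Class counts copied from the table above
class_counts <- tibble(
  c1 = c("OFFENSE", "OTHER"),
  n  = c(1688L, 3321L)
)

baseline <- class_counts |>
  mutate(prop = n / sum(n)) |>  # share of each class
  slice_max(prop, n = 1)        # keep the most frequent class

baseline
```

Always predicting OTHER is right in about 66.3% of these cases (3321 of 5009). If the test set has a similar class balance, the model's accuracy of 0.673 is only marginally better, which agrees with the low kappa of 0.08.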
