tidymodels-error1introd

tidymodels

statlearning

error

string

Published

November 15, 2023

Exercise

library(tidymodels)

── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.5     ✔ recipes      1.0.8
✔ dials        1.2.0     ✔ rsample      1.2.0
✔ dplyr        1.1.4     ✔ tibble       3.2.1
✔ ggplot2      3.5.0     ✔ tidyr        1.3.1
✔ infer        1.0.5     ✔ tune         1.1.2
✔ modeldata    1.3.0     ✔ workflows    1.1.3
✔ parsnip      1.2.0     ✔ workflowsets 1.0.1
✔ purrr        1.0.2     ✔ yardstick    1.3.0

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter()  masks stats::filter()
✖ dplyr::lag()     masks stats::lag()
✖ recipes::step()  masks stats::step()
• Search for functions across packages at https://www.tidymodels.org/find/

library(tictoc)

# Data:
d_path <- "https://vincentarelbundock.github.io/Rdatasets/csv/palmerpenguins/penguins.csv"
d <- read.csv(d_path)

The following pipeline contains an error. What is it?

set.seed(42)
d_split <- initial_split(d)
d_train <- training(d_split)
d_test <- testing(d_split)


# model:
mod1 <-
  rand_forest(mode = "regression")


# cv:
set.seed(42)
rsmpl <- vfold_cv(d_train)


# recipe:
rec1 <- recipe(body_mass_g ~  ., data = d_train) |> 
  #step_unknown(all_nominal_predictors(), new_level = "NA") |> 
  #step_novel(all_nominal_predictors()) |> 
  step_naomit(all_predictors()) |> 
  step_dummy(all_nominal_predictors()) |> 
  step_nzv(all_predictors()) |> 
  step_normalize(all_predictors()) 



# workflow:
wf1 <-
  workflow() %>% 
  add_model(mod1) %>% 
  add_recipe(rec1)


# fitting:
tic()
wf1_fit <-
  wf1 %>% 
  fit(data = d_train)
toc()

0.256 sec elapsed

preds <- predict(wf1_fit, new_data = d_test)

Error: Missing data in columns: bill_length_mm, bill_depth_mm, flipper_length_mm.

As a check, here is the prepped and baked recipe:

rec1_prepped <- prep(rec1)
d_train_baked <- bake(rec1_prepped, new_data = NULL)

d_train_baked |> 
  head()

# A tibble: 6 × 12
  rownames bill_length_mm bill_depth_mm flipper_length_mm    year body_mass_g
     <dbl>          <dbl>         <dbl>             <dbl>   <dbl>       <int>
1   -1.24          -1.53          0.386            -0.794 -1.29          3450
2    1.45           1.32          0.386            -0.365  1.14          3675
3   -0.212          0.401        -1.97              0.707 -1.29          4500
4   -0.993          0.343         0.887            -0.294 -0.0757        4150
5    0.530          0.879        -0.566             2.07  -0.0757        5800
6   -0.281         -0.957         0.787            -1.15   1.14          3650
# ℹ 6 more variables: species_Chinstrap <dbl>, species_Gentoo <dbl>,
#   island_Dream <dbl>, island_Torgersen <dbl>, sex_female <dbl>,
#   sex_male <dbl>

d_train_baked |> 
  map_int(~ sum(is.na(.)))

         rownames    bill_length_mm     bill_depth_mm flipper_length_mm 
                0                 0                 0                 0 
             year       body_mass_g species_Chinstrap    species_Gentoo 
                0                 0                 0                 0 
     island_Dream  island_Torgersen        sex_female          sex_male 
                0                 0                 0                 0

Solution

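The error is that the recipe makes no changes to the outcome variable, and that its NA handling never reaches the test set: `step_naomit()` defaults to `skip = TRUE`, so it runs when the recipe is prepped on the training data but is skipped at prediction time. The test set, however, contains missing values (NA), both in the outcome and in the three predictors named in the error message: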

colSums(is.na(d_test))

         rownames           species            island    bill_length_mm 
                0                 0                 0                 1 
    bill_depth_mm flipper_length_mm       body_mass_g               sex 
                1                 1                 1                 0 
             year 
                0

One missing value per affected column, to be precise. That single missing value is enough to spoil the broth, so we drop the incomplete rows from the test set:

d_test_nona <-
  d_test |> 
  na.omit()

And now it works.

preds <- predict(wf1_fit, new_data = d_test_nona) 
preds |> 
  head()

# A tibble: 6 × 1
  .pred
  <dbl>
1 3952.
2 3675.
3 3615.
4 3806.
5 3490.
6 3390.
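
Why the recipe's own `step_naomit()` did not prevent the error can be illustrated with a minimal sketch. It relies on the default `skip = TRUE` of `step_naomit()`. The toy data frames `toy_train` and `toy_test` are made up for illustration and are not part of the exercise.

```r
library(recipes)

# Hypothetical toy data standing in for the penguins training and test sets
toy_train <- data.frame(y = c(1, 2, 3, 4), x = c(10, NA, 30, 40))
toy_test  <- data.frame(y = c(5, 6),       x = c(NA, 60))

toy_rec <- recipe(y ~ x, data = toy_train) |>
  step_naomit(all_predictors())  # skip = TRUE is the default

toy_prepped <- prep(toy_rec, training = toy_train)

# Training data: the step is applied during prep(), the incomplete row is gone
n_train_baked <- nrow(bake(toy_prepped, new_data = NULL))

# New data: the step is skipped, so the missing value survives
toy_test_baked <- bake(toy_prepped, new_data = toy_test)
n_test_na <- sum(is.na(toy_test_baked$x))

n_train_baked  # 3 of the 4 training rows remain
n_test_na      # 1 missing value is still present in the test data
```

This mirrors the check above: the baked training data is free of missing values, while the data passed to `predict()` is not.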

This Stack Overflow post deals with a similar problem.
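
Dropping incomplete test rows means that no prediction is produced for them. One possible alternative, which is not the solution used in this exercise, is to impute missing predictors inside the recipe, because imputation steps are also applied to new data. The following sketch uses made-up toy data (`toy_train`, `toy_test`) and assumes that median imputation is acceptable for the use case.

```r
library(recipes)

# Hypothetical toy data, not the penguins data
toy_train <- data.frame(y = c(1, 2, 3, 4), x = c(10, NA, 30, 40))
toy_test  <- data.frame(y = c(5, 6),       x = c(NA, 60))

toy_rec_imp <- recipe(y ~ x, data = toy_train) |>
  # The median is learned from the training data and reused for new data
  step_impute_median(all_numeric_predictors())

toy_prepped_imp <- prep(toy_rec_imp, training = toy_train)

# The missing test value is replaced by the training median of x
toy_test_imputed <- bake(toy_prepped_imp, new_data = toy_test)
toy_test_imputed$x
```

In the penguins pipeline this would presumably mean swapping `step_naomit()` for an imputation step. Nominal predictors would need their own step, for example `step_impute_mode()`, and training rows with a missing outcome would still have to be removed before fitting.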
