tidymodels-remove-na

tidymodels
statlearning
template
string
Published

November 15, 2023

Exercise

Create a recipe that removes the missing values from the penguins dataset.

Notes:

  • Use tidymodels.
  • Use default values unless stated otherwise.
  • Set the random seed to 42.

Solution

# Setup:
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
✔ broom        1.0.5     ✔ recipes      1.0.8
✔ dials        1.2.0     ✔ rsample      1.2.0
✔ dplyr        1.1.3     ✔ tibble       3.2.1
✔ ggplot2      3.4.4     ✔ tidyr        1.3.0
✔ infer        1.0.5     ✔ tune         1.1.2
✔ modeldata    1.2.0     ✔ workflows    1.1.3
✔ parsnip      1.1.1     ✔ workflowsets 1.0.1
✔ purrr        1.0.2     ✔ yardstick    1.2.0
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter()  masks stats::filter()
✖ dplyr::lag()     masks stats::lag()
✖ recipes::step()  masks stats::step()
• Learn how to get started at https://www.tidymodels.org/start/
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.4
✔ lubridate 1.9.3     ✔ stringr   1.5.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ readr::col_factor() masks scales::col_factor()
✖ purrr::discard()    masks scales::discard()
✖ dplyr::filter()     masks stats::filter()
✖ stringr::fixed()    masks recipes::fixed()
✖ dplyr::lag()        masks stats::lag()
✖ readr::spec()       masks yardstick::spec()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tictoc)  # timing



# Data:
d_path <- "https://vincentarelbundock.github.io/Rdatasets/csv/palmerpenguins/penguins.csv"
d <- read_csv(d_path)
Rows: 344 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): species, island, sex
dbl (6): rownames, bill_length_mm, bill_depth_mm, flipper_length_mm, body_ma...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
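# recipe:
rec1 <- recipe(body_mass_g ~ ., data = d) |>
  step_dummy(all_nominal_predictors()) |>   # dummy-code species, island, sex
  step_normalize(all_predictors()) |>       # center and scale all predictors
  step_naomit(all_predictors())             # drop rows with missing predictor values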
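
In the recipe above, dummy coding and normalization run on all rows, and incomplete rows are dropped only at the end. This order appears to explain both the `NA` factor-level warning that `prep()` emits in the check that follows and the baked means that are close to zero but not exactly zero. The following is a minimal sketch of an alternative order, assuming one prefers to drop incomplete rows first. `toy` and `rec_alt` are hypothetical names, and the toy values are made up for illustration; this is not the author's solution.

```r
library(tidymodels)

# Hypothetical toy data standing in for a few penguins columns (values made up)
toy <- tibble(
  body_mass_g    = c(3750, 3800, 3250, 3450, 3650, 3625),
  bill_length_mm = c(39.1, 39.5, NA, 36.7, 39.3, 38.9),
  sex            = c("male", "female", "female", NA, "male", "female")
)

# Assumption: incomplete rows are removed BEFORE dummy coding and normalization
rec_alt <- recipe(body_mass_g ~ ., data = toy) |>
  step_naomit(all_predictors()) |>          # drop rows with missing predictors first
  step_dummy(all_nominal_predictors()) |>   # dummy-code the remaining nominal columns
  step_normalize(all_predictors())          # center and scale on complete rows only

baked_alt <- rec_alt |>
  prep() |>
  bake(new_data = NULL)

baked_alt
```

With this order the normalization statistics are estimated on the complete rows only, so the baked training predictors should have mean 0 and standard deviation 1.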

As a check: the prepped and baked recipe:

rec1_prepped <- prep(rec1)
Warning: There are new levels in a factor: NA
d_train_baked <- bake(rec1_prepped, new_data = NULL)
d_train_baked |> 
  head()
# A tibble: 6 × 11
  rownames bill_length_mm bill_depth_mm flipper_length_mm  year body_mass_g
     <dbl>          <dbl>         <dbl>             <dbl> <dbl>       <dbl>
1    -1.72         -0.883         0.784            -1.42  -1.26        3750
2    -1.71         -0.810         0.126            -1.06  -1.26        3800
3    -1.70         -0.663         0.430            -0.421 -1.26        3250
4    -1.68         -1.32          1.09             -0.563 -1.26        3450
5    -1.67         -0.847         1.75             -0.776 -1.26        3650
6    -1.66         -0.920         0.329            -1.42  -1.26        3625
# ℹ 5 more variables: species_Chinstrap <dbl>, species_Gentoo <dbl>,
#   island_Dream <dbl>, island_Torgersen <dbl>, sex_male <dbl>
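
The check above uses `bake(new_data = NULL)`, which returns the processed training data. Row-removing steps such as `step_naomit()` carry a `skip` argument, and a skipped step is applied during `prep()` but not when baking other data. The sketch below sets `skip = TRUE` explicitly to show the difference; `toy`, `rec_na`, `baked_train`, and `baked_new` are hypothetical names, and the toy values are made up.

```r
library(tidymodels)

# Hypothetical toy data with two incomplete rows (values made up)
toy <- tibble(
  body_mass_g    = c(3750, 3800, 3250, 3450, 3650, 3625),
  bill_length_mm = c(39.1, 39.5, NA, 36.7, 39.3, 38.9),
  sex            = c("male", "female", "female", NA, "male", "female")
)

rec_na <- recipe(body_mass_g ~ ., data = toy) |>
  step_naomit(all_predictors(), skip = TRUE)  # skip stated explicitly for clarity

rec_na_prepped <- prep(rec_na)

# Processed training data: incomplete rows were removed during prep()
baked_train <- bake(rec_na_prepped, new_data = NULL)

# Other data passed to bake(): the skipped step is not applied
baked_new <- bake(rec_na_prepped, new_data = toy)

nrow(baked_train)
nrow(baked_new)
```

If rows with missing values should also be removed from new data, `skip = FALSE` is the setting to consider, at the cost of predictions for fewer rows than were passed in.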
library(easystats)
# Attaching packages: easystats 0.6.0 (red = needs update)
✔ bayestestR  0.13.1   ✔ correlation 0.8.4 
✔ datawizard  0.9.0    ✔ effectsize  0.8.6 
✔ insight     0.19.6   ✔ modelbased  0.8.6 
✔ performance 0.10.8   ✔ parameters  0.21.3
✔ report      0.5.7    ✖ see         0.8.0 

Restart the R-Session and update packages in red with `easystats::easystats_update()`.
describe_distribution(d_train_baked)
Variable          |      Mean |     SD |     IQR |              Range | Skewness | Kurtosis |   n | n_Missing
-------------------------------------------------------------------------------------------------------------
rownames          |      0.02 |   0.99 |    1.71 |      [-1.72, 1.72] |     0.01 |    -1.19 | 333 |         0
bill_length_mm    |      0.01 |   1.00 |    1.69 |      [-2.17, 2.87] |     0.05 |    -0.88 | 333 |         0
bill_depth_mm     |  6.94e-03 |   1.00 |    1.57 |      [-2.05, 2.20] |    -0.15 |    -0.89 | 333 |         0
flipper_length_mm |  3.68e-03 |   1.00 |    1.64 |      [-2.06, 2.14] |     0.36 |    -0.96 | 333 |         0
year              |      0.02 |   0.99 |    2.44 |      [-1.26, 1.19] |    -0.08 |    -1.48 | 333 |         0
body_mass_g       |   4207.06 | 805.22 | 1237.50 | [2700.00, 6300.00] |     0.47 |    -0.73 | 333 |         0
species_Chinstrap |      0.02 |   1.01 |    0.00 |      [-0.50, 2.01] |     1.47 |     0.17 | 333 |         0
species_Gentoo    | -6.46e-03 |   1.00 |    2.08 |      [-0.75, 1.33] |     0.60 |    -1.65 | 333 |         0
island_Dream      |      0.02 |   1.01 |    2.08 |      [-0.75, 1.33] |     0.54 |    -1.71 | 333 |         0
island_Torgersen  |     -0.03 |   0.97 |    0.00 |      [-0.42, 2.37] |     2.07 |     2.30 | 333 |         0
sex_male          |  8.40e-17 |   1.00 |    2.00 |      [-1.01, 0.99] |    -0.02 |    -2.01 | 333 |         0
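
The table reports n = 333 and no missing values, compared with the 344 rows read in. The number of dropped rows can be cross-checked in base R without extra packages. The following is a small sketch on hypothetical toy data; `toy`, `na_per_column`, `n_incomplete`, and `toy_complete` are illustrative names, and the values are made up.

```r
# Hypothetical toy data; column names mirror a few penguins columns
toy <- data.frame(
  body_mass_g    = c(3750, 3800, 3250, 3450, 3650, 3625),
  bill_length_mm = c(39.1, 39.5, NA, 36.7, 39.3, 38.9),
  sex            = c("male", "female", "female", NA, "male", "female")
)

na_per_column <- colSums(is.na(toy))        # missing values per column
n_incomplete  <- sum(!complete.cases(toy))  # rows with at least one NA
toy_complete  <- toy[complete.cases(toy), ] # rows without any NA

na_per_column
n_incomplete
nrow(toy_complete)
```

Note that `complete.cases()` looks at every column, including the outcome, whereas `step_naomit(all_predictors())` only considers predictors, so the two counts agree only when no row is missing the outcome alone.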
