tidymodels-tree2

statlearning
trees
tidymodels
speed
string
Published

November 8, 2023

Exercise

Fit the following simple model:

  1. Decision tree

Model formula: am ~ . (data set: mtcars)

The goal here is to reduce the time (and resource consumption) needed for fitting. Use the following approach:

  • Use multiple processor cores

Notes:

  • Tune all parameters that the engine offers.
  • Use the defaults unless stated otherwise.
  • Run a \(v=2\)-fold cross-validation (because the sample is so small).
  • Follow the usual guidelines.

Solution

Setup

library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
✔ broom        1.0.5     ✔ recipes      1.0.8
✔ dials        1.2.0     ✔ rsample      1.2.0
✔ dplyr        1.1.3     ✔ tibble       3.2.1
✔ ggplot2      3.4.4     ✔ tidyr        1.3.0
✔ infer        1.0.5     ✔ tune         1.1.2
✔ modeldata    1.2.0     ✔ workflows    1.1.3
✔ parsnip      1.1.1     ✔ workflowsets 1.0.1
✔ purrr        1.0.2     ✔ yardstick    1.2.0
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter()  masks stats::filter()
✖ dplyr::lag()     masks stats::lag()
✖ recipes::step()  masks stats::step()
• Use tidymodels_prefer() to resolve common conflicts.
data(mtcars)
library(tictoc)  # timing
library(doParallel)  # use multiple cores
Loading required package: foreach

Attaching package: 'foreach'
The following objects are masked from 'package:purrr':

    accumulate, when
Loading required package: iterators
Loading required package: parallel

For classification, tidymodels requires a nominal outcome variable, not a numeric one:

mtcars <-
  mtcars %>% 
  mutate(am = factor(am))
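The conversion can be sanity-checked in base R; after it, am is a factor with the two levels "0" and "1" (a quick check, not part of the original pipeline):

```r
# base-R equivalent of the mutate() call above
mtcars2 <- transform(mtcars, am = factor(am))

is.factor(mtcars2$am)  # TRUE
levels(mtcars2$am)     # "0" "1"
```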

Data splitting

set.seed(42)
d_split <- initial_split(mtcars)
d_train <- training(d_split)
d_test <- testing(d_split)
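initial_split() defaults to prop = 3/4 (assuming a current rsample version), so the 32 rows of mtcars yield 24 training rows and 8 test rows:

```r
# mtcars has 32 rows; initial_split() holds out 1/4 by default
n <- nrow(mtcars)           # 32
n_train <- floor(n * 3/4)   # 24 rows for training
n_test <- n - n_train       # 8 rows for testing
c(train = n_train, test = n_test)
```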

Model(s)

mod_tree <-
  decision_tree(mode = "classification",
                cost_complexity = tune(),
                tree_depth = tune(),
                min_n = tune())

Recipe(s)

rec_plain <- 
  recipe(am ~ ., data = d_train)

Resampling

set.seed(42)
rsmpl <- vfold_cv(d_train, v = 2)

Workflows

wf_tree <-
  workflow() %>%  
  add_recipe(rec_plain) %>% 
  add_model(mod_tree)

Tuning/Fitting

Tuning grid:

tune_grid <- grid_regular(extract_parameter_set_dials(mod_tree), levels = 5)
tune_grid
# A tibble: 125 × 3
   cost_complexity tree_depth min_n
             <dbl>      <int> <int>
 1    0.0000000001          1     2
 2    0.0000000178          1     2
 3    0.00000316            1     2
 4    0.000562              1     2
 5    0.1                   1     2
 6    0.0000000001          4     2
 7    0.0000000178          4     2
 8    0.00000316            4     2
 9    0.000562              4     2
10    0.1                   4     2
# ℹ 115 more rows
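grid_regular() crosses the requested levels of every parameter, so levels = 5 over the three tuned parameters produces 5^3 = 125 candidates, matching the 125 rows above:

```r
# 3 tuned parameters, 5 levels each -> full factorial grid
n_levels <- 5
n_params <- 3
n_levels^n_params  # 125 candidate models
```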

Without parallelization

tic()
fit_tree <-
  tune_grid(object = wf_tree,
            grid = tune_grid,
            metrics = metric_set(roc_auc),
            resamples = rsmpl)
→ A | warning: 21 samples were requested but there were 12 rows in the data. 12 will be used.
There were issues with some computations   A: x1
There were issues with some computations   A: x3
→ B | warning: 30 samples were requested but there were 12 rows in the data. 12 will be used.
There were issues with some computations   A: x3
There were issues with some computations   A: x25   B: x7
→ C | warning: 40 samples were requested but there were 12 rows in the data. 12 will be used.
There were issues with some computations   A: x25   B: x7
There were issues with some computations   A: x25   B: x25   C: x10
There were issues with some computations   A: x26   B: x25   C: x25
There were issues with some computations   A: x35   B: x25   C: x25
There were issues with some computations   A: x50   B: x38   C: x25
There were issues with some computations   A: x50   B: x50   C: x39
There were issues with some computations   A: x50   B: x50   C: x50
toc()
23.317 sec elapsed
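The warnings above are harmless here: with 24 training rows and \(v=2\), each analysis set holds only 12 rows, so the larger tuned min_n values (21, 30, 40) exceed the available data and rpart simply caps them at 12. A sketch of the arithmetic:

```r
# v = 2 folds over 24 training rows -> 12-row analysis sets
n_analysis <- 24 / 2
min_n_requested <- c(21, 30, 40)  # values named in the warnings above
min_n_requested > n_analysis      # all TRUE: hence the warnings
```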

About 45 sec on my machine (4-core MacBook Pro, 2020).

With parallelization

How many CPUs does my computer have?

parallel::detectCores(logical = FALSE)
[1] 4
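Note that logical = FALSE counts physical cores only; the default (logical = TRUE) also counts hyper-threads, and the function may return NA on some platforms:

```r
library(parallel)

detectCores(logical = FALSE)  # physical cores only
detectCores()                 # logical cores (default); often 2x with hyper-threading
```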

Start parallel processing:

cl <- makePSOCKcluster(4)  # create a cluster with 4 workers
registerDoParallel(cl)
tic()
fit_tree2 <-
  tune_grid(object = wf_tree,
            grid = tune_grid,
            metrics = metric_set(roc_auc),
            resamples = rsmpl)
toc()
12.936 sec elapsed

About 17 seconds, clearly faster!
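When the parallel work is finished, it is good practice to shut the cluster down and restore the sequential backend (a sketch, using the cl created above):

```r
stopCluster(cl)           # terminate the 4 worker processes
foreach::registerDoSEQ()  # register the sequential backend again
```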

