DecisionTrees
Decision trees
The if-statement problem
Task: Implement an algorithm using if statements, following the diagram above
The decision tree problem
(image here)
TODO:

Classifying the mushroom dataset - practice
This dataset includes descriptions of hypothetical samples corresponding to 23 species of mushrooms. Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This last class was combined with the poisonous one. The guide clearly states that there is no simple rule for determining the edibility of a mushroom.
Column descriptions
cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s
cap-surface: fibrous=f, grooves=g, scaly=y, smooth=s
....
habitat: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d
More info here: https://www.kaggle.com/uciml/mushroom-classification
We will predict the "class" column, which can take 2 values:
'e' - edible, or
'p' - poisonous
Importing the necessary libraries
Loading the dataset
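The original code cells are not reproduced on this page, so the following is only a minimal sketch of the import and loading step. The file name `mushrooms.csv` is an assumption based on the Kaggle page; to keep the sketch runnable without the file, the first five rows of the dataset are inlined as a string.

```python
import io

import pandas as pd

# First five rows of the dataset, inlined so the sketch runs without the file.
# With the downloaded data you would call pd.read_csv("mushrooms.csv") instead
# (file name assumed from the Kaggle page).
SAMPLE_CSV = """class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,stalk-shape,stalk-root,stalk-surface-above-ring,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u
e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g
e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m
p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u
e,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))
print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first rows, one column per attribute
```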
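| | class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | stalk-root | stalk-surface-above-ring | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | p | x | s | n | t | p | f | c | n | k | e | e | s | s | w | w | p | w | o | p | k | s | u |
| 1 | e | x | s | y | t | a | f | c | b | k | e | c | s | s | w | w | p | w | o | p | n | n | g |
| 2 | e | b | s | w | t | l | f | c | b | n | e | c | s | s | w | w | p | w | o | p | n | n | m |
| 3 | p | x | y | w | t | p | f | c | n | n | e | e | s | s | w | w | p | w | o | p | k | s | u |
| 4 | e | x | s | g | f | n | f | w | b | k | t | e | s | s | w | w | p | w | o | e | n | a | g |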
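| | class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | stalk-root | stalk-surface-above-ring | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 |
| unique | 2 | 6 | 4 | 10 | 2 | 9 | 2 | 2 | 2 | 12 | 2 | 5 | 4 | 4 | 9 | 9 | 1 | 4 | 3 | 5 | 9 | 6 | 7 |
| top | e | x | y | n | f | n | f | c | b | b | t | b | s | s | w | w | p | w | o | p | w | v | d |
| freq | 4208 | 3656 | 3244 | 2284 | 4748 | 3528 | 7914 | 6812 | 5612 | 1728 | 4608 | 3776 | 5176 | 4936 | 4464 | 4384 | 8124 | 7924 | 7488 | 3968 | 2388 | 4040 | 3148 |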
What conclusions can we draw?
What data types do we have?
Do we have missing data?
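These questions can be checked directly with pandas. The sketch below uses a tiny stand-in frame (not the real data) with the same kind of single-letter codes. The `'?'` check is an assumption: some descriptions of this dataset say missing `stalk-root` values are coded as `'?'` rather than left empty, so it is worth verifying on the real file.

```python
import pandas as pd

# Tiny stand-in frame with single-letter codes (illustrative, not the real data).
df = pd.DataFrame({
    "class": ["p", "e", "e", "p", "e"],
    "odor": ["p", "a", "l", "p", "n"],
    "stalk-root": ["e", "c", "c", "e", "?"],
})

print(df.dtypes)            # string/object columns mean categorical codes, not numbers
print(df.isnull().sum())    # number of NaN values per column
print((df == "?").sum())    # number of '?' placeholders, which isnull() does not catch
print(df.describe())        # count / unique / top / freq for each column
```

A column with `unique` equal to 1 in the `describe()` output carries no information for the model, since every row has the same value.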
Visualizing the data

Preprocessing the data
Categorical data - Encoding
First, we make a copy of the data
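A minimal sketch of the encoding step, assuming one `LabelEncoder` per column (the original cell is not shown, so the loop and the `encoders` dictionary are illustrative). The stand-in frame holds only a few columns.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small stand-in frame (illustrative, not the full dataset).
df = pd.DataFrame({
    "class": ["p", "e", "e", "p", "e"],
    "cap-shape": ["x", "x", "b", "x", "x"],
    "odor": ["p", "a", "l", "p", "n"],
})

df_encoded = df.copy()  # work on a copy so the original letters stay available
encoders = {}           # one fitted encoder per column, kept for decoding later

for column in df_encoded.columns:
    encoder = LabelEncoder()
    df_encoded[column] = encoder.fit_transform(df_encoded[column])
    encoders[column] = encoder

print(df_encoded)
print(encoders["class"].classes_)  # position in this array is the integer code
```

`LabelEncoder` assigns codes in sorted order, so `'e'` becomes 0 and `'p'` becomes 1. scikit-learn documents `LabelEncoder` as intended for target labels; `OrdinalEncoder` is the feature-oriented equivalent and could be used for the other columns.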
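| | class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | stalk-root | stalk-surface-above-ring | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 5 | 2 | 4 | 1 | 6 | 1 | 0 | 1 | 4 | 0 | 3 | 2 | 2 | 7 | 7 | 0 | 2 | 1 | 4 | 2 | 3 | 5 |
| 1 | 0 | 5 | 2 | 9 | 1 | 0 | 1 | 0 | 0 | 4 | 0 | 2 | 2 | 2 | 7 | 7 | 0 | 2 | 1 | 4 | 3 | 2 | 1 |
| 2 | 0 | 0 | 2 | 8 | 1 | 3 | 1 | 0 | 0 | 5 | 0 | 2 | 2 | 2 | 7 | 7 | 0 | 2 | 1 | 4 | 3 | 2 | 3 |
| 3 | 1 | 5 | 3 | 8 | 1 | 6 | 1 | 0 | 1 | 5 | 0 | 3 | 2 | 2 | 7 | 7 | 0 | 2 | 1 | 4 | 2 | 3 | 5 |
| 4 | 0 | 5 | 2 | 3 | 0 | 5 | 1 | 1 | 0 | 4 | 1 | 3 | 2 | 2 | 7 | 7 | 0 | 2 | 1 | 0 | 3 | 0 | 1 |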
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
Separating the training data from the classes
First, we print the column names
X will contain the features: all columns except the class
y will contain only the class labels
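A short sketch of this separation, assuming the encoded frame is called `df_encoded` (a hypothetical name) and using only three of its columns for brevity:

```python
import pandas as pd

# Stand-in for the encoded frame, limited to a few columns.
df_encoded = pd.DataFrame({
    "class": [1, 0, 0, 1, 0],
    "cap-shape": [5, 5, 0, 5, 5],
    "odor": [6, 0, 3, 6, 5],
})

print(df_encoded.columns.tolist())      # inspect the column names first

X = df_encoded.drop(columns=["class"])  # features: every column except the label
y = df_encoded["class"]                 # target: the label column only

print(X.head())
print(y.head())
```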
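| | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | stalk-root | stalk-surface-above-ring | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 5 | 2 | 4 | 1 | 6 | 1 | 0 | 1 | 4 | 0 | 3 | 2 | 2 | 7 | 7 | 0 | 2 | 1 | 4 | 2 | 3 | 5 |
| 1 | 5 | 2 | 9 | 1 | 0 | 1 | 0 | 0 | 4 | 0 | 2 | 2 | 2 | 7 | 7 | 0 | 2 | 1 | 4 | 3 | 2 | 1 |
| 2 | 0 | 2 | 8 | 1 | 3 | 1 | 0 | 0 | 5 | 0 | 2 | 2 | 2 | 7 | 7 | 0 | 2 | 1 | 4 | 3 | 2 | 3 |
| 3 | 5 | 3 | 8 | 1 | 6 | 1 | 0 | 1 | 5 | 0 | 3 | 2 | 2 | 7 | 7 | 0 | 2 | 1 | 4 | 2 | 3 | 5 |
| 4 | 5 | 2 | 3 | 0 | 5 | 1 | 1 | 0 | 4 | 1 | 3 | 2 | 2 | 7 | 7 | 0 | 2 | 1 | 0 | 3 | 0 | 1 |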
| | class |
| --- | --- |
| 0 | 1 |
| 1 | 0 |
| 2 | 0 |
| 3 | 1 |
| 4 | 0 |
Train test split
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
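A minimal sketch of the split on toy data. The 80/20 ratio, the `random_state` value, and the use of `stratify` are assumptions, since the original cell is not shown; the toy features and labelling rule are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the encoded features and labels (not the real data).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "odor": rng.integers(0, 9, size=100),
    "gill-size": rng.integers(0, 2, size=100),
})
y = (X["odor"] > 4).astype(int)  # hypothetical labelling rule

# Hold out 20% of the rows for testing; stratify keeps the class
# proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```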
Building the model
Training the model
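Building and training can be sketched in two lines with scikit-learn. The toy data and its labelling rule are invented for illustration, and default hyperparameters are an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the encoded training data (not the real data).
rng = np.random.default_rng(0)
X_train = pd.DataFrame({
    "odor": rng.integers(0, 9, size=300),
    "gill-size": rng.integers(0, 2, size=300),
})
y_train = ((X_train["odor"] > 4) | (X_train["gill-size"] == 1)).astype(int)

model = DecisionTreeClassifier(random_state=42)  # build: default hyperparameters
model.fit(X_train, y_train)                      # train: learn the splits

print(model.get_depth(), model.get_n_leaves())   # size of the learned tree
print(model.predict(X_train.head()))             # predictions for a few rows
```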
Plot decision tree
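One way to draw the fitted tree is `sklearn.tree.plot_tree`. This is a sketch on toy data; the depth limit, figure size, and class names are illustrative choices.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Toy stand-in for the encoded data (not the real data).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "odor": rng.integers(0, 9, size=300),
    "gill-size": rng.integers(0, 2, size=300),
})
y = ((X["odor"] > 4) | (X["gill-size"] == 1)).astype(int)

model = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

fig, ax = plt.subplots(figsize=(10, 6))
annotations = plot_tree(
    model,
    feature_names=list(X.columns),
    class_names=["edible", "poisonous"],  # order must match model.classes_, here [0, 1]
    filled=True,                          # colour nodes by majority class
    ax=ax,
)
# plt.show()  # uncomment when running interactively
```

Each box shows the split condition, the impurity, the number of samples reaching the node, and the majority class.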
Evaluating the model
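A sketch of a typical evaluation on the held-out test set. The choice of metrics (accuracy, confusion matrix, classification report) is an assumption about what the original cells computed, and the data is a toy stand-in.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the encoded data (not the real data).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "odor": rng.integers(0, 9, size=300),
    "gill-size": rng.integers(0, 2, size=300),
})
y = ((X["odor"] > 4) | (X["gill-size"] == 1)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

print(accuracy_score(y_test, y_pred))    # fraction of correct predictions
print(confusion_matrix(y_test, y_pred))  # rows: true class, columns: predicted class
print(classification_report(y_test, y_pred, target_names=["edible", "poisonous"]))
```

For this problem the costly mistake is predicting "edible" for a poisonous mushroom, so the recall of the poisonous class deserves more attention than overall accuracy.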


Cross-validation score
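A single train/test split can be lucky or unlucky; cross-validation averages the score over several splits. A minimal sketch with `cross_val_score`, where the 5 folds and the toy data are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the encoded data (not the real data).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "odor": rng.integers(0, 9, size=300),
    "gill-size": rng.integers(0, 2, size=300),
})
y = ((X["odor"] > 4) | (X["gill-size"] == 1)).astype(int)

# Fit and score the model on 5 different train/validation partitions.
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=5)
print(scores)                        # one accuracy value per fold
print(scores.mean(), scores.std())   # average and spread across folds
```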

Tuning the model
Parameters
What parameters does DecisionTreeClassifier have?
criterion: {"gini", "entropy"}, default="gini". The function to measure the quality of a split.
splitter: {"best", "random"}, default="best". The strategy used to choose the split at each node. Supported strategies are "best" to choose the best split and "random" to choose the best random split.
max_depth: int, default=None. The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
random_state: int, RandomState instance, default=None. Controls the randomness of the estimator.
max_leaf_nodes: int, default=None. Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None, then unlimited number of leaf nodes.
class_weight: dict, list of dict or "balanced", default=None. Weights associated with classes in the form {class_label: weight}. If None, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y.
More parameters here: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
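The parameters listed above are passed as keyword arguments to the constructor. The specific values below are arbitrary examples, not recommendations.

```python
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    criterion="entropy",      # split quality measure
    splitter="best",          # pick the best split at each node
    max_depth=4,              # stop growing below this depth
    max_leaf_nodes=10,        # cap on the number of leaves
    class_weight="balanced",  # reweight classes by inverse frequency
    random_state=42,          # make the result reproducible
)

# get_params() lists every hyperparameter together with its current value.
for name, value in sorted(model.get_params().items()):
    print(f"{name} = {value}")
```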
Train test validation split
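scikit-learn has no single function for a three-way split, so one common approach is to call `train_test_split` twice. The 60/20/20 proportions below are an assumption, and the data is a toy stand-in.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the encoded data (not the real data).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "odor": rng.integers(0, 9, size=100),
    "gill-size": rng.integers(0, 2, size=100),
})
y = (X["odor"] > 4).astype(int)

# Step 1: set aside 20% of all rows as the final test set.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Step 2: split the remaining 80% into train and validation.
# 0.25 of 80% equals 20% of the original data.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)
print(len(X_train), len(X_val), len(X_test))
```

The validation set is used to compare hyperparameter settings; the test set stays untouched until the very end.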
Tuning the model with various parameters



Choosing the best parameters
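A minimal sketch of manual tuning: train one model per parameter combination, score each on the validation set, and keep the best. The parameter grid, the split sizes, and the toy data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the encoded data (not the real data).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "odor": rng.integers(0, 9, size=300),
    "gill-size": rng.integers(0, 2, size=300),
})
y = ((X["odor"] > 4) | (X["gill-size"] == 1)).astype(int)

X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

best_score, best_params = -1.0, None
for criterion in ["gini", "entropy"]:
    for max_depth in [1, 2, 3, None]:
        model = DecisionTreeClassifier(
            criterion=criterion, max_depth=max_depth, random_state=42
        )
        model.fit(X_train, y_train)
        score = model.score(X_val, y_val)  # accuracy on the validation set
        print(criterion, max_depth, round(score, 3))
        if score > best_score:             # keep the first best combination
            best_score = score
            best_params = {"criterion": criterion, "max_depth": max_depth}

# Refit with the winning parameters and report on the untouched test set.
final_model = DecisionTreeClassifier(**best_params, random_state=42)
final_model.fit(X_train, y_train)
print(best_params, final_model.score(X_test, y_test))
```

scikit-learn's `GridSearchCV` automates the same search with cross-validation instead of a single validation set.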
Interpreting the results

Let's look at some examples
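To inspect individual predictions, the test features can be placed side by side with the true and predicted labels. The column names `y_true` and `y_pred` follow the tables on this page; the rest is a sketch on toy data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the encoded data (not the real data).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "odor": rng.integers(0, 9, size=300),
    "gill-size": rng.integers(0, 2, size=300),
})
y = ((X["odor"] > 4) | (X["gill-size"] == 1)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

results = X_test.copy()                    # keep the original row index
results["y_true"] = y_test                 # aligned on the shared index
results["y_pred"] = model.predict(X_test)  # same row order as X_test

print(results.sample(5, random_state=0))   # a few random test rows
```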
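| | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | stalk-root | stalk-surface-above-ring | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat | y_true | y_pred |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1971 | 2 | 0 | 4 | 0 | 5 | 1 | 1 | 0 | 3 | 1 | 3 | 2 | 0 | 7 | 7 | 0 | 2 | 1 | 0 | 3 | 3 | 1 | 0 | 0 |
| 6654 | 2 | 2 | 2 | 0 | 8 | 1 | 0 | 1 | 0 | 1 | 0 | 2 | 2 | 6 | 6 | 0 | 2 | 1 | 0 | 7 | 4 | 2 | 1 | 1 |
| 5606 | 5 | 3 | 4 | 0 | 2 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 2 | 7 | 6 | 0 | 2 | 1 | 0 | 7 | 4 | 2 | 1 | 1 |
| 3332 | 2 | 3 | 3 | 1 | 5 | 1 | 0 | 0 | 5 | 1 | 1 | 2 | 2 | 3 | 6 | 0 | 2 | 1 | 4 | 3 | 5 | 0 | 0 | 0 |
| 6988 | 2 | 2 | 2 | 0 | 7 | 1 | 0 | 1 | 0 | 1 | 0 | 2 | 2 | 6 | 6 | 0 | 2 | 1 | 0 | 7 | 4 | 2 | 1 | 1 |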
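The true positives can be flagged with a boolean column and then filtered and summarised. This sketch treats class 1 (poisonous) as the positive class, which matches the `y_true` and `y_pred` values in the TP tables on this page; the small `results` frame is illustrative.

```python
import pandas as pd

# Illustrative results frame (one feature column plus true and predicted labels).
results = pd.DataFrame({
    "odor": [5, 8, 2, 5, 7],
    "y_true": [0, 1, 1, 0, 1],
    "y_pred": [0, 1, 1, 0, 0],
})

# True positive: the row is poisonous (1) and was predicted poisonous (1).
results["TP"] = (results["y_true"] == 1) & (results["y_pred"] == 1)

true_positives = results[results["TP"]]  # keep only the flagged rows
print(true_positives.head())
print(true_positives.describe())         # numeric summary; the boolean TP column is left out
```

The same pattern with different conditions gives the false negatives (`y_true == 1` and `y_pred == 0`), which are the dangerous cases for this dataset.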

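| | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | stalk-root | stalk-surface-above-ring | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat | y_true | y_pred | TP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 6654 | 2 | 2 | 2 | 0 | 8 | 1 | 0 | 1 | 0 | 1 | 0 | 2 | 2 | 6 | 6 | 0 | 2 | 1 | 0 | 7 | 4 | 2 | 1 | 1 | True |
| 5606 | 5 | 3 | 4 | 0 | 2 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 2 | 7 | 6 | 0 | 2 | 1 | 0 | 7 | 4 | 2 | 1 | 1 | True |
| 6988 | 2 | 2 | 2 | 0 | 7 | 1 | 0 | 1 | 0 | 1 | 0 | 2 | 2 | 6 | 6 | 0 | 2 | 1 | 0 | 7 | 4 | 2 | 1 | 1 | True |
| 5761 | 5 | 3 | 4 | 0 | 8 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 2 | 7 | 6 | 0 | 2 | 1 | 0 | 7 | 4 | 2 | 1 | 1 | True |
| 5798 | 5 | 2 | 3 | 1 | 2 | 1 | 0 | 0 | 3 | 1 | 1 | 2 | 0 | 7 | 7 | 0 | 2 | 1 | 4 | 1 | 3 | 5 | 1 | 1 | True |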
| | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | stalk-root | stalk-surface-above-ring | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat | y_true | y_pred |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 1302.000000 | 1302.000000 | 1302.000000 | 1302.000000 | 1302.000000 | 1302.000000 | 1302.000000 | 1302.000000 | 1302.000000 | 1302.000000 | 1302.000000 | 1302.000000 | 1302.000000 | 1302.000000 | 1302.000000 | 1302.0 | 1302.0 | 1302.000000 | 1302.000000 | 1302.000000 | 1302.00000 | 1302.000000 | 1302.0 | 1302.0 |
| mean | 3.447773 | 2.058372 | 4.404762 | 0.168203 | 3.951613 | 0.996160 | 0.026882 | 0.565284 | 2.920891 | 0.505376 | 0.710445 | 1.360215 | 1.402458 | 5.505376 | 5.506912 | 0.0 | 2.0 | 1.011521 | 1.559140 | 3.969278 | 4.02381 | 1.905530 | 1.0 | 1.0 |
| std | 1.439518 | 1.102566 | 2.640617 | 0.374190 | 2.571043 | 0.061874 | 0.161800 | 0.495910 | 3.303837 | 0.500163 | 0.809170 | 0.561416 | 0.588849 | 2.196962 | 2.193807 | 0.0 | 0.0 | 0.163614 | 1.565615 | 2.839110 | 0.59479 | 1.805573 | 0.0 | 0.0 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 2.0 | 0.000000 | 0.000000 | 1.000000 | 1.00000 | 0.000000 | 1.0 | 1.0 |
| 25% | 2.000000 | 2.000000 | 2.000000 | 0.000000 | 2.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 6.000000 | 6.000000 | 0.0 | 2.0 | 1.000000 | 0.000000 | 1.000000 | 4.00000 | 0.000000 | 1.0 | 1.0 |
| 50% | 3.000000 | 2.000000 | 4.000000 | 0.000000 | 2.000000 | 1.000000 | 0.000000 | 1.000000 | 2.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 6.000000 | 6.000000 | 0.0 | 2.0 | 1.000000 | 2.000000 | 3.000000 | 4.00000 | 1.000000 | 1.0 | 1.0 |
| 75% | 5.000000 | 3.000000 | 5.000000 | 0.000000 | 7.000000 | 1.000000 | 0.000000 | 1.000000 | 7.000000 | 1.000000 | 1.000000 | 2.000000 | 2.000000 | 7.000000 | 7.000000 | 0.0 | 2.0 | 1.000000 | 2.000000 | 7.000000 | 4.00000 | 4.000000 | 1.0 | 1.0 |
| max | 5.000000 | 3.000000 | 9.000000 | 1.000000 | 8.000000 | 1.000000 | 1.000000 | 1.000000 | 11.000000 | 1.000000 | 3.000000 | 2.000000 | 3.000000 | 7.000000 | 8.000000 | 0.0 | 2.0 | 2.000000 | 4.000000 | 7.000000 | 5.00000 | 5.000000 | 1.0 | 1.0 |
Boundaries plot
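A decision boundary can only be drawn in two dimensions, so a plot like this needs a model trained on just two features. The sketch below is one possible approach, not necessarily the one used for the original figure: it trains a shallow tree on two invented integer-coded features, predicts the class for every point of a dense grid, and colours the grid by prediction.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Two invented integer-coded features and a hypothetical labelling rule.
rng = np.random.default_rng(0)
X = rng.integers(0, 9, size=(300, 2)).astype(float)
y = ((X[:, 0] > 4) & (X[:, 1] > 2)).astype(int)

model = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

# Dense grid covering the feature range, one prediction per grid point.
xx, yy = np.meshgrid(np.linspace(-0.5, 8.5, 200), np.linspace(-0.5, 8.5, 200))
grid = np.c_[xx.ravel(), yy.ravel()]
Z = model.predict(grid).reshape(xx.shape)

fig, ax = plt.subplots(figsize=(6, 5))
ax.contourf(xx, yy, Z, alpha=0.3, cmap="coolwarm")  # predicted regions
ax.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", edgecolor="k", s=20)  # training points
ax.set_xlabel("feature 0 (encoded)")
ax.set_ylabel("feature 1 (encoded)")
# plt.show()  # uncomment when running interactively
```

Because each split of a tree tests a single feature against a threshold, the regions are always axis-aligned rectangles.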
