Task: Implement an algorithm with if-statements following the scheme above
# TODO: fill in the branches according to the scheme above
def decision(age, pizza, hamburger, exercise):
    if age < 30:
        ...
    elif ...:
        ...
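Since the referenced scheme (the image) is not included here, the following is only a sketch of the shape such an if-based decision function could take; the thresholds, food counts, and the 'fit'/'unfit' labels are all made-up assumptions:

```python
# Hypothetical rules: the real ones come from the (missing) scheme above.
def decision(age, pizza, hamburger, exercise):
    """Toy rule-based classifier; labels and thresholds are assumptions."""
    if age < 30:
        # in this toy scheme, lots of junk food with no exercise is penalized
        if pizza + hamburger > 5 and exercise == 0:
            return 'unfit'
        return 'fit'
    elif exercise > 2:
        return 'fit'
    else:
        return 'unfit'

print(decision(25, 1, 1, 0))  # 'fit' under these assumed thresholds
```

A decision tree learns exactly this kind of nested if/elif structure automatically from data, which is the point of the exercise.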
### Problem: Decision trees
(an image goes here, muahahahaha)
TODO:
# TODO:
from sklearn.datasets import load_iris
from sklearn import tree
import matplotlib.pyplot as plt

X, y = load_iris(return_X_y=True)
plt.plot(X[:, 2], 'o')
[<matplotlib.lines.Line2D at 0x7f8682238c88>]
# TODO: build a DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree

clf = DecisionTreeClassifier()
clf.fit(X, y)  # the classifier must be fitted before it can be plotted
plot_tree(clf)
Classifying the mushroom dataset - practice
This dataset includes descriptions of hypothetical samples corresponding to 23 species of mushrooms. Each species is labeled as definitely edible, definitely poisonous, or of unknown edibility and not recommended; this last class was merged with the poisonous one. The guide clearly states that there is no simple rule for determining whether a mushroom is edible.
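The cells below assume a DataFrame named `data` with (among others) the `class` and `odor` columns. As a stand-in for the real mushroom CSV (which in the lab would be loaded with `pd.read_csv`), here is a tiny illustrative frame with made-up rows:

```python
import pandas as pd

# Tiny stand-in for the real mushroom DataFrame `data`;
# the rows are illustrative, only the column names match the lab.
data = pd.DataFrame({
    'class': ['edible', 'poisonous', 'edible', 'poisonous'],
    'odor': ['almond', 'foul', 'none', 'pungent'],
})
print(data.shape)  # (4, 2)
```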
# what if-conditions could we create here?
plt.scatter(data['odor'], data['class'])
plt.title('Odor vs class')
plt.xlabel('Odor')
plt.ylabel('Class')
# plt.legend()
plt.show()
We preprocess the data
Categorical data - Encoding
# Example with LabelEncoder:
import numpy as np
from sklearn.preprocessing import LabelEncoder

lista_categorii = ['edible', 'poisonous']
le = LabelEncoder()
le.fit(lista_categorii)
lista_categorii_encoded = le.transform(lista_categorii)
np.unique(lista_categorii_encoded)
array([0, 1])
# what will le.transform(['edible', 'poisonous', 'poisonous']) return?
le.transform(['edible', 'poisonous', 'poisonous'])
array([0, 1, 1])
mapare = dict(poisonous=1, edible=0)  # {'edible': 0, 'poisonous': 1}
mapat = []
for i in ['edible', 'poisonous', 'poisonous']:
    mapat.append(mapare[i])
mapat
[0, 1, 1]
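The hand-written loop above is exactly what pandas' `Series.map` does in a single call; a small sketch:

```python
import pandas as pd

mapare = {'edible': 0, 'poisonous': 1}
s = pd.Series(['edible', 'poisonous', 'poisonous'])
mapat = s.map(mapare)  # applies the dict to every element
print(mapat.tolist())  # [0, 1, 1]
```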
First we make a copy of the data:
data_encoded = data.copy()
# from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder

# Convert all columns to numeric codes
for columnName, columnData in data_encoded.items():  # .iteritems() was removed in pandas 2.0
    le = LabelEncoder()
    le.fit(columnData)
    data_encoded[columnName] = le.transform(columnData)
# print(np.unique(data_encoded['class']))
data_encoded.head()
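Fitting a separate LabelEncoder per column can also be done in one step with sklearn's OrdinalEncoder; a sketch on a toy frame (the row values here are illustrative, not real mushroom data):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({'class': ['edible', 'poisonous', 'edible'],
                    'odor': ['almond', 'foul', 'none']})

# Encodes every column at once; categories are sorted alphabetically per column
enc = OrdinalEncoder()
toy_encoded = pd.DataFrame(enc.fit_transform(toy), columns=toy.columns)
print(toy_encoded['class'].tolist())  # [0.0, 1.0, 0.0]
```

Unlike the per-column LabelEncoder loop, OrdinalEncoder also remembers all category mappings, so the same transform can be reapplied to new data.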
from sklearn.tree import DecisionTreeClassifier, plot_tree

plt.figure(figsize=(20, 20))
plot_tree(clf)
NameError Traceback (most recent call last)
<ipython-input-1-34989c638434> in <module>()
1 from sklearn.tree import DecisionTreeClassifier, plot_tree
----> 2 plt.figure(figsize=(20, 20))
3 plot_tree(clf)
NameError: name 'plt' is not defined
- criterion : {"gini", "entropy"}, default="gini". The function to measure the quality of a split.
- splitter : {"best", "random"}, default="best". The strategy used to choose the split at each node: "best" chooses the best split, "random" the best random split.
- max_depth : int, default=None. The maximum depth of the tree. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
- random_state : int, RandomState instance, default=None. Controls the randomness of the estimator.
- max_leaf_nodes : int, default=None. Grow a tree with max_leaf_nodes in best-first fashion; best nodes are defined by relative reduction in impurity. If None, the number of leaf nodes is unlimited.
- class_weight : dict, list of dict or "balanced", default=None. Weights associated with classes in the form {class_label: weight}. If None, all classes are assumed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y.
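A quick sketch of how these parameters are passed to the constructor, on the iris data from earlier (the specific values are chosen only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

clf = DecisionTreeClassifier(
    criterion='entropy',      # split quality: 'gini' or 'entropy'
    splitter='best',          # 'best' or 'random'
    max_depth=3,              # cap the depth of the tree
    max_leaf_nodes=8,         # cap the number of leaves
    class_weight='balanced',  # reweight classes by inverse frequency
    random_state=0,           # make the fit reproducible
)
clf.fit(X, y)
print(clf.get_depth(), clf.get_n_leaves())
```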
# train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])
from sklearn.model_selection import train_test_split

dataX, dataY = X, Y
train_ratio = 0.75
validation_ratio = 0.15
test_ratio = 0.10
# train is now 75% of the entire data set
x_train, x_test, y_train, y_test = train_test_split(dataX, dataY, test_size=1 - train_ratio)
# test is now 10% of the initial data set
# validation is now 15% of the initial data set
x_val, x_test, y_val, y_test = train_test_split(x_test, y_test, test_size=test_ratio/(test_ratio + validation_ratio))
print(x_train.shape, x_val.shape, x_test.shape)
(6093, 22) (1218, 22) (813, 22)
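The two-stage split above can be checked on synthetic data: 25% is held out first, and that remainder is then split 10/15 between test and validation (a sketch with a fixed random_state; the array contents are dummies):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)  # dummy features
y = np.zeros(1000)                  # dummy labels

train_ratio = 0.75
# second-stage fraction: test_ratio / (test_ratio + validation_ratio) = 0.10 / 0.25
x_train, x_rest, y_train, y_rest = train_test_split(
    X, y, test_size=1 - train_ratio, random_state=0)
x_val, x_test, y_val, y_test = train_test_split(
    x_rest, y_rest, test_size=0.4, random_state=0)

print(len(x_train), len(x_val), len(x_test))  # 750 150 100
```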
We tune the model with various parameters
import random
from sklearn.metrics import accuracy_score, f1_score, recall_score
# Example with train/test/validation: a for loop ... let's try several parameters
random_states_list = [0,1, 2]
# If None then unlimited number of leaf nodes.
# max_leaf_nodes_list = [None, 5, 10, 100]
# max_depth_list = [1, 2, 20, 50]
acc_scores = []
f1_scores = []
recall_scores = []
Y_predicts = []
for random_state in random_states_list:
    max_leaf_nodes = None
    max_depth = 4
    # for max_leaf_nodes in max_leaf_nodes_list:
    #     for max_depth in max_depth_list:
    clf = DecisionTreeClassifier(random_state=random_state,
                                 max_leaf_nodes=max_leaf_nodes,
                                 max_depth=max_depth)
    clf.fit(x_train, y_train)
    Y_predict = clf.predict(x_val)
    Y_predicts.append(Y_predict)
    # print("Accuracy score:", accuracy_score(Y_test, Y_predict))
    acc_scores.append(accuracy_score(y_val, Y_predict))
    # print("F1 score:", f1_score(Y_test, Y_predict))
    f1_scores.append(f1_score(y_val, Y_predict))
    # print("Recall score:", recall_score(Y_test, Y_predict))
    recall_scores.append(recall_score(y_val, Y_predict))
# TODO: subplots
plt.title("How random state influences accuracy")
plt.plot(random_states_list, acc_scores, 'x')
plt.xlabel('Random state')
plt.ylabel('Accuracy score')
plt.show()
# TODO: a similar plot for recall ("How random state influences recall")
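The subplots TODO could be resolved like this, putting the accuracy and recall plots side by side; the score lists here are dummy values standing in for acc_scores and recall_scores computed above:

```python
import matplotlib.pyplot as plt

random_states_list = [0, 1, 2]
acc_scores = [0.99, 0.99, 1.00]     # dummy values for illustration
recall_scores = [0.98, 0.99, 1.00]  # dummy values for illustration

# One row, two panels sharing the figure
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.plot(random_states_list, acc_scores, 'x')
ax1.set_title('How random state influences accuracy')
ax1.set_xlabel('Random state')
ax1.set_ylabel('Accuracy score')

ax2.plot(random_states_list, recall_scores, 'x')
ax2.set_title('How random state influences recall')
ax2.set_xlabel('Random state')
ax2.set_ylabel('Recall score')

fig.tight_layout()
plt.show()
```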