Working with Regression Trees in Python

Objectives

Decision Trees are one of the most popular approaches to supervised machine learning. A Decision Tree uses an inverted tree-like structure to model the relationship between independent variables and a dependent variable. A tree with a continuous dependent variable is known as a Regression Tree. In this script, I will:

Load, explore and prepare iris data
Build a Regression Tree model
Visualize the structure of the Regression Tree
Prune the Regression Tree
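To make the idea of a Regression Tree concrete, here is a minimal sketch of how a single split can be chosen: try every threshold on one feature and keep the one that gives the lowest weighted variance of the target in the two resulting groups. The function `best_split` and the toy arrays are made up for illustration; this is not scikit-learn's implementation, which is more general and far more efficient.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, weighted_mse) of the best single split on one feature."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_threshold, best_score = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # no threshold can separate identical values
        left, right = y[:i], y[i:]
        # Weighted average of the within-group variances (MSE around each group mean)
        score = (len(left) * left.var() + len(right) * right.var()) / len(y)
        if score < best_score:
            best_threshold, best_score = (x[i] + x[i - 1]) / 2, score
    return best_threshold, best_score

# Toy data (hypothetical): two clearly separated groups
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.0, 1.1, 0.9, 5.0, 5.2, 4.8])
threshold, score = best_split(x, y)
print(threshold, score)
```

A full tree repeats this search over every feature and then recursively inside each resulting group, predicting the group mean at each leaf.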

1. Load the iris Data

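from sklearn.datasets import load_iris
data = load_iris(as_frame = True)  # features as a DataFrame, species as integer codes
iris = data.data.set_axis(['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width'], axis = 1).assign(Species = data.target_names[data.target])
iris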

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa
...	...	...	...	...	...
145	6.7	3.0	5.2	2.3	virginica
146	6.3	2.5	5.0	1.9	virginica
147	6.5	3.0	5.2	2.0	virginica
148	6.2	3.4	5.4	2.3	virginica
149	5.9	3.0	5.1	1.8	virginica

150 rows × 5 columns

2. Explore the Data

iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Sepal.Length  150 non-null    float64
 1   Sepal.Width   150 non-null    float64
 2   Petal.Length  150 non-null    float64
 3   Petal.Width   150 non-null    float64
 4   Species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB

iris.describe()

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width
count	150.000000	150.000000	150.000000	150.000000
mean	5.843333	3.057333	3.758000	1.199333
std	0.828066	0.435866	1.765298	0.762238
min	4.300000	2.000000	1.000000	0.100000
25%	5.100000	2.800000	1.600000	0.300000
50%	5.800000	3.000000	4.350000	1.300000
75%	6.400000	3.300000	5.100000	1.800000
max	7.900000	4.400000	6.900000	2.500000

%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns

ax = sns.boxplot(data = iris, x = 'Species', y = 'Sepal.Length')

[Figure: box plot of Sepal.Length by Species]

ax=sns.boxplot(data = iris, x='Species', y = 'Sepal.Width')

[Figure: box plot of Sepal.Width by Species]

ax=sns.boxplot(data = iris, x='Species', y = 'Petal.Length')

[Figure: box plot of Petal.Length by Species]

ax=sns.boxplot(data = iris, x='Species', y = 'Petal.Width')

[Figure: box plot of Petal.Width by Species]

ax = sns.scatterplot(data = iris,
                     x = 'Sepal.Length',
                     y = 'Sepal.Width',
                     hue = 'Species',
                     style = 'Species',
                     s = 150)
plt.legend(bbox_to_anchor = (1.02, 1), loc = 'upper left');

[Figure: scatter plot of Sepal.Width against Sepal.Length, colored by Species]

ax = sns.scatterplot(data = iris,
                     x = 'Petal.Length',
                     y = 'Petal.Width',
                     hue = 'Species',
                     style = 'Species',
                     s = 150)
plt.legend(bbox_to_anchor = (1.02, 1), loc = 'upper left');

[Figure: scatter plot of Petal.Width against Petal.Length, colored by Species]

3. Prepare the Data

import pandas as pd

y=iris[['Sepal.Width']]

X=iris[['Species','Sepal.Length',  'Petal.Length', 'Petal.Width']]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size = 0.6,
                                                    stratify = X['Species'],
                                                    random_state = 1234)

X_train.shape, X_test.shape

((90, 4), (60, 4))

X_train.head()

	Species	Sepal.Length	Petal.Length	Petal.Width
61	versicolor	5.9	4.2	1.5
79	versicolor	5.7	3.5	1.0
8	setosa	4.4	1.4	0.2
140	virginica	6.7	5.6	2.4
81	versicolor	5.5	3.7	1.0

X_train = pd.get_dummies(X_train)
X_train.head()

	Sepal.Length	Petal.Length	Petal.Width	Species_setosa	Species_versicolor	Species_virginica
61	5.9	4.2	1.5	0	1	0
79	5.7	3.5	1.0	0	1	0
8	4.4	1.4	0.2	1	0	0
140	6.7	5.6	2.4	0	0	1
81	5.5	3.7	1.0	0	1	0

X_test = pd.get_dummies(X_test)
X_test.head()

	Sepal.Length	Petal.Length	Petal.Width	Species_setosa	Species_versicolor	Species_virginica
60	5.0	3.5	1.0	0	1	0
132	6.4	5.6	2.2	0	0	1
75	6.6	4.4	1.4	0	1	0
119	6.0	5.0	1.5	0	0	1
46	5.1	1.6	0.2	1	0	0
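Encoding the training and test sets separately works here because the stratified split puts every species in both sets, so both end up with the same dummy columns. When that is not guaranteed, one option is to reuse the training columns for the test set. The sketch below uses two small made-up frames (`train` and `test` are hypothetical, not the notebook's data) in which the test set lacks two species.

```python
import pandas as pd

# Hypothetical toy frames: the test set contains only one of the three species
train = pd.DataFrame({'Species': ['setosa', 'versicolor', 'virginica'],
                      'Petal.Width': [0.2, 1.5, 2.4]})
test = pd.DataFrame({'Species': ['setosa', 'setosa'],
                     'Petal.Width': [0.3, 0.2]})

train_encoded = pd.get_dummies(train)
# Reuse the training columns; categories absent from the test set become all-zero columns
test_encoded = pd.get_dummies(test).reindex(columns = train_encoded.columns, fill_value = 0)

print(list(train_encoded.columns))
print(list(test_encoded.columns))
```

Without the `reindex` step, the test frame would have only two columns and the fitted model would reject it.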

4. Train and Evaluate the Regression Tree

from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state = 1234)

model = regressor.fit(X_train, y_train)

model.score(X_test, y_test)

0.33023514005921195

y_test_pred = model.predict(X_test)
y_test_pred

array([2.3 , 2.8 , 3.2 , 2.5 , 3.4 , 3.  , 3.4 , 3.1 , 3.1 , 3.1 , 2.9 ,
       3.1 , 2.5 , 3.2 , 3.5 , 2.8 , 3.3 , 3.6 , 3.6 , 2.8 , 3.  , 3.4 ,
       2.6 , 3.1 , 2.3 , 2.2 , 3.2 , 2.8 , 3.  , 2.5 , 3.  , 3.  , 3.2 ,
       3.1 , 3.1 , 3.2 , 3.4 , 3.6 , 2.3 , 3.2 , 2.8 , 3.1 , 3.  , 3.8 ,
       3.  , 3.4 , 3.4 , 3.6 , 3.8 , 3.45, 2.9 , 2.7 , 2.9 , 3.4 , 2.3 ,
       3.  , 2.9 , 3.4 , 2.9 , 2.9 ])

from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_test_pred)

0.28083333333333343
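The two numbers above measure different things. For a regressor, `score` returns the coefficient of determination (R-squared), while `mean_absolute_error` is the average absolute gap between predictions and true values, in the units of the target. The sketch below computes both by hand on a few made-up values (not the notebook's predictions) and compares them with scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up values for illustration only
y_true = np.array([3.0, 2.5, 3.5, 3.0])
y_pred = np.array([2.8, 2.7, 3.1, 3.0])

# Mean absolute error: average size of the prediction errors
mae = np.mean(np.abs(y_true - y_pred))

# R-squared: 1 minus the ratio of residual to total sum of squares
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, mean_absolute_error(y_true, y_pred))
print(r2, r2_score(y_true, y_pred))
```

An R-squared of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean of the target.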

5. Visualize the Regression Tree

from sklearn import tree
plt.figure(figsize = (15,15))
tree.plot_tree(model,
               feature_names = list(X_train.columns),
               filled = True);

[Figure: the full Regression Tree]

plt.figure(figsize = (15,15))
tree.plot_tree(model,
               feature_names = list(X_train.columns),
               filled = True,
               max_depth = 1);

[Figure: the Regression Tree, drawn to a depth of 1]

importance = model.feature_importances_
importance

array([0.33538054, 0.10600708, 0.54609818, 0.        , 0.01251419,
       0.        ])

feature_importance = pd.Series(importance, index = X_train.columns)
feature_importance.sort_values().plot(kind = 'bar')
plt.ylabel('Importance');

[Figure: bar chart of feature importance]

6. Prune the Regression Tree

Pruning is used in decision tree training to avoid overfitting, which can happen if we allow a tree to grow to its maximum depth. One option is to stop the tree from growing early by applying early stopping rules, known as pre-pruning. Another option is to grow the full tree and then cut it back, known as post-pruning (sometimes just called pruning). To learn more about these two methods, see the articles on pre-pruning and post-pruning.
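As a rough illustration of the difference, the sketch below fits three trees on the scikit-learn copy of the iris data: an unrestricted tree, a pre-pruned tree limited by `max_depth` and `min_samples_leaf`, and a post-pruned tree using `ccp_alpha`. The specific limits (3, 5 and 0.002) are arbitrary values chosen for the example, not recommendations, and the data is used whole rather than split as in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeRegressor

X, _ = load_iris(return_X_y = True)
# Illustration only: predict sepal width (column 1) from the other three measurements
features, target = X[:, [0, 2, 3]], X[:, 1]

# No limits: the tree keeps splitting until it cannot split any further
full = DecisionTreeRegressor(random_state = 1234).fit(features, target)

# Pre-pruning: growth limits are set before training
pre_pruned = DecisionTreeRegressor(random_state = 1234,
                                   max_depth = 3,
                                   min_samples_leaf = 5).fit(features, target)

# Post-pruning: grow fully, then collapse weak branches (cost-complexity pruning)
post_pruned = DecisionTreeRegressor(random_state = 1234,
                                    ccp_alpha = 0.002).fit(features, target)

for name, m in [('full', full), ('pre-pruned', pre_pruned), ('post-pruned', post_pruned)]:
    print(name, 'leaves:', m.get_n_leaves(), 'depth:', m.get_depth())
```

Both pruned trees end up with far fewer leaves than the unrestricted one. The rest of this section uses the post-pruning route.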

model.score(X_train, y_train)

0.9972869047938048

model.score(X_test, y_test)

0.33023514005921195

Let's get the list of effective alphas for the training data.

path = regressor.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas
list(ccp_alphas)

[0.0,
 3.947459643111668e-17,
 3.947459643111668e-17,
 7.894919286223336e-17,
 7.894919286223336e-17,
 9.868649107779169e-17,
 4.166666666660903e-05,
 5.5555555555465556e-05,
 5.55555555555445e-05,
 5.555555555556424e-05,
 5.5555555555593844e-05,
 5.555555555562345e-05,
 6.597222222212274e-05,
 7.407407407402644e-05,
 7.407407407406591e-05,
 7.407407407408565e-05,
 7.407407407414487e-05,
 7.561728395083143e-05,
 8.333333333323781e-05,
 0.00014814814814807263,
 0.0001493827160493745,
 0.0001666666666666039,
 0.00016666666666671244,
 0.000190476190476099,
 0.0002222222222222175,
 0.00022407407407397286,
 0.00023148148148146832,
 0.00029629629629641165,
 0.0003555555555555361,
 0.00036805555555567476,
 0.00037037037037030985,
 0.0004537037037035871,
 0.00046296296296299594,
 0.0004637345679009987,
 0.0005333333333332648,
 0.0006007147498388933,
 0.0006857142857141045,
 0.0007111111111110131,
 0.0007851851851852073,
 0.0008571428571428207,
 0.00088888888888887,
 0.0008888888888888897,
 0.0008888888888891858,
 0.0009074074074073519,
 0.00094814814814832,
 0.0010416666666665877,
 0.0013444444444438769,
 0.0013773504273509158,
 0.00179259259259279,
 0.0019999999999998587,
 0.002156410256410328,
 0.002209046402724355,
 0.002373995797798869,
 0.002624999999999389,
 0.004160401002505384,
 0.005411255411258784,
 0.007378661708034001,
 0.016133333333333923,
 0.024416090731883597,
 0.07823209876543245]

We remove the maximum effective alpha because it corresponds to the trivial tree with just one node.

ccp_alphas = ccp_alphas[:-1]
list(ccp_alphas)

[0.0,
 3.947459643111668e-17,
 3.947459643111668e-17,
 7.894919286223336e-17,
 7.894919286223336e-17,
 9.868649107779169e-17,
 4.166666666660903e-05,
 5.5555555555465556e-05,
 5.55555555555445e-05,
 5.555555555556424e-05,
 5.5555555555593844e-05,
 5.555555555562345e-05,
 6.597222222212274e-05,
 7.407407407402644e-05,
 7.407407407406591e-05,
 7.407407407408565e-05,
 7.407407407414487e-05,
 7.561728395083143e-05,
 8.333333333323781e-05,
 0.00014814814814807263,
 0.0001493827160493745,
 0.0001666666666666039,
 0.00016666666666671244,
 0.000190476190476099,
 0.0002222222222222175,
 0.00022407407407397286,
 0.00023148148148146832,
 0.00029629629629641165,
 0.0003555555555555361,
 0.00036805555555567476,
 0.00037037037037030985,
 0.0004537037037035871,
 0.00046296296296299594,
 0.0004637345679009987,
 0.0005333333333332648,
 0.0006007147498388933,
 0.0006857142857141045,
 0.0007111111111110131,
 0.0007851851851852073,
 0.0008571428571428207,
 0.00088888888888887,
 0.0008888888888888897,
 0.0008888888888891858,
 0.0009074074074073519,
 0.00094814814814832,
 0.0010416666666665877,
 0.0013444444444438769,
 0.0013773504273509158,
 0.00179259259259279,
 0.0019999999999998587,
 0.002156410256410328,
 0.002209046402724355,
 0.002373995797798869,
 0.002624999999999389,
 0.004160401002505384,
 0.005411255411258784,
 0.007378661708034001,
 0.016133333333333923,
 0.024416090731883597]
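Cost-complexity pruning penalizes each leaf by alpha, so a larger alpha should never produce a larger tree, and the largest effective alpha should collapse the tree to its root. The sketch below checks both on the scikit-learn copy of the iris data (used whole here, so the alphas differ from the list above). Sorting and clipping the alphas is a precaution I have added against floating-point noise; it is not part of the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeRegressor

X, _ = load_iris(return_X_y = True)
# Illustration only: predict sepal width (column 1) from the other three measurements
features, target = X[:, [0, 2, 3]], X[:, 1]

path = DecisionTreeRegressor(random_state = 1234).cost_complexity_pruning_path(features, target)
# Precaution: keep the alphas non-negative and in increasing order
alphas = np.sort(np.clip(path.ccp_alphas, 0, None))

leaf_counts = []
for alpha in alphas:
    pruned = DecisionTreeRegressor(random_state = 1234, ccp_alpha = alpha).fit(features, target)
    leaf_counts.append(pruned.get_n_leaves())

print('leaves at the smallest alpha:', leaf_counts[0])
print('leaves at the largest alpha:', leaf_counts[-1])
```

The tree fitted with the largest alpha has a single leaf and predicts the mean for every sample, which is why that alpha is dropped before comparing candidates.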

Next, we train several trees using the different values for alpha.

train_scores, test_scores = [], []
for alpha in ccp_alphas:
    regressor_ = DecisionTreeRegressor(random_state = 1234, ccp_alpha = alpha)
    model_ = regressor_.fit(X_train, y_train)
    train_scores.append(model_.score(X_train, y_train))
    test_scores.append(model_.score(X_test, y_test))

plt.plot(ccp_alphas,
         train_scores,
         marker = "o",
         label = 'train_score',
         drawstyle = "steps-post")
plt.plot(ccp_alphas,
         test_scores,
         marker = "o",
         label = 'test_score',
         drawstyle = "steps-post")
plt.legend()
plt.title('R-squared by alpha');

[Figure: R-squared by alpha for the training and test sets]

test_scores

[0.33023514005921195,
 0.33023514005921195,
 0.33023514005921195,
 0.33023514005921195,
 0.33023514005921195,
 0.33023514005921195,
 0.33989623662035984,
 0.33989623662035984,
 0.33989623662035984,
 0.3375476827601912,
 0.3407502562058755,
 0.3407502562058755,
 0.3404566869733544,
 0.3374201728915204,
 0.3294493234267061,
 0.3309675804676231,
 0.3294493234267061,
 0.3220988901962767,
 0.3207644845939083,
 0.31867688116264736,
 0.3315133227055339,
 0.330659303120018,
 0.3334348667729443,
 0.3337137303110721,
 0.34737804367932534,
 0.3448129009631634,
 0.3461769600233623,
 0.34807478132450853,
 0.34506863238349283,
 0.3658675677057428,
 0.3738384171705572,
 0.3767859708789002,
 0.3830725039389472,
 0.3952007681915851,
 0.40503907381672744,
 0.3847448715609173,
 0.3891118745679958,
 0.3928467868886516,
 0.4042640798363476,
 0.4015974472530024,
 0.3640205854903058,
 0.3811009772006224,
 0.4152617606212555,
 0.40073156628435425,
 0.42957845006177786,
 0.41970384860425103,
 0.4183886584425567,
 0.4876104326741413,
 0.5081524504377486,
 0.4979042154115587,
 0.4591965609704547,
 0.47294501371578745,
 0.43038709152545673,
 0.41744228965800056,
 0.4923501729589913,
 0.4788247151954669,
 0.41815815226633946,
 0.26935377968606145,
 0.273134441660973]

ix = test_scores.index(max(test_scores))
best_alpha = ccp_alphas[ix]
best_alpha

0.00179259259259279

regressor_ = DecisionTreeRegressor(random_state = 1234, ccp_alpha = best_alpha)
model_ = regressor_.fit(X_train, y_train)

model_.score(X_train, y_train)

0.8589647876821336

model_.score(X_test, y_test)

0.5081524504377486
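Because alpha was chosen by maximizing the test score, the test R-squared above is likely to be optimistic. One common alternative, sketched here as an option rather than as this notebook's method, is to choose alpha by cross-validation on the training data only and keep the test set for a final check. The sketch uses the scikit-learn copy of the iris data without the species dummies, so its numbers will not match the ones above; the 5-fold setting is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

X, _ = load_iris(return_X_y = True)
# Illustration only: predict sepal width (column 1) from the other three measurements
features, target = X[:, [0, 2, 3]], X[:, 1]
X_train, X_test, y_train, y_test = train_test_split(features, target,
                                                    train_size = 0.6,
                                                    random_state = 1234)

path = DecisionTreeRegressor(random_state = 1234).cost_complexity_pruning_path(X_train, y_train)
# Clip guards against tiny negative values; unique sorts; [:-1] drops the one-node tree
alphas = np.unique(np.clip(path.ccp_alphas, 0, None))[:-1]

# Score every candidate alpha with 5-fold cross-validation on the training set only
search = GridSearchCV(DecisionTreeRegressor(random_state = 1234),
                      param_grid = {'ccp_alpha': alphas},
                      cv = 5,
                      scoring = 'r2')
search.fit(X_train, y_train)

best_alpha = search.best_params_['ccp_alpha']
print('best alpha:', best_alpha)
print('test R-squared:', search.score(X_test, y_test))
```

With only 90 training rows the cross-validated scores are noisy, so the selected alpha can change with the random seed.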

plt.figure(figsize = (15,15))
tree.plot_tree(model_,
               feature_names = list(X_train.columns),
               filled = True);

[Figure: the pruned Regression Tree]
