
imsl.data_mining.C45DecisionTree

class C45DecisionTree(response_col_idx, var_type, criteria=0, use_gain_ratio=False, min_n_node=7, min_split=21, max_x_cats=10, max_size=100, max_depth=10, priors=None, response_name='Y', var_names=None, class_names=None, categ_names=None)

Generate a decision tree using the C4.5 method.

Generate a decision tree for a single response variable and two or more predictor variables using the C4.5 method.

Parameters:

response_col_idx : int

Column index of the response variable.

var_type : (N,) array_like

Array indicating the type of each variable.

var_type[i] Type
0 Categorical
1 Ordered Discrete (Low, Med., High)
2 Quantitative or Continuous
3 Ignore this variable

criteria : int, optional

Specifies which criteria the C4.5 method should use in the gain calculations to determine the best split at each node.

criteria Measure
0 Shannon Entropy
1 Gini Index
2 Deviance

Default is 0.

Shannon Entropy

Shannon Entropy is a measure of randomness or uncertainty. For a categorical variable having \(C\) distinct values over a dataset \(S\), the Shannon Entropy is defined as

\(-\sum_{i=1}^{C}p_i\log(p_i)\)

where

\(p_i = Pr(Y=i)\)

and where

\(p_i \log(p_i) = 0\)

if \(p_i=0\).

Gini Index

Gini Index is a measure of statistical dispersion. For a categorical variable having \(C\) distinct values over a dataset \(S\), the Gini index is defined as

\(I(S)=\sum_{\begin{array}{c}i,j=1\\i\ne j\end{array}}^C p(i|S)p(j|S)=1-\sum^C_{i=1}p^2(i|S)\)

where \(p(i|S)\) denotes the probability that the variable is equal to the state \(i\) on the dataset \(S\).

Deviance

Deviance is a measure of the quality of fit. For a categorical variable having \(C\) distinct values over a dataset \(S\), the Deviance measure is

\(\sum_{i=1}^{C}n_i\log(p_i)\)

where

\(p_i = Pr(Y=i)\)

and \(n_i\) is the number of cases with \(Y=i\) on the node.
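For concreteness, here is a minimal sketch (not part of the library) that evaluates the three measures for a hypothetical node with class counts 9 and 5; the counts, and therefore the printed numbers, are invented for illustration.

>>> import numpy as np
>>> counts = np.array([9.0, 5.0])                  # hypothetical class counts n_i on a node
>>> p = counts / counts.sum()                      # class probabilities p_i
>>> nz = p > 0                                     # terms with p_i = 0 contribute 0
>>> shannon = -np.sum(p[nz] * np.log(p[nz]))       # Shannon Entropy (criteria=0)
>>> gini = 1.0 - np.sum(p ** 2)                    # Gini Index (criteria=1)
>>> deviance = np.sum(counts[nz] * np.log(p[nz]))  # Deviance as defined above (criteria=2)
>>> print(round(float(shannon), 3), round(float(gini), 3))
0.652 0.459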

use_gain_ratio : bool, optional

If True, the C4.5 method uses the gain ratio instead of just the gain to determine the best split.

Default is False.

min_n_node : int, optional

Do not split a node if one of its child nodes will have fewer than min_n_node observations.

Default is 7.

min_split : int, optional

Do not split a node if the node has fewer than min_split observations.

Default is 21.

max_x_cats : int, optional

Allow for up to max_x_cats categories for categorical predictor variables.

Default is 10.

max_size : int, optional

Stop growing the tree once it has reached max_size number of nodes.

Default is 100.

max_depth : int, optional

Stop growing the tree once it has reached max_depth number of levels.

Default is 10.

priors : (N,) array_like, optional

An array containing prior probabilities for class membership. The argument is ignored for continuous response variables. By default, the prior probabilities are estimated from the data.

response_name : string, optional

A string representing the name of the response variable.

Default is “Y”.

var_names : tuple, optional

A tuple containing strings representing the names of predictors.

Default is “X0”, “X1”, etc.

class_names : tuple, optional

A tuple containing strings representing the names of the different classes in Y, assuming Y is of categorical type.

Default is “0”, “1”, etc.

categ_names : tuple, optional

A tuple containing strings representing the names of the different category levels for each predictor of categorical type.

Default is “0”, “1”, etc.

Notes

The method C4.5 ([R4]) is a tree partitioning algorithm for a categorical response variable and categorical or quantitative predictor variables. The procedure follows the general steps of decision tree construction, using the information gain or the gain ratio as the splitting criterion. Specifically, the entropy or uncertainty in the response variable with C categories over the full training sample S is defined as

\[E(S)=-\sum_{i=1}^Cp_i\mbox{log}\left(p_i\right)\]

where \(p_i=\mbox{Pr}[Y=i|S]\) is the probability that the response takes on category i on the dataset S. This measure is widely known as the Shannon Entropy. Splitting the dataset further may either increase or decrease the entropy in the response variable. For example, the entropy of Y over a partitioning of S by X, a variable with K categories, is given by

\[E(S,X)=\sum_{k=1}^{K}P\left(S_k\right)E\left(S_k\right)\]

where \(P(S_k)\) denotes the proportion of observations in S that fall into the subset \(S_k\) defined by the k-th category of X, and \(E(S_k)\) is the entropy of the response on \(S_k\).

If any split defined by the values of a categorical predictor decreases the entropy in Y, then it is said to yield information gain:

\[g(S,X)=E(S)-E(S,X)\]

The best splitting variable according to the information gain criterion is the variable yielding the largest information gain, calculated in this manner. A modified criterion is the gain ratio:

\[gR(S,X)=\frac{E(S)-E(S,X)}{E_X\left(S\right)}\]

where

\[E_X\left(S\right)=-\sum_{k=1}^K\nu_k\mbox{log}\left(\nu_k\right)\]

with

\[\nu_k=\mbox{Pr}[X=k|S]\]

Note that \(E_X(S)\) is just the entropy of the variable X over S. The gain ratio is thought to be less biased toward predictors with many categories. C4.5 treats continuous variables similarly, except that only binary splits of the form \(X\le d\) and \(X\gt d\) are considered, where d is a value in the range of X on S. The best split is determined by the split variable and split point that give the largest criterion value. It is possible that no variable meets the threshold for further splitting at the current node, in which case growing stops and the node becomes a terminal node. Otherwise, the node is split, creating two or more child nodes. Then, using the dataset partition defined by the splitting variable and split value, the same procedure is repeated for each child node. Thus a collection of nodes and child nodes is generated; in other words, the tree is grown. Growth stops when one or more stopping conditions are met.
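As an illustration of these formulas (not of the library's internals), the sketch below computes E(S), E(S, X), the information gain, and the gain ratio for a hypothetical categorical predictor x and response y; the data and the helper function are invented for the example.

>>> import numpy as np
>>> def entropy(values):
...     """Shannon entropy of a categorical sample."""
...     _, counts = np.unique(values, return_counts=True)
...     p = counts / counts.sum()
...     return -np.sum(p * np.log(p))
>>> x = np.array([0, 0, 0, 1, 1, 2, 2, 2])   # hypothetical predictor with K=3 categories
>>> y = np.array([0, 0, 1, 1, 1, 0, 1, 1])   # hypothetical categorical response
>>> e_s = entropy(y)                                          # E(S)
>>> parts = [y[x == k] for k in np.unique(x)]
>>> e_sx = sum(len(s) / len(y) * entropy(s) for s in parts)   # E(S, X)
>>> gain = e_s - e_sx                                         # g(S, X)
>>> gain_ratio = gain / entropy(x)                            # gR(S, X), dividing by E_X(S)
>>> print(round(float(gain), 3))
0.184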

References

[R4] Quinlan, J.R. (1993). C4.5: Programs for Machine Learning, Morgan Kaufmann. For the latest information on Quinlan’s algorithms, see http://www.rulequest.com/.

Examples

In this example, we use a small dataset with response variable Play, which indicates whether a golfer plays (1) or does not play (0) golf under weather conditions measured by Temperature, Humidity, Outlook (Sunny (0), Overcast (1), Rainy (2)), and Wind (False (0), True (1)). A decision tree is generated by the C4.5 method. The control parameters are adjusted because of the small data size, and no cross-validation or pruning is performed. Notice that C4.5 splits on Outlook, then Humidity and Wind.

>>> import numpy as np
>>> import imsl.data_mining as dm
>>> xy = np.array([[0, 85.0, 85.0, 0, 0],
...                [0, 80.0, 90.0, 1, 0],
...                [1, 83.0, 78.0, 0, 1],
...                [2, 70.0, 96.0, 0, 1],
...                [2, 68.0, 80.0, 0, 1],
...                [2, 65.0, 70.0, 1, 0],
...                [1, 64.0, 65.0, 1, 1],
...                [0, 72.0, 95.0, 0, 0],
...                [0, 69.0, 70.0, 0, 1],
...                [2, 75.0, 80.0, 0, 1],
...                [0, 75.0, 70.0, 1, 1],
...                [1, 72.0, 90.0, 1, 1],
...                [1, 81.0, 75.0, 0, 1],
...                [2, 71.0, 80.0, 1, 0]])
>>> response_column_index = 4
>>> var_type = np.array([0, 2, 2, 0, 0], dtype=int)
>>> names = ["Outlook", "Temperature", "Humidity", "Wind"]
>>> class_names = ["Don't Play", "Play"]
>>> var_levels = ["Sunny", "Overcast", "Rainy", "False", "True"]
>>> print("Decision Tree using Method C4.5:\n\n")
... 
Decision Tree using Method C4.5:
>>> with dm.C45DecisionTree(response_column_index, var_type,
...                         min_n_node=2, min_split=3, max_x_cats=10,
...                         max_size=50, max_depth=10, var_names=names,
...                         class_names=class_names,
...                         categ_names=var_levels) as decision_tree:
...     decision_tree.train(xy)
...     print(decision_tree)
... 
Decision Tree:
Node 0: Cost = 0.357, N = 14.0, Level = 0, Child Nodes:  1  4  5
P(Y=0) = 0.357
P(Y=1) = 0.643
Predicted Y: Play
   Node 1: Cost = 0.143, N = 5.0, Level = 1, Child Nodes:  2  3
   Rule: Outlook in: { Sunny }
   P(Y=0) = 0.600
   P(Y=1) = 0.400
   Predicted Y: Don't Play
      Node 2: Cost = 0.000, N = 2.0, Level = 2
      Rule: Humidity <= 77.500
      P(Y=0) = 0.000
      P(Y=1) = 1.000
      Predicted Y: Play
      Node 3: Cost = 0.000, N = 3.0, Level = 2
      Rule: Humidity > 77.500
      P(Y=0) = 1.000
      P(Y=1) = 0.000
      Predicted Y: Don't Play
   Node 4: Cost = 0.000, N = 4.0, Level = 1
   Rule: Outlook in: { Overcast }
   P(Y=0) = 0.000
   P(Y=1) = 1.000
   Predicted Y: Play
   Node 5: Cost = 0.143, N = 5.0, Level = 1, Child Nodes:  6  7
   Rule: Outlook in: { Rainy }
   P(Y=0) = 0.400
   P(Y=1) = 0.600
   Predicted Y: Play
      Node 6: Cost = 0.000, N = 3.0, Level = 2
      Rule: Wind in: { False }
      P(Y=0) = 0.000
      P(Y=1) = 1.000
      Predicted Y: Play
      Node 7: Cost = 0.000, N = 2.0, Level = 2
      Rule: Wind in: { True }
      P(Y=0) = 1.000
      P(Y=1) = 0.000
      Predicted Y: Don't Play

In this second example, the C4.5 method is applied to a dataset with two categorical predictor variables and a categorical response variable with four classes.

>>> import numpy as np
>>> import imsl.data_mining as dm
>>> xy = np.array([[2, 0, 2],
...                [1, 0, 0],
...                [2, 1, 3],
...                [0, 1, 0],
...                [1, 2, 0],
...                [2, 2, 3],
...                [2, 2, 3],
...                [0, 1, 0],
...                [0, 0, 0],
...                [0, 1, 0],
...                [1, 2, 0],
...                [2, 0, 2],
...                [0, 2, 0],
...                [2, 0, 1],
...                [0, 0, 0],
...                [2, 0, 1],
...                [1, 0, 0],
...                [0, 2, 0],
...                [2, 0, 1],
...                [1, 2, 0],
...                [0, 2, 2],
...                [2, 1, 3],
...                [1, 1, 0],
...                [2, 2, 3],
...                [1, 2, 0],
...                [2, 2, 3],
...                [2, 0, 1],
...                [2, 1, 3],
...                [1, 2, 0],
...                [1, 1, 0]])
>>> response_column_index = 2
>>> var_type = np.array([0, 0, 0], dtype=int)
>>> names = ['Var1', 'Var2']
>>> class_names = ["c1", "c2", "c3", "c4"]
>>> response_name = 'Response'
>>> var_levels = ["L1", "L2", "L3", "A", "B", "C"]
>>> print("Decision Tree using Method C4.5:\n\n")
... 
Decision Tree using Method C4.5:
>>> with dm.C45DecisionTree(response_column_index, var_type,
...                         min_n_node=5, min_split=10, max_x_cats=10,
...                         max_size=50, max_depth=10,
...                         response_name=response_name, var_names=names,
...                         class_names=class_names,
...                         categ_names=var_levels) as decision_tree:
...     decision_tree.train(xy)
...     print(decision_tree)
... 
Decision Tree:
Node 0: Cost = 0.467, N = 30.0, Level = 0, Child Nodes:  1  2  3
P(Y=0) = 0.533
P(Y=1) = 0.133
P(Y=2) = 0.100
P(Y=3) = 0.233
Predicted Response: c1
   Node 1: Cost = 0.033, N = 8.0, Level = 1
   Rule: Var1 in: { L1 }
   P(Y=0) = 0.875
   P(Y=1) = 0.000
   P(Y=2) = 0.125
   P(Y=3) = 0.000
   Predicted Response: c1
   Node 2: Cost = 0.000, N = 9.0, Level = 1
   Rule: Var1 in: { L2 }
   P(Y=0) = 1.000
   P(Y=1) = 0.000
   P(Y=2) = 0.000
   P(Y=3) = 0.000
   Predicted Response: c1
   Node 3: Cost = 0.200, N = 13.0, Level = 1
   Rule: Var1 in: { L3 }
   P(Y=0) = 0.000
   P(Y=1) = 0.308
   P(Y=2) = 0.154
   P(Y=3) = 0.538
   Predicted Response: c4

Attributes

categ_names Return names of category levels for each categorical predictor.
class_names Return names of different classes in Y.
n_classes Return number of classes assumed by response variable.
n_levels Return number of levels or depth of tree.
n_nodes Return number of nodes or size of tree.
n_preds Return number of predictors used in the model.
pred_n_values Return number of values of predictor variables.
pred_type Return types of predictor variables.
response_name Return name of the response variable.
response_type Return type of the response variable.
var_names Return names of the predictors.

Methods

predict(data[, weights]) Compute predicted values using a decision tree.
train(training_data[, weights]) Train a decision tree using training data and weights.
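
As a usage sketch for the listed methods, re-using the variable definitions from the first (golf) example above: after train builds the tree, predict can be applied to data. The exact column layout predict expects and the form of its return value are assumptions here, not documented in this section.

>>> with dm.C45DecisionTree(response_column_index, var_type,
...                         min_n_node=2, min_split=3, max_x_cats=10,
...                         max_size=50, max_depth=10, var_names=names,
...                         class_names=class_names,
...                         categ_names=var_levels) as decision_tree:
...     decision_tree.train(xy)
...     # Assumption: predict accepts data in the same column layout as train
...     # and returns an array of predicted response values.
...     predictions = decision_tree.predict(xy)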