RogueWave

imsl.data_mining.CHAIDDecisionTree

class CHAIDDecisionTree(response_col_idx, var_type, alphas=(0.05, 0.05, -1.0), min_n_node=7, min_split=21, max_x_cats=10, max_size=100, max_depth=10, priors=None, response_name='Y', var_names=None, class_names=None, categ_names=None)

Generate a decision tree using the CHAID method.

Generate a decision tree for a single response variable and two or more predictor variables using the CHAID method.

Parameters:

response_col_idx : int

Column index of the response variable.

var_type : (N,) array_like

Array indicating the type of each variable.

var_type[i]  Type
0            Categorical
1            Ordered Discrete (Low, Med., High)
2            Quantitative or Continuous
3            Ignore this variable

alphas : tuple, optional

Tuple containing the significance levels: alphas[0] is the significance level for split variable selection, alphas[1] is the significance level for merging categories of a variable, and alphas[2] is the significance level for splitting previously merged categories. Valid values lie in the range 0 < alphas[i] < 1.0, with alphas[2] <= alphas[1]. Setting alphas[2] = -1.0 disables splitting of merged categories.

Default is (0.05, 0.05, -1.0).

min_n_node : int, optional

Do not split a node if one of its child nodes will have fewer than min_n_node observations.

Default is 7.

min_split : int, optional

Do not split a node if the node has fewer than min_split observations.

Default is 21.

max_x_cats : int, optional

Allow up to max_x_cats categories for categorical predictor variables.

Default is 10.

max_size : int, optional

Stop growing the tree once it has reached max_size number of nodes.

Default is 100.

max_depth : int, optional

Stop growing the tree once it has reached max_depth number of levels.

Default is 10.

priors : (N,) array_like, optional

An array containing prior probabilities for class membership. The argument is ignored for continuous response variables. By default, the prior probabilities are estimated from the data.

response_name : string, optional

A string representing the name of the response variable.

Default is “Y”.

var_names : tuple, optional

A tuple containing strings representing the names of predictors.

Default is “X0”, “X1”, etc.

class_names : tuple, optional

A tuple containing strings representing the names of the different classes in Y, assuming Y is of categorical type.

Default is “0”, “1”, etc.

categ_names : tuple, optional

A tuple containing strings representing the names of the different category levels for each predictor of categorical type.

Default is “0”, “1”, etc.

Notes

The CHAID method is appropriate only for categorical or discrete ordered predictor variables. CHAID, due to Kass ([R5]), is an acronym for chi-square automatic interaction detection.

At each node, imsl.data_mining.CHAIDDecisionTree() looks for the best splitting variable as follows: given a predictor variable X, perform a 2-way chi-squared test of association between each possible pair of categories of X and the categories of Y. The least significant result is noted and, if a threshold is met, the two categories of X are merged. Treating this merged category as a single category, repeat the series of tests and determine whether further merging is possible.

If a merged category consists of three or more of the original categories of X, imsl.data_mining.CHAIDDecisionTree() performs a step to test whether the merged categories should be split. This is done by forming all binary partitions of the merged category and testing each one against Y in a 2-way test of association. If the most significant result meets a threshold, the merged category is split accordingly. As long as the threshold in this step is smaller than the threshold in the merge step, the splitting step and the merge step will not cycle back and forth.

Once each predictor has been processed in this manner, the predictor with the most significant qualifying 2-way test with Y is selected as the splitting variable, and its last state of merged categories defines the split at the given node. If none of the tests qualify (by having an adjusted p-value smaller than a threshold), the node is not split. This growing procedure continues until one or more stopping conditions are met.
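The merge step described above can be sketched in plain Python. This is a simplified illustration, not the library's implementation: it assumes a binary response, represents each category of X as a [count(Y=0), count(Y=1)] pair, and omits CHAID's Bonferroni adjustment and the re-split test. For 1 degree of freedom, the chi-squared p-value reduces to the complementary error function.

```python
import math

def chi2_pvalue_2x2(row_a, row_b):
    """P-value of a 2-way chi-squared test of association between two
    categories of X (rows) and a binary Y (columns); df = 1."""
    n = sum(row_a) + sum(row_b)
    cols = [row_a[0] + row_b[0], row_a[1] + row_b[1]]
    rows = [sum(row_a), sum(row_b)]
    stat = 0.0
    for i, obs in enumerate((row_a, row_b)):
        for j in (0, 1):
            expected = rows[i] * cols[j] / n
            stat += (obs[j] - expected) ** 2 / expected
    # For df = 1, the chi-squared survival function is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2.0))

def merge_categories(counts, alpha_merge=0.05):
    """Greedily merge the least-significant pair of X categories while
    its test against Y has a p-value above alpha_merge (alphas[1])."""
    cats = dict(counts)
    while len(cats) > 1:
        keys = list(cats)
        pairs = [(a, b) for i, a in enumerate(keys) for b in keys[i + 1:]]
        a, b = max(pairs, key=lambda p: chi2_pvalue_2x2(cats[p[0]], cats[p[1]]))
        if chi2_pvalue_2x2(cats[a], cats[b]) <= alpha_merge:
            break  # every remaining pair differs significantly
        cats[a + b] = [cats[a][0] + cats[b][0], cats[a][1] + cats[b][1]]
        del cats[a], cats[b]
    return cats

# Categories "A" and "B" have similar response profiles; "C" differs.
groups = merge_categories({"A": [30, 10], "B": [28, 12], "C": [5, 35]})
print(groups)  # "A" and "B" merge into "AB"; "C" stays separate
```

The greedy loop mirrors the text: the *least* significant pair (largest p-value) is merged first, and merging stops once every remaining pair is significantly associated with Y.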

References

[R5] Kass, G.V. (1980). An Exploratory Technique for Investigating Large Quantities of Categorical Data, Applied Statistics, Vol. 29, No. 2, pp. 119-127.

Attributes

categ_names Return names of category levels for each categorical predictor.
class_names Return names of different classes in Y.
n_classes Return number of classes assumed by response variable.
n_levels Return number of levels or depth of tree.
n_nodes Return number of nodes or size of tree.
n_preds Return number of predictors used in the model.
pred_n_values Return number of values of predictor variables.
pred_type Return types of predictor variables.
response_name Return name of the response variable.
response_type Return type of the response variable.
var_names Return names of the predictors.

Methods

predict(data[, weights]) Compute predicted values using a decision tree.
train(training_data[, weights]) Train a decision tree using training data and weights.