This example uses the Kyphosis dataset. The 81 cases represent 81 children who have undergone surgery to correct a type of spinal deformity known as kyphosis. The response variable is the presence or absence of kyphosis after the surgery. The three predictors are Age, the age of the patient in months; Number, the number of vertebrae involved in the surgery; and Start, the number of the vertebra where the surgery started. This example uses the QUEST method to produce a maximal tree. It also requests predictions for a test data set consisting of 10 "new" cases.
import com.imsl.datamining.decisionTree.*;

public class QUESTKyphosisData {

    public static void main(String[] args) throws Exception {
        // Each row is {Kyphosis, Age, Number, Start}; the response is in column 0.
        double[][] xy = {
            {0, 71, 3, 5}, {0, 158, 3, 14}, {1, 128, 4, 5}, {0, 2, 5, 1},
            {0, 1, 4, 15}, {0, 1, 2, 16}, {0, 61, 2, 17}, {0, 37, 3, 16},
            {0, 113, 2, 16}, {1, 59, 6, 12}, {1, 82, 5, 14}, {0, 148, 3, 16},
            {0, 18, 5, 2}, {0, 1, 4, 12}, {0, 168, 3, 18}, {0, 1, 3, 16},
            {0, 78, 6, 15}, {0, 175, 5, 13}, {0, 80, 5, 16}, {0, 27, 4, 9},
            {0, 22, 2, 16}, {1, 105, 6, 5}, {1, 96, 3, 12}, {0, 131, 2, 3},
            {1, 15, 7, 2}, {0, 9, 5, 13}, {0, 8, 3, 6}, {0, 100, 3, 14},
            {0, 4, 3, 16}, {0, 151, 2, 16}, {0, 31, 3, 16}, {0, 125, 2, 11},
            {0, 130, 5, 13}, {0, 112, 3, 16}, {0, 140, 5, 11}, {0, 93, 3, 16},
            {0, 1, 3, 9}, {1, 52, 5, 6}, {0, 20, 6, 9}, {1, 91, 5, 12},
            {1, 73, 5, 1}, {0, 35, 3, 13}, {0, 143, 9, 3}, {0, 61, 4, 1},
            {0, 97, 3, 16}, {1, 139, 3, 10}, {0, 136, 4, 15}, {0, 131, 5, 13},
            {1, 121, 3, 3}, {0, 177, 2, 14}, {0, 68, 5, 10}, {0, 9, 2, 17},
            {1, 139, 10, 6}, {0, 2, 2, 17}, {0, 140, 4, 15}, {0, 72, 5, 15},
            {0, 2, 3, 13}, {1, 120, 5, 8}, {0, 51, 7, 9}, {0, 102, 3, 13},
            {1, 130, 4, 1}, {1, 114, 7, 8}, {0, 81, 4, 1}, {0, 118, 3, 16},
            {0, 118, 4, 16}, {0, 17, 4, 10}, {0, 195, 2, 17}, {0, 159, 4, 13},
            {0, 18, 4, 11}, {0, 15, 5, 16}, {0, 158, 5, 14}, {0, 127, 4, 12},
            {0, 87, 4, 16}, {0, 206, 4, 10}, {0, 11, 3, 15}, {0, 178, 4, 15},
            {1, 157, 3, 13}, {0, 26, 7, 13}, {0, 120, 2, 13}, {1, 42, 7, 6},
            {0, 36, 4, 13}
        };

        // Ten "new" cases with the same column layout.
        double[][] xyTest = {
            {0, 71, 3, 5}, {1, 128, 4, 5}, {0, 1, 4, 15}, {0, 61, 6, 10},
            {0, 113, 2, 16}, {1, 82, 5, 14}, {0, 148, 3, 16}, {0, 1, 4, 12},
            {0, 1, 3, 16}, {0, 175, 5, 13}
        };

        // Column types: categorical response, then three continuous predictors.
        DecisionTree.VariableType[] varType = {
            DecisionTree.VariableType.CATEGORICAL,
            DecisionTree.VariableType.QUANTITATIVE_CONTINUOUS,
            DecisionTree.VariableType.QUANTITATIVE_CONTINUOUS,
            DecisionTree.VariableType.QUANTITATIVE_CONTINUOUS
        };

        String[] names = {"Age", "Number", "Start"};
        String[] classNames = {"Absent", "Present"};
        String responseName = "Kyphosis";

        // Fit the maximal tree with QUEST; the response is in column 0.
        QUEST dt = new QUEST(xy, 0, varType);
        dt.setMinObsPerChildNode(5);
        dt.setMinObsPerNode(10);
        dt.setMaxNodes(50);
        dt.setPrintLevel(2);
        dt.fitModel();

        double[] predictions = dt.predict(xyTest);
        double predErrSS = dt.getMeanSquaredPredictionError();

        dt.printDecisionTree(responseName, names, classNames, null, true);

        System.out.println("\nPredictions for test data:");
        System.out.printf("%5s%8s%7s%10s\n", names[0], names[1], names[2],
                responseName);
        for (int i = 0; i < xyTest.length; i++) {
            System.out.printf("%5.0f%8.0f%7.0f", xyTest[i][1], xyTest[i][2],
                    xyTest[i][3]);
            int idx = (int) predictions[i];
            System.out.printf("%10s\n", classNames[idx]);
        }
        System.out.printf("\nMean squared prediction error: %f\n", predErrSS);
    }
}
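The value returned by getMeanSquaredPredictionError is the average squared difference between the predicted and actual values of the test responses; because the class labels here are 0 ("Absent") and 1 ("Present"), each squared error is either 0 or 1, so the result equals the fraction of misclassified test cases. A minimal standalone sketch of this arithmetic (independent of the JMSL library; the arrays are copied from the test data and the predictions shown below):

```java
public class MsePredictionSketch {
    public static void main(String[] args) {
        // Actual 0/1 responses: column 0 of xyTest.
        double[] actual    = {0, 1, 0, 0, 0, 1, 0, 0, 0, 0};
        // Predicted class indices for the ten test cases
        // (one error: case 6, actually "Present", predicted "Absent").
        double[] predicted = {0, 1, 0, 0, 0, 0, 0, 0, 0, 0};

        double sum = 0;
        for (int i = 0; i < actual.length; i++) {
            double d = predicted[i] - actual[i];
            sum += d * d;   // 0 or 1 for 0/1 labels
        }
        // Average squared error = misclassification rate here.
        System.out.printf("Mean squared prediction error: %f%n",
                sum / actual.length);
    }
}
```

With one mismatch in ten cases this prints 0.100000, matching the example's output.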
Growing the maximal tree using method QUEST:

Node 0 has split variable 2
Node 1 has split variable 0
Node 2 is a terminal node. It has 7.0 cases--too few cases to split.
Node 3 is a terminal node. It has 6.0 cases--too few cases to split.
Node 4 has split variable 2
Node 5 is a terminal node. It has 6.0 cases--too few cases to split.
Node 6 has split variable 2
Node 7 has split variable 0
Node 8 has split variable 0
Node 8 is a terminal node. The split is too thin having count 2.0.
Node 9 has split variable 1
Node 10 is a terminal node. It has 6.0 cases--too few cases to split.
Node 11 is a terminal node, because it is pure.
Node 11 is a terminal node. It has 7.0 cases--too few cases to split.
Node 12 is a terminal node. Could not find a splitting variable.

Decision Tree:

Node 0: Cost = 0.210, N= 81, Level = 0, Child nodes: 1 4
P(Y=0)= 0.790
P(Y=1)= 0.210
Predicted Kyphosis: Absent

Node 1: Cost = 0.074, N= 13, Level = 1, Child nodes: 2 3
Rule: Start <= 5.155
P(Y=0)= 0.538
P(Y=1)= 0.462
Predicted Kyphosis: Absent

Node 2: Cost = 0.025, N= 7, Level = 2
Rule: Age <= 84.030
P(Y=0)= 0.714
P(Y=1)= 0.286
Predicted Kyphosis: Absent

Node 3: Cost = 0.025, N= 6, Level = 2
Rule: Age > 84.030
P(Y=0)= 0.333
P(Y=1)= 0.667
Predicted Kyphosis: Present

Node 4: Cost = 0.136, N= 68, Level = 1, Child nodes: 5 6
Rule: Start > 5.155
P(Y=0)= 0.838
P(Y=1)= 0.162
Predicted Kyphosis: Absent

Node 5: Cost = 0.012, N= 6, Level = 2
Rule: Start <= 8.862
P(Y=0)= 0.167
P(Y=1)= 0.833
Predicted Kyphosis: Present

Node 6: Cost = 0.074, N= 62, Level = 2, Child nodes: 7 12
Rule: Start > 8.862
P(Y=0)= 0.903
P(Y=1)= 0.097
Predicted Kyphosis: Absent

Node 7: Cost = 0.062, N= 28, Level = 3, Child nodes: 8 9
Rule: Start <= 13.092
P(Y=0)= 0.821
P(Y=1)= 0.179
Predicted Kyphosis: Absent

Node 8: Cost = 0.025, N= 15, Level = 4
Rule: Age <= 91.722
P(Y=0)= 0.867
P(Y=1)= 0.133
Predicted Kyphosis: Absent

Node 9: Cost = 0.037, N= 13, Level = 4, Child nodes: 10 11
Rule: Age > 91.722
P(Y=0)= 0.769
P(Y=1)= 0.231
Predicted Kyphosis: Absent

Node 10: Cost = 0.037, N= 6, Level = 5
Rule: Number <= 3.450
P(Y=0)= 0.500
P(Y=1)= 0.500
Predicted Kyphosis: Absent

Node 11: Cost = 0.000, N= 7, Level = 5
Rule: Number > 3.450
P(Y=0)= 1.000
P(Y=1)= 0.000
Predicted Kyphosis: Absent

Node 12: Cost = 0.012, N= 34, Level = 3
Rule: Start > 13.092
P(Y=0)= 0.971
P(Y=1)= 0.029
Predicted Kyphosis: Absent

Predictions for test data:
  Age  Number  Start  Kyphosis
   71       3      5    Absent
  128       4      5   Present
    1       4     15    Absent
   61       6     10    Absent
  113       2     16    Absent
   82       5     14    Absent
  148       3     16    Absent
    1       4     12    Absent
    1       3     16    Absent
  175       5     13    Absent

Mean squared prediction error: 0.100000
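The printed rules can be read as a chain of nested if/else tests on Start, Age, and Number. The following standalone sketch (not part of the JMSL API) hard-codes the split thresholds from the output above and reproduces the predicted class for each of the ten test cases:

```java
public class KyphosisTreeRules {

    // Walk the fitted tree by hand; thresholds copied from the printed rules.
    static String predict(double age, double number, double start) {
        if (start <= 5.155) {                          // Node 1
            return (age <= 84.030) ? "Absent"          // Node 2
                                   : "Present";        // Node 3
        }
        if (start <= 8.862) {                          // Node 5
            return "Present";
        }
        if (start <= 13.092) {                         // Node 7
            if (age <= 91.722) return "Absent";        // Node 8
            // Node 9: both children of the Number split at 3.450
            // (Nodes 10 and 11) predict Absent.
            return "Absent";
        }
        return "Absent";                               // Node 12
    }

    public static void main(String[] args) {
        // The ten test cases: {Kyphosis, Age, Number, Start}.
        double[][] xyTest = {
            {0, 71, 3, 5}, {1, 128, 4, 5}, {0, 1, 4, 15}, {0, 61, 6, 10},
            {0, 113, 2, 16}, {1, 82, 5, 14}, {0, 148, 3, 16}, {0, 1, 4, 12},
            {0, 1, 3, 16}, {0, 175, 5, 13}
        };
        for (double[] row : xyTest) {
            System.out.printf("%5.0f%8.0f%7.0f%10s%n",
                    row[1], row[2], row[3], predict(row[1], row[2], row[3]));
        }
    }
}
```

Tracing the rules this way matches the predictions table: only the second case (Age 128, Start 5) reaches the "Present" leaf at Node 3, and the sixth case (actually "Present") falls into the mostly-"Absent" Node 8, which is the single error behind the 0.1 mean squared prediction error.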