
imsl.regression.LogisticRegression.predict

LogisticRegression.predict(data, responses=None, frequencies=None, confidence=95.0)

Predict responses for new predictor samples.

Parameters:

data : (n,) or (n, n_predictors) array_like

Array containing the values of the n_predictors predictor variables. Each of the n rows of data describes one predictor sample.

responses : Object, optional

An object of type ResponseFormatRef, ResponseFormatID or ResponseFormatAll containing information on the actual responses corresponding to the predictor samples. Only the object's class_counts_matrix attribute, the n by n_classes matrix of response counts, is used.

frequencies : (n,) array_like, optional

Array containing the number of replications or trials for each of the n observations. frequencies is not used if optional argument responses is present.

Default: frequencies = 1.

confidence : float, optional

Confidence level used in the calculation of the prediction intervals. For each predicted value, confidence % prediction intervals are provided.

Returns:

A named tuple with the following fields:

predictions : (n, n_classes) ndarray

Array containing the predicted responses.

pred_limits : tuple

A two-element tuple of ndarrays of shape (n, n_classes) containing the lower (pred_limits[0]) and upper (pred_limits[1]) prediction limits.

pred_err : float

The mean squared prediction error when responses is present. Set to numpy.NaN otherwise.

Notes

Method predict calculates the predicted outcomes for a binomial or multinomial response variable given an estimated logistic regression model and new observations of the predictor variables.

For a binary response y, the objective is to estimate the conditional probability of success, \(\pi_1(x)=\Pr(y=1|x)\), where \(x=(x_1,x_2,\ldots,x_p)^T\) is a realization of p predictors. In particular, the estimated probability of success is

\[\hat{\pi}_1(x) = \frac{\exp(\hat{\eta}_1)}{1+\exp(\hat{\eta}_1)}\,,\]

where

\[\hat{\eta}_1 = \hat{\beta}_{10}+x^T\hat{\beta}_1\]

and

\[\hat{\beta}_{10},\; \hat{\beta}_1=(\hat{\beta}_{11}, \hat{\beta}_{12},\ldots,\hat{\beta}_{1p})^T\]

are the coefficient estimates. Then \(\hat{y}_i=n_i\hat{\pi}_1(x_i)\), where \(n_i\) is the number of trials or replications for the i-th observation (argument frequencies). That is, \(\hat{y}_i\) is the expected value of the response under the estimated model, given the values of the predictor variables.
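
To make the binary case concrete, the following minimal sketch evaluates these expressions directly with NumPy, reusing the coefficient estimates and the first new dose printed in Example 1 below. The names beta10, beta1, eta1, pi1 and yhat are purely illustrative and are not part of the imsl API; this is not the library's internal code.

>>> import numpy as np
>>> # Intercept and slope as printed in Example 1 below; x is one new
>>> # predictor sample (log dosage 1.66), n its number of trials.
>>> beta10, beta1 = -60.76, np.array([34.3])
>>> x = np.array([1.66])
>>> n = 16
>>> eta1 = beta10 + x @ beta1                    # linear predictor (logit)
>>> pi1 = np.exp(eta1) / (1.0 + np.exp(eta1))    # estimated P(y = 1 | x)
>>> yhat = n * pi1                               # expected successes, about 0.34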

Similarly, for a multinomial response, with class K the reference class,

\[\hat{\pi}_k(x_i)= \frac{\exp(\hat{\eta}_{ik})}{\sum_{l=1}^K\exp(\hat{\eta}_{il})}= \frac{\exp(\hat{\eta}_{ik})} {1+\sum_{l=1}^{K-1}\exp(\hat{\eta}_{il})}\,.\]

Then,

\[\hat{\pi}_K(x_i)= 1 - \sum_{l=1}^{K-1}\hat{\pi}_l(x_i)\,,\]

and \(\hat{y}_{ik}=n_i\hat{\pi}_k(x_i)\). If the actual responses are given, the mean squared prediction error is

\[\text{mspe}=\frac{1}{NK}\sum_{k=1}^K\sum_{i=1}^N (\hat{y}_{ik} - y_{ik})^2\,.\]
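
The multinomial probabilities and the mean squared prediction error can be sketched in the same way. The arrays below (eta, pi, yhat, y) are hypothetical values chosen only to illustrate the formulas above; they do not come from the library.

>>> import numpy as np
>>> # Hypothetical logits eta_il, l = 1, ..., K-1, for one observation with
>>> # K = 4 classes; the reference class K has logit 0, so exp(0) = 1 appears
>>> # as the last numerator term and in the denominator.
>>> eta = np.array([0.4, -0.2, 1.1])
>>> pi = np.append(np.exp(eta), 1.0) / (1.0 + np.exp(eta).sum())
>>> # Mean squared prediction error over N observations and K classes, given
>>> # predicted counts yhat (N, K) and observed counts y (N, K):
>>> yhat = np.array([[2.0, 1.0, 4.0, 3.0]])
>>> y = np.array([[3.0, 1.0, 3.0, 3.0]])
>>> mspe = np.mean((yhat - y) ** 2)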

100(1-\(\alpha\))% prediction intervals, where 100(1-\(\alpha\)) is the value of argument confidence, are provided for the predicted values by first finding the prediction standard errors, SE, of the logits, \(\hat{\eta}_{ik}=\hat{\beta}_{0k}+x_i^T\hat{\beta}_k\), and then evaluating

\[\frac{\exp(\hat{\eta}_{ik}\pm z_{\alpha/2}SE(\hat{\eta}_{ik}))} {1+\sum_{l=1}^{K-1}\exp(\hat{\eta}_{il}\pm z_{\alpha/2} SE(\hat{\eta}_{il}))}\]

to obtain the upper and lower limits for \(\hat{\pi}_k(x_i)\), where \(z_{\alpha/2}\) is the upper \(\alpha/2\) quantile of the standard normal distribution. Note that properties of the prediction intervals are valid only if the new observations are inside the range of the original data used to fit the model. Generally, the model should not be used to extrapolate outside the range of the original data. See [R27] for further details.
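
As an illustration of this interval construction (again, not the library's internal code), the binary case can be sketched with scipy.stats.norm. The values of eta1 and se are hypothetical stand-ins for the fitted logit and its prediction standard error at a single observation; alpha corresponds to the default confidence level of 95%.

>>> import numpy as np
>>> from scipy.stats import norm
>>> eta1, se, alpha = 0.8, 0.3, 0.05
>>> z = norm.ppf(1.0 - alpha / 2.0)        # upper alpha/2 quantile, about 1.96
>>> lower = np.exp(eta1 - z * se) / (1.0 + np.exp(eta1 - z * se))
>>> upper = np.exp(eta1 + z * se) / (1.0 + np.exp(eta1 + z * se))
>>> # (lower, upper) bound the estimated success probability pi_1(x_i).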

References

[R27] Hosmer, David W., Stanley Lemeshow, and Rodney X. Sturdivant (2013), Applied Logistic Regression, Third Edition, John Wiley & Sons, New Jersey.
[R28] Prentice, Ross L. (1976), A generalization of the probit and logit methods for dose response curves, Biometrics, 32, 761-768.

Examples

Example 1:

The model fit to the beetle mortality data of [R28] is used to predict the expected mortality at three new doses. For the original data, see Example 1 of class LogisticRegression.

Log Dosage    Number of Beetles Exposed    Number of Deaths
   1.66                  16                      ??
   1.87                  22                      ??
   1.71                  11                      ??

>>> import numpy as np
>>> import imsl.regression as reg
>>> y1 = np.array([6, 13, 18, 28, 52, 53, 61, 60])
>>> x1 = np.array([1.69, 1.724, 1.755, 1.784, 1.811,
...                1.836, 1.861, 1.883])
>>> x2 = np.array([1.66, 1.87, 1.71])
>>> freqs1 = np.array([59, 60, 62, 56, 63, 59, 62, 60])
>>> freqs2 = np.array([16, 22, 11])
>>> n_predictors = 1
>>> n_classes = 2
>>> resp1 = reg.ResponseFormatRef(y1, frequencies=freqs1)
>>> model = reg.LogisticRegression(n_predictors, n_classes)
>>> model.fit(x1, resp1)
>>> np.set_printoptions(precision=2)
>>> print("Coefficient estimates:\n" +
...       str(model.coefficients)) 
Coefficient estimates:
[[-60.76  34.3 ]
 [  0.     0.  ]]
>>> yhat = model.predict(x2, frequencies=freqs2)
>>> print("Dose\t N\tExpected Deaths")  
Dose     N      Expected Deaths
>>> for i in range(x2.size):
...     print("{0:4.2f}\t{1:2.1f}\t\t{2:5.2f}".format(x2[i],
...           freqs2[i],
...           yhat.predictions[i, 0]))  
1.66    16.0             0.34
1.87    22.0            21.28
1.71    11.0             1.19

Example 2:

A logistic regression model is fit to artificial (noisy) data with four classes and three predictor variables and used to predict class probabilities at 10 new values of the predictor variables. Also shown are the mean squared prediction error and upper and lower limits of the 95% prediction interval for each predicted value.

>>> import numpy as np
>>> import imsl.regression as reg
>>> from scipy.stats import chi2
>>> # Array of predictors
>>> x = np.array([[3, 25.92869, 1], [2, 51.63245, 2], [2, 25.78432, 1],
...               [1, 39.37948, 1], [3, 24.65058, 1], [3, 45.20084, 1],
...               [3, 52.6796, 2], [2, 44.28342, 2], [3, 40.63523, 2],
...               [3, 51.76094, 1], [3, 26.30368, 1], [3, 20.70230, 2],
...               [3, 38.74273, 2], [3, 19.47333, 1], [2, 26.42211, 1],
...               [3, 37.05986, 2], [2, 51.67043, 2], [1, 42.40156, 1],
...               [3, 33.90027, 2], [2, 35.43282, 1], [2, 44.30369, 1],
...               [1, 46.72387, 1], [2, 46.99262, 1], [1, 36.05923, 1],
...               [3, 36.83197, 2], [2, 61.66257, 2], [1, 25.67714, 1],
...               [2, 39.08567, 2], [1, 48.84341, 2], [2, 39.34391, 1],
...               [3, 24.73522, 1], [2, 50.55251, 2], [1, 31.34263, 2],
...               [2, 27.15795, 2], [1, 31.72685, 1], [1, 25.00408, 1],
...               [2, 26.35457, 2], [3, 38.12343, 1], [1, 49.9403, 1],
...               [2, 42.45779, 2], [1, 38.80948, 2], [1, 43.22799, 2],
...               [1, 41.87624, 1], [3, 48.0782, 1], [1, 43.23673, 2],
...               [3, 39.41294, 1], [2, 23.93346, 1], [3, 42.8413, 2],
...               [3, 30.40669, 1], [1, 37.77389, 1]])
>>> # Array of response IDs
>>> y = np.array([1, 2, 3, 4, 3, 3, 4, 4, 4, 4, 2, 1, 4, 1, 1, 1,
...               4, 4, 3, 1, 2, 3, 3, 4, 2, 3, 4, 1, 2, 4, 3, 4,
...               4, 1, 3, 4, 4, 2, 3, 4, 2, 2, 4, 3, 1, 4, 3, 4,
...               2, 3])
>>> newx = np.array([[2, 25.92869, 1], [2, 51.63245, 2],
...                  [1, 25.78432, 1], [3, 39.37948, 1],
...                  [3, 24.65058, 1], [3, 45.20084, 1],
...                  [2, 52.6796, 2], [3, 44.28342, 2],
...                  [3, 40.63523, 2], [3, 51.76094, 1]])
>>> newy = np.array([3, 2, 1, 1, 4, 3, 2, 2, 1, 2])
>>> n_predictors = 3
>>> n_classes = 4
>>> resp = reg.ResponseFormatID(y, n_classes)
>>> resp_new = reg.ResponseFormatID(newy, n_classes)
>>> model = reg.LogisticRegression(n_predictors, n_classes)
>>> model.fit(x, resp)
>>> yhat = model.predict(newx, responses=resp_new)
>>> n_coefs = model.n_coeffs
>>> lrstat = model.likeli_ratio_test_stat
>>> dof = n_coefs * (n_classes-1) - (n_classes-1)
>>> model_pval = 1.0 - chi2.cdf(lrstat, dof)
>>> print("Model Fit Summary:\n" +
...       "Log-likelihood: {0:5.2f}\n".format(model.log_likeli) +
...       "LR test statistic: {0:5.2f}\n".format(lrstat) +
...       "Degrees of freedom: {0:2d}\n".format(dof) +
...       "P-value: {0:5.4f}".format(model_pval))
... 
Model Fit Summary:
Log-likelihood: -58.58
LR test statistic: 16.37
Degrees of freedom:  9
P-value: 0.0595
>>> print("Prediction Summary:\n" +
...       "Mean squared prediction error: {0:4.2f}".format(
...                                                 yhat.pred_err))
... 
Prediction Summary:
Mean squared prediction error: 0.21
>>> print("Obs Class Estimate Lower Upper")
Obs Class Estimate Lower Upper
>>> for j in range(newx.shape[0]):
...     for i in range(n_classes):
...         print(" {0:2d}\t{1:d}     {2:4.2f}   {3:4.2f}  "
...               "{4:4.2f}".format(j+1, i+1,
...                                 yhat.predictions[j, i],
...                                 yhat.pred_limits[0][j, i],
...                                 yhat.pred_limits[1][j, i]))
... 
  1 1     0.26   0.14  0.35
  1 2     0.14   0.06  0.20
  1 3     0.31   0.18  0.36
  1 4     0.29   0.10  0.62
  2 1     0.04   0.01  0.14
  2 2     0.27   0.11  0.39
  2 3     0.12   0.04  0.25
  2 4     0.57   0.22  0.85
  3 1     0.23   0.07  0.38
  3 2     0.13   0.04  0.20
  3 3     0.28   0.12  0.34
  3 4     0.36   0.08  0.77
  4 1     0.06   0.02  0.14
  4 2     0.16   0.07  0.24
  4 3     0.49   0.28  0.54
  4 4     0.29   0.08  0.63
  5 1     0.34   0.17  0.41
  5 2     0.13   0.06  0.19
  5 3     0.30   0.17  0.34
  5 4     0.22   0.05  0.60
  6 1     0.03   0.00  0.09
  6 2     0.16   0.06  0.24
  6 3     0.53   0.27  0.60
  6 4     0.29   0.07  0.67
  7 1     0.04   0.01  0.13
  7 2     0.27   0.10  0.40
  7 3     0.13   0.04  0.26
  7 4     0.57   0.21  0.86
  8 1     0.14   0.04  0.26
  8 2     0.29   0.12  0.37
  8 3     0.12   0.04  0.21
  8 4     0.46   0.15  0.80
  9 1     0.21   0.08  0.33
  9 2     0.27   0.12  0.35
  9 3     0.10   0.03  0.19
  9 4     0.42   0.14  0.77
 10 1     0.01   0.00  0.05
 10 2     0.15   0.04  0.24
 10 3     0.57   0.23  0.67
 10 4     0.28   0.05  0.73