Business Analysis Module User's Guide

3.3 Logistic Regression

Logistic regression is used to model the relationship between a binary response variable and one or more predictor variables, which may be either discrete or continuous. Binary outcome data is common in medical applications. For example, the binary response variable might be whether or not a patient is alive five years after treatment for cancer or whether the patient has an adverse reaction to a new drug. As in multiple regression, we are interested in finding an appropriate combination of predictor variables to help explain the binary outcome.

Let Y be a dichotomous random variable denoting the outcome of some experiment, and let X = (x1, x2, ... , xp – 1) be a collection of predictor variables. Denote the conditional probability that the outcome is present by P(Y = 1|x) = π(x), where π(x) has the form:

\pi(x) = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1}}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1}}}
If the xj are varied and the n values Y1, Y2, ... , Yn of Y are observed, we write:

\pi_i = \pi(x_i) = P(Y_i = 1 \mid x_i), \qquad i = 1, 2, \ldots, n
The logistic regression problem is then to obtain an estimate of the vector:

\beta = (\beta_0, \beta_1, \ldots, \beta_{p-1})
As with multiple linear regression, the matrix:

X = \begin{pmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1,p-1} \\
1 & x_{21} & x_{22} & \cdots & x_{2,p-1} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{n,p-1}
\end{pmatrix}
is called the regression matrix, while the matrix R, containing only the data for the predictor variables (matrix X without the leading column of 1s), is called the predictor data matrix.
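
For illustration, a minimal Python/NumPy sketch of evaluating π(x) for every row of a predictor data matrix R follows. It is independent of the Business Analysis Module's API; the function name and arguments are hypothetical.

    import numpy as np

    def predicted_probabilities(R, beta):
        # R    : (n, p-1) predictor data matrix (no column of 1s)
        # beta : (p,) parameter vector (beta_0, ..., beta_{p-1})
        X = np.column_stack([np.ones(len(R)), R])  # regression matrix: leading column of 1s
        eta = X @ beta                             # linear predictor beta_0 + beta_1*x_1 + ...
        return np.exp(eta) / (1.0 + np.exp(eta))   # pi(x) = e^eta / (1 + e^eta)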

3.3.1 Parameter Calculation

The method used to find the parameter estimates is the method of maximum likelihood. Specifically, β̂ is the value that maximizes the likelihood function:

l(\beta) = \prod_{i=1}^{n} \pi_i^{Y_i} (1 - \pi_i)^{1 - Y_i}
The log of this equation is called the log likelihood, and is defined as:

L(\beta) = \ln l(\beta) = \sum_{i=1}^{n} \left[ Y_i \ln \pi_i + (1 - Y_i) \ln(1 - \pi_i) \right]
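To make the estimation concrete, here is a short Python/NumPy sketch that evaluates the log likelihood and maximizes it with Newton-Raphson iteration, a standard choice; the guide does not state which maximization algorithm the module itself uses, and all names here are illustrative.

    import numpy as np

    def log_likelihood(X, y, beta):
        # L(beta) = sum_i [Y_i ln(pi_i) + (1 - Y_i) ln(1 - pi_i)]
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        return np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))

    def fit_logistic(X, y, tol=1e-8, max_iter=25):
        # Newton-Raphson ascent on L(beta); X is the regression matrix
        # (leading column of 1s), y holds the 0/1 outcomes.
        beta = np.zeros(X.shape[1])
        for _ in range(max_iter):
            pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
            w = pi * (1 - pi)                 # diagonal of V (see Section 3.3.2)
            grad = X.T @ (y - pi)             # gradient of L(beta)
            hess = X.T @ (X * w[:, None])     # X^T V X
            step = np.linalg.solve(hess, grad)
            beta = beta + step
            if np.max(np.abs(step)) < tol:
                break
        return beta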
3.3.2 Parameter Variances and Covariances

Estimates for the variances and covariances of the estimated parameters are computed using the following equations.

Let X be the regression matrix, and let V be an n × n diagonal matrix with ith diagonal term πi(1 – πi). That is, the matrix X is:

X = \begin{pmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1,p-1} \\
1 & x_{21} & x_{22} & \cdots & x_{2,p-1} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{n,p-1}
\end{pmatrix}

and the matrix V is:

V = \begin{pmatrix}
\pi_1(1 - \pi_1) & 0 & \cdots & 0 \\
0 & \pi_2(1 - \pi_2) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \pi_n(1 - \pi_n)
\end{pmatrix}

Denote:

\hat{\Sigma} = (X^T \hat{V} X)^{-1}

where V̂ is V evaluated at the fitted probabilities π̂i. The estimate of the variance of β̂j is then the jth diagonal term of the matrix Σ̂, and the off-diagonal terms of Σ̂ are the covariance estimates for β̂i and β̂j.
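
The computation can be sketched directly from these definitions; again, this is illustrative Python/NumPy, not the module's API.

    import numpy as np

    def covariance_matrix(X, beta_hat):
        # (X^T V X)^{-1} with V evaluated at the fitted probabilities
        pi = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))
        V = np.diag(pi * (1 - pi))          # n x n diagonal matrix
        return np.linalg.inv(X.T @ V @ X)   # diagonal: variances; off-diagonal: covariances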

3.3.3 Significance of the Model

In practice, several different measures exist for determining the significance, or goodness of fit, of a logistic regression model. These measures include the G statistic, Pearson statistic, and Hosmer-Lemeshow statistic. In a theoretical sense, all three measures are equivalent. To be more precise, as the number of rows in the predictor matrix goes to infinity, all three measures converge to the same estimate of model significance. However, for any practical regression problem with a finite number of rows in the predictor matrix, each measure produces a different estimate.

In practice, a model designer commonly consults more than one measure. If any single measure indicates a poor fit, or if the measures differ greatly in their assessments of significance, the designer goes back and improves the regression model.

3.3.3.1 G Statistic

Perhaps the most straightforward measure of goodness of fit is the G statistic, also referred to as the "likelihood ratio test." It is a close analogue to the F statistic for linear regression. Both the F statistic and the G statistic measure a difference in deviance between two models. For logistic regression, the deviance of a model is defined as:

D = -2 \sum_{i=1}^{n} \left[ Y_i \ln\left(\frac{\hat{\pi}_i}{Y_i}\right) + (1 - Y_i) \ln\left(\frac{1 - \hat{\pi}_i}{1 - Y_i}\right) \right]
To determine the overall significance of a model using the G statistic, the deviance of the fitted model is subtracted from the deviance of the intercept-only model:

G = D_{\text{intercept only}} - D_{\text{model}}

The larger the difference, the greater the evidence that the model is significant. The G statistic follows a chi-squared distribution with p – 1 degrees of freedom, where p is the number of parameters in the model. Significance tests based on this distribution are supported in the Business Analysis Module.
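
Assuming binary 0/1 outcomes, for which the saturated-model terms in D vanish and the deviance reduces to –2 times the log likelihood, a sketch of the G test in Python/SciPy might look like the following; the function names are hypothetical.

    import numpy as np
    from scipy.stats import chi2

    def deviance(X, y, beta_hat):
        # D = -2 L(beta_hat) for 0/1 outcomes
        pi = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))
        return -2.0 * np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))

    def g_test(X, y, beta_hat):
        # G = D(intercept only) - D(model); chi-squared with p - 1 df
        ybar = y.mean()
        beta0 = np.array([np.log(ybar / (1.0 - ybar))])  # ML fit of the intercept-only model
        ones = np.ones((len(y), 1))
        G = deviance(ones, y, beta0) - deviance(X, y, beta_hat)
        return G, chi2.sf(G, df=X.shape[1] - 1)          # statistic and p-value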

3.3.3.2 Pearson Statistic

The Pearson statistic is a model significance measure based more directly on residual prediction errors. In the most straightforward implementation of the Pearson statistic, the predictor matrix rows are placed into J groups such that identical rows are placed in the same group. Then the Pearson statistic is obtained by summing over all J groups:

\chi^2 = \sum_{j=1}^{J} \frac{(o_j - m_j \pi_j)^2}{m_j \pi_j (1 - \pi_j)}

where oj is the number of positive observations for group j, πj is the model's predicted value for the rows in group j, and mj is the number of identical rows in the group. The Pearson statistic follows a chi-squared distribution with J – p – 1 degrees of freedom, where p is the number of parameters in the model. Significance tests based on this distribution are supported in the Business Analysis Module.
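
A sketch of this basic J-group computation in Python/NumPy (without the additional grouping of continuous predictors described next) might look like this; names are illustrative.

    import numpy as np
    from scipy.stats import chi2

    def pearson_test(X, y, beta_hat):
        # Identical rows of X share one group j with m_j rows and o_j positives.
        pi_all = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))
        _, inverse = np.unique(X, axis=0, return_inverse=True)
        J = inverse.max() + 1
        stat = 0.0
        for j in range(J):
            mask = inverse == j
            m_j = mask.sum()            # number of identical rows in group j
            o_j = y[mask].sum()         # positive observations in group j
            pi_j = pi_all[mask][0]      # identical rows get identical predictions
            stat += (o_j - m_j * pi_j) ** 2 / (m_j * pi_j * (1 - pi_j))
        return stat, chi2.sf(stat, df=J - X.shape[1] - 1)  # J - p - 1 df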

Because the accuracy of this statistic is poor when predictor variable data are continuous-valued, our implementation obtains the statistic by grouping the predictor variable data. In other words, the data values for each predictor variable are replaced with integer group values, the logistic regression parameters are recalculated, and the statistic is obtained from the resulting model. This tends to make the value of J much smaller, and the Pearson statistic becomes more accurate. In the Business Analysis Module, the default number of groups for each predictor variable is 2.

3.3.3.3 Hosmer-Lemeshow Statistic

The Hosmer-Lemeshow statistic takes an alternative approach to grouping: it groups the predictions of a logistic regression model rather than the model's predictor variable data, as the Pearson statistic does. In the implementation found in the Business Analysis Module, model predictions are split into G bins that are filled as evenly as possible, an approach sometimes called "equal mass binning." Then the statistic is computed as:

\hat{C} = \sum_{j=1}^{G} \frac{(o_j - n_j \bar{\pi}_j)^2}{n_j \bar{\pi}_j (1 - \bar{\pi}_j)}

where oj is the number of positive observations in group j, π̄j is the model's average predicted value in group j, and nj is the size of the group. The Hosmer-Lemeshow statistic follows a chi-squared distribution with G – 2 degrees of freedom. In the Business Analysis Module, the default value for G is 10.
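
A Python/NumPy sketch of this equal-mass binning follows; the function name is hypothetical.

    import numpy as np
    from scipy.stats import chi2

    def hosmer_lemeshow(y, pi_hat, G=10):
        # Sort the predictions and split them into G bins as evenly as possible.
        order = np.argsort(pi_hat)
        stat = 0.0
        for idx in np.array_split(order, G):
            n_j = len(idx)              # size of group j
            o_j = y[idx].sum()          # positive observations in group j
            pi_j = pi_hat[idx].mean()   # average predicted value in group j
            stat += (o_j - n_j * pi_j) ** 2 / (n_j * pi_j * (1 - pi_j))
        return stat, chi2.sf(stat, df=G - 2)  # G - 2 df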

3.3.4 Parameter Significance (Wald Test)

For each estimated parameter β̂j, the Wald chi-square statistic is the quantity:

W_j = \frac{\hat{\beta}_j^2}{\widehat{\mathrm{Var}}(\hat{\beta}_j)}

where Var̂(β̂j) is the estimated variance of β̂j as defined in Section 3.3.2.
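
Given the covariance matrix of Section 3.3.2, the statistics for all parameters can be computed at once; a brief Python/NumPy sketch with hypothetical names:

    import numpy as np

    def wald_statistics(beta_hat, cov):
        # W_j = beta_hat_j^2 / Var(beta_hat_j); the variances are the diagonal of cov
        return beta_hat ** 2 / np.diag(cov)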

3.3.4.1 p-Values

The p-value for each parameter estimate is the probability of observing a Wald chi-square statistic at least as extreme as the value calculated with the above formula if the hypothesis βj = 0 is true. Under this hypothesis, the Wald chi-square statistic approximately follows a chi-squared distribution with one degree of freedom. Note that in general the sample size must be large in order for the p-value to be accurate.

3.3.4.2 Critical Values

The critical values, vj, for the parameter estimates are the levels at which, if the Wald chi-square statistic calculated for a given β̂j is greater than vj, we reject the hypothesis βj = 0 at the specified significance level.
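
Both the p-values and the critical values follow from the chi-squared distribution with one degree of freedom; a SciPy sketch, again with hypothetical names:

    from scipy.stats import chi2

    def wald_p_values(W):
        # P(chi2_1 >= W_j): probability of a statistic at least this extreme when beta_j = 0
        return chi2.sf(W, df=1)

    def wald_critical_value(alpha=0.05):
        # v_j such that beta_j = 0 is rejected when W_j > v_j at significance level alpha
        return chi2.ppf(1.0 - alpha, df=1)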



