Logistic Regression
Logistic regression is a type of regression used when the dependant variable is binary or ordinal (e.g. when the outcome is either "dead" or "alive"). It is commonly used for predicting the probability of occurrence of an event, based on several predictor variables that may either be numerical or categorical. For example, suppose a researcher is interested in how Graduate Record Exam scores (GRE) and grade point average (GPA) effect admission into graduate school. By deriving a logistic regression model from previously observed admissions (we will use an hypothetical dataset from the UCLA Academic Technology Services here), it becomes possible to predict future admissions.
mydata = read.csv(url('http://www.ats.ucla.edu/stat/r/dae/binary.csv'))
In this dataset, gre, gpa and rank represent the predictor variables, and admit the outcome. A typical regression analysis using pre-established packages from R could then be applied as follows:
mylogit = glm(admit~gre+gpa+as.factor(rank), family=binomial, data=mydata)
However, in order to understand the mechanisms of logistic regression we can write out its likelihood function. We will employ maximum likelihood estimation (MLE) to find the optimal parameters values, here represented by the unknown regression coefficients:
We can implement this function as follows:
mydata$rank = factor(mydata$rank) #Treat rank as a categorical variable
fmla = as.formula("admit~gre+gpa+rank") #Create model formula
mylogit = mle.logreg(fmla, mydata) #Estimate coefficients
mylogit
Note that the categorical variable rank is modeled as a factor. This implies that a separate regression coefficient is estimated for ranks 2, 3 and 4 (with rank 1 as reference). We can validate the resulting model by using the package rms which was developed by Prof. Frank Harrell. Because there is no independent dataset available, we use the bootstrap procedure. Note that this approach yields overoptimistic results.
library(rms) dfrTmp = model.frame(mydata) x = as.matrix(model.matrix(fmla, data=dfrTmp)) y = as.matrix(mydata[,"admit"]) logit = x%*%mylogit$beta val.prob(y=y,logit=logit)
The resulting calibration plot is given below, and illustrates the extent to which the predicted and actual probabilities agree. Note that the discrimination (as indicated by the C-statistic) and calibration (as indicated by the calibration slope) of the model is rather poor. This may indicate that the included variables do not sufficiently predict the outcome.
Instead of obtaining the observed information matrix from the numerically differentiated Hessian matrix (through the optim-command), it is possible to calculate an unbiased estimate directly from the data:
Finally, in some scenarios it is necessary to constrain the parameter search space. For instance, in stacked regressions it is important to put non-negative constraints on the regression slopes. This can be achieved by a small modification in the optim-command: