Maximum likelihood estimation

In statistics, maximum-likelihood estimation (MLE) is a method of estimating the parameters of a statistical model: given a statistical model and a data set, it provides estimates for the model's parameters.

The method of maximum likelihood corresponds to many well-known estimation methods in statistics. For example, one may be interested in the heights of adult female giraffes but be unable, due to cost or time constraints, to measure the height of every single giraffe in the population. Assuming that the heights are normally (Gaussian) distributed with some unknown mean and variance, these parameters can be estimated with MLE while only knowing the heights of some sample of the overall population. MLE accomplishes this by treating the mean and variance as parameters and finding the particular parameter values that make the observed data the most probable (given the model).

In general, for a fixed set of data and underlying statistical model, the method of maximum likelihood selects values of the model parameters that produce a distribution that gives the observed data the greatest probability (i.e., parameters that maximize the likelihood function). Maximum-likelihood estimation gives a unified approach to estimation, which is well-defined in the case of the normal distribution and many other problems. However, in some complicated problems, difficulties do occur: in such problems, maximum-likelihood estimators are unsuitable or do not exist.

This page presents some likelihood functions for parameter estimation in R. In particular, I illustrate how the parameters of the normal (Gaussian), Gamma, Cauchy and multivariate normal distributions can be estimated from a sample. The negative log-likelihood function is minimised using optim {stats}, which provides general-purpose optimization based on the Nelder-Mead, quasi-Newton and conjugate-gradient algorithms.


The normal distribution

The normal (or Gaussian) distribution is a continuous probability distribution that is often used as a first approximation to describe real-valued random variables that tend to cluster around a single mean value. The graph of the associated probability density function is "bell"-shaped, and is known as the Gaussian function or bell curve.
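A minimal sketch of maximum-likelihood estimation for the normal model (the simulated giraffe heights, sample size and starting values are purely illustrative):

```r
set.seed(1)
# Simulated heights of adult female giraffes (illustrative values):
# true mean 4.6 m, true standard deviation 0.3 m
x <- rnorm(1000, mean = 4.6, sd = 0.3)

# Negative log-likelihood of the normal model;
# par[1] = mean, par[2] = standard deviation
nll_norm <- function(par, data) {
  if (par[2] <= 0) return(1e10)  # sd must stay positive
  -sum(dnorm(data, mean = par[1], sd = par[2], log = TRUE))
}

# Minimise the negative log-likelihood; hessian = TRUE stores the
# Hessian matrix so that standard errors can be derived afterwards
mle <- optim(par = c(4, 1), fn = nll_norm, data = x, hessian = TRUE)

mle$par                          # estimated mean and standard deviation
sqrt(diag(solve(mle$hessian)))   # their standard errors
```

The later sections reuse this same recipe, changing only the density inside the negative log-likelihood.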

Hessian matrix and standard errors: Note that standard errors of the parameter estimates can be obtained from the inverse of the Hessian matrix. The Hessian matrix describes the local curvature of the likelihood function at the estimated parameter values and therefore indicates their stability. To calculate the Hessian matrix, specify hessian=TRUE in the optim command. The standard errors of the estimated mean and standard deviation in the code above are then given by sqrt(diag(solve(mle$hessian))[1]) and sqrt(diag(solve(mle$hessian))[2]), respectively.

The Gamma distribution

The Gamma distribution is a two-parameter family of continuous probability distributions, described by a shape and a scale parameter. It is related to the beta distribution and frequently used for modelling waiting times or variables that are highly skewed (similar to the lognormal distribution). Special cases of the Gamma distribution include the exponential, chi-squared and Erlang distributions. The example below estimates the shape and rate parameters of the Gamma distribution. Note that the rate parameter is the inverse of the scale parameter.
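A minimal sketch (simulated data and starting values are illustrative); the shape and rate are optimised on the log scale so that they remain positive:

```r
set.seed(2)
# Simulated sample with true shape 2 and true rate 0.5
x <- rgamma(2000, shape = 2, rate = 0.5)

# Negative log-likelihood; logpar = (log shape, log rate), so the
# optimizer works on an unconstrained scale
nll_gamma <- function(logpar, data) {
  -sum(dgamma(data, shape = exp(logpar[1]), rate = exp(logpar[2]),
              log = TRUE))
}

mle <- optim(par = log(c(1, 1)), fn = nll_gamma, data = x, hessian = TRUE)

exp(mle$par)  # estimated shape and rate
```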

The Cauchy distribution

The Cauchy-Lorentz distribution (also known as the Lorentz distribution, Lorentzian function, or Breit-Wigner distribution) is a continuous probability distribution with two parameters, location and scale. It has a shape similar to that of the normal distribution but heavier tails. This makes estimation based on it less sensitive to outliers, so it may be used for robust parameter estimation (e.g. as a robust alternative to the Gaussian family in small samples). Other well-known applications of the Cauchy distribution arise in physics and mathematics.
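A minimal sketch along the same lines (simulated data; the starting values exploit the facts that the sample median estimates the location and half the interquartile range estimates the scale of a Cauchy sample):

```r
set.seed(3)
# Simulated sample with true location 10 and true scale 2
x <- rcauchy(3000, location = 10, scale = 2)

# Negative log-likelihood; par[1] = location, par[2] = log scale
nll_cauchy <- function(par, data) {
  -sum(dcauchy(data, location = par[1], scale = exp(par[2]), log = TRUE))
}

# Robust starting values: median for location, half the IQR for scale
mle <- optim(par = c(median(x), log(IQR(x) / 2)), fn = nll_cauchy,
             data = x, hessian = TRUE)

c(location = mle$par[1], scale = exp(mle$par[2]))
```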

Multivariable linear regression

Regression analysis is a statistical technique that attempts to explore and model the relationship between two or more variables. For example, an analyst may want to know if there is a relationship between road accidents and the age of the driver. A linear regression model attempts to explain the relationship between a response and one or more explanatory variables using a linear function.
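A minimal sketch of maximum-likelihood estimation for a linear model with normal errors (the data, including the hypothetical predictors `age` and `exposure`, are simulated for illustration); the result can be checked against lm():

```r
set.seed(4)
n <- 500
age <- runif(n, 18, 80)
exposure <- runif(n, 0, 10)          # hypothetical second predictor
accidents <- 5 - 0.03 * age + 0.4 * exposure + rnorm(n, sd = 1)

# Negative log-likelihood of the linear model with normal errors;
# par = (intercept, coefficient of age, coefficient of exposure,
# log residual standard deviation)
nll_lm <- function(par, y, X) {
  mu <- X %*% par[1:3]
  -sum(dnorm(y, mean = mu, sd = exp(par[4]), log = TRUE))
}

X <- cbind(1, age, exposure)
mle <- optim(par = c(0, 0, 0, 0), fn = nll_lm, y = accidents, X = X,
             method = "BFGS", hessian = TRUE)

mle$par[1:3]                           # ML coefficient estimates
coef(lm(accidents ~ age + exposure))   # least-squares estimates agree
```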

Multivariable logistic regression

Please see this chapter.

Multivariate normal distribution

The multivariate normal distribution or multivariate Gaussian distribution is a widely used standard joint distribution. It is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. A random vector is said to be multivariate normally distributed if every linear combination of its components has a univariate normal distribution. The multivariate normal distribution is often used to describe, at least approximately, any set of (possibly) correlated real-valued random variables each of which clusters around a mean value.
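A base-R sketch of the bivariate case (simulated data; parameterising the standard deviations on the log scale and the correlation via atanh keeps all optimizer parameters unconstrained — these choices are illustrative):

```r
set.seed(5)
n <- 1000
z1 <- rnorm(n); z2 <- rnorm(n)
y1 <- 1 + 2 * z1            # mean 1, standard deviation 2
y2 <- -1 + z1 + z2          # mean -1, correlated with y1

# Bivariate normal negative log-likelihood written out in base R;
# par = (mu1, mu2, log sd1, log sd2, atanh(rho))
nll_mvn <- function(par, y1, y2) {
  mu1 <- par[1]; mu2 <- par[2]
  s1 <- exp(par[3]); s2 <- exp(par[4]); rho <- tanh(par[5])
  u <- (y1 - mu1) / s1
  v <- (y2 - mu2) / s2
  q <- u^2 - 2 * rho * u * v + v^2    # Mahalanobis-type quadratic form
  -sum(-log(2 * pi) - log(s1) - log(s2) - 0.5 * log(1 - rho^2) -
         q / (2 * (1 - rho^2)))
}

mle <- optim(par = c(0, 0, 0, 0, 0), fn = nll_mvn, y1 = y1, y2 = y2,
             method = "BFGS", hessian = TRUE)

c(mle$par[1:2], exp(mle$par[3:4]), tanh(mle$par[5]))  # means, sds, correlation
```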

The Cholesky decomposition: Note that for small samples, estimated covariance matrices may fail to be positive definite. A possible solution is to use the Cholesky decomposition. The Cholesky decomposition or Cholesky triangle is a decomposition of a Hermitian, positive-definite matrix into the product of a lower triangular matrix and its conjugate transpose. It was discovered by André-Louis Cholesky for real matrices and is an example of a square root of a matrix. The decomposition ensures that estimated covariance matrices remain positive definite.
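A minimal base-R sketch of this idea for a zero-mean bivariate normal: the covariance is parameterised through its Cholesky factor, with logs on the diagonal entries, so every parameter vector yields a valid positive-definite covariance (the simulated data are illustrative):

```r
set.seed(6)
n <- 500
Lt <- matrix(c(1.5, 0.8, 0, 1.2), 2, 2)   # true lower-triangular factor
Y <- matrix(rnorm(2 * n), n, 2) %*% t(Lt) # covariance Lt %*% t(Lt)

# par = (log l11, l21, log l22): the Cholesky factor of the covariance,
# with the diagonal entries kept positive via the log transform
nll_chol <- function(par, Y) {
  L <- matrix(c(exp(par[1]), par[2], 0, exp(par[3])), 2, 2)
  Sigma <- L %*% t(L)
  q <- rowSums((Y %*% solve(Sigma)) * Y)  # Mahalanobis terms
  -sum(-log(2 * pi) - 0.5 * log(det(Sigma)) - 0.5 * q)
}

mle <- optim(par = c(0, 0, 0), fn = nll_chol, Y = Y, method = "BFGS")

Lhat <- matrix(c(exp(mle$par[1]), mle$par[2], 0, exp(mle$par[3])), 2, 2)
Lhat %*% t(Lhat)  # estimated covariance; compare with cov(Y)
```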

Multivariate skew normal distribution

The multivariate skew normal distribution (MVSN) is a continuous probability distribution that generalises the multivariate normal distribution to allow for asymmetry. It is described by three parameters: location, scale (covariance) and skewness. Skewness appears occasionally in real data, as pointed out by Hill & Dixon (1982). The image below illustrates the multivariate skew normal distribution.


Unfortunately, there is no explicit analytical solution for the parameter estimates of the MVSN; the maximum likelihood estimates therefore have to be obtained numerically. We consider the specific scenario described by Sahu et al. (2003) in which the covariance matrix is diagonal. The R code below was written with the help of Sylvia Soltyk.
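As a simplified stand-in for the full multivariate code, the base-R sketch below fits a univariate skew normal distribution (in Azzalini's location-scale-skewness parameterisation, not the Sahu et al. multivariate form) by numerically minimising its negative log-likelihood; all values are illustrative:

```r
set.seed(7)
n <- 2000
alpha <- 4                               # true skewness
delta <- alpha / sqrt(1 + alpha^2)
z0 <- abs(rnorm(n)); z1 <- rnorm(n)
# Stochastic representation of SN(0, 1, alpha), then shift and scale:
# true location 2, true scale 1.5
x <- 2 + 1.5 * (delta * z0 + sqrt(1 - delta^2) * z1)

# Skew normal log-density: log 2 - log(omega) + log phi(z) + log Phi(alpha z)
# with z = (x - xi) / omega; par = (xi, log omega, alpha)
nll_sn <- function(par, data) {
  z <- (data - par[1]) / exp(par[2])
  -sum(log(2) - par[2] + dnorm(z, log = TRUE) +
         pnorm(par[3] * z, log.p = TRUE))
}

mle <- optim(par = c(mean(x), log(sd(x)), 1), fn = nll_sn, data = x,
             method = "BFGS", hessian = TRUE)

c(xi = mle$par[1], omega = exp(mle$par[2]), alpha = mle$par[3])
```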

Below, we illustrate how the multivariate skew normal distribution can be altered to allow non-diagonal covariance matrices. Note that parameter estimation may become computationally demanding, as the integrals in the density function are no longer independent.

Multivariate skew normal mixture models

A finite mixture of distributions, in particular one with normal components, has received considerable attention and is a powerful tool for modeling a wide variety of random phenomena. Most importantly, mixture modeling has become a popular model-based technique for supervised classification and unsupervised clustering. This section illustrates mixture modeling based on the class of multivariate skew normal distributions proposed by Sahu et al. The corresponding algorithm was developed by Tsung-I Lin and obtains maximum likelihood estimates of the model parameters iteratively via the expectation-maximization algorithm.

Expectation-maximization (EM) is a method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models where the model depends on unobserved latent variables. EM is an iterative method that alternates between an expectation (E) step, which computes the expectation of the log-likelihood evaluated at the current parameter estimates, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found in the E step. These parameter estimates are then used to determine the distribution of the latent variables in the next E step.
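As a simplified illustration of these two steps, the base-R sketch below runs EM for a univariate two-component normal mixture (not the multivariate skew normal mixture itself; the data and number of iterations are illustrative):

```r
set.seed(8)
# Two well-separated normal components with mixing weights 0.6 / 0.4
x <- c(rnorm(600, mean = 0, sd = 1), rnorm(400, mean = 5, sd = 1.5))

em_mix <- function(x, iter = 200) {
  # crude but reasonable starting values
  pi1 <- 0.5
  mu <- as.numeric(quantile(x, c(0.25, 0.75)))
  s <- rep(sd(x), 2)
  for (i in seq_len(iter)) {
    # E step: posterior probability that each point belongs to component 1
    d1 <- pi1 * dnorm(x, mu[1], s[1])
    d2 <- (1 - pi1) * dnorm(x, mu[2], s[2])
    tau <- d1 / (d1 + d2)
    # M step: weighted updates of the mixing weight, means and sds
    pi1 <- mean(tau)
    mu <- c(sum(tau * x) / sum(tau), sum((1 - tau) * x) / sum(1 - tau))
    s  <- c(sqrt(sum(tau * (x - mu[1])^2) / sum(tau)),
            sqrt(sum((1 - tau) * (x - mu[2])^2) / sum(1 - tau)))
  }
  list(pi1 = pi1, mu = mu, sd = s)
}

fit <- em_mix(x)
fit$mu  # close to the true component means 0 and 5
```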

