Parameter Estimation using Maximum Likelihood Estimation and Markov-Chain-Monte-Carlo

Reading Time: 4 minutes

 

In this blog article, we use an example to compare two approaches to estimating the parameters of a logistic regression: on the one hand, the parameters are estimated via Maximum Likelihood Estimation (MLE); on the other hand, via a Markov-Chain-Monte-Carlo (MCMC) simulation. In particular, the resulting confidence and credible intervals are compared.

The dataset used contains information on credit card clients from Taiwan. In addition to information on default payments, demographic details of the clients are available. A total of 7 features was selected. The goal is to estimate the parameters \bf{b} of a logistic function that models the probability of default:

(1)   \begin{equation*} f(\bf{X}, \bf{b}) = \frac{\exp(\sum_i\,X_i\,b_i)}{1 + \exp(\sum_i\,X_i\,b_i)}. \end{equation*}

The log-likelihood of observing the data \bf{X} given the model parameters \bf{b} is defined by:

(2)   \begin{equation*} \ln \left( \mathcal{L}(\bf{b}|\bf{X}) \right) = \sum_i \left[ (1 - y_i)\ln(1 - f(\bf{X}_i, \bf{b})) + y_i \ln(f(\bf{X}_i, \bf{b})) \right] - \tau ||\bf{b}||_2. \end{equation*}

where y_i is a binary variable indicating whether credit card client i has defaulted (y_i = 1) or not (y_i = 0). To stabilize the solution, a regularization term with strength \tau = 1 is subtracted from the log-likelihood. The bias introduced by the regularization is not investigated further in this article. It should, however, be noted that the regularization can lead to an unwanted correlation between the model parameters; this influence can be analyzed quantitatively to choose a suitable regularization strength.

import numpy as np

def logit(X, popt):
    ''' Logistic regression: probability of default for each observation. '''
    return np.exp(np.dot(X, popt)) / (1 + np.exp(np.dot(X, popt)))

def llh(popt, X, sgn = 1, tau = 1):
    ''' Regularized log-likelihood (sgn = 1) or its negative (sgn = -1). '''
    y = X[:,-1] # default 0/1

    # use sgn = 1 when maximizing the likelihood and sgn = -1 when minimizing
    llh = np.sum((1 - y) * np.log(1 - logit(X[:,:-1], popt)) + y * np.log(logit(X[:,:-1], popt))) \
        - tau * np.linalg.norm(popt)

    return sgn * llh

In practice, the negative log-likelihood is minimized (sgn = -1):

from scipy.optimize import minimize

# initial guess
p0 = np.ones(X.shape[-1] - 1) / 100

# minimize the negative log-likelihood; BFGS provides an approximate inverse Hessian
res = minimize(llh, p0, method = 'BFGS', args = (X, -1, 1), tol = 1e-3)

# standard deviations obtained from the inverted Hessian matrix
errors = np.sqrt(np.diag(res.hess_inv))

print(res.success)
>> True
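The standard deviations just computed define the one-sigma confidence intervals reported below. A minimal sketch of how they translate into intervals around the estimates (using the variables defined above; the printed format is purely illustrative):

# one-sigma confidence intervals around the MLE
lower = res.x - errors
upper = res.x + errors

for i, (b, lo, up) in enumerate(zip(res.x, lower, upper), start = 1):
    print('b_%d: %.4f  [%.4f, %.4f]' % (i, b, lo, up))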

The covariance matrix of the estimated parameters is obtained by inverting the Hessian matrix. Besides the frequentist approach just described, another way to estimate the parameters is an MCMC simulation, which generates samples from the posterior given a likelihood p(x|\theta), a prior p(\theta) and the evidence p(x).

The posterior probability distribution is defined by Bayes’ theorem:

(3)   \begin{equation*} p(\theta|x) = \frac{p(x|\theta)\,p(\theta)}{p(x)} \propto p(x|\theta)\,p(\theta). \end{equation*}

The best-known algorithm for drawing samples from the posterior distribution is the Metropolis-Hastings algorithm. A random walker moves through the model parameter space, and each proposed position is accepted or rejected, respecting the detailed balance condition, according to the posterior probability at that position. The algorithm used in this article employs an ensemble of random walkers (an affine-invariant ensemble sampler [2]) and is implemented in the Python library emcee [1]. Since only relative, not absolute, probabilities are required, the evidence p(x), which is not a function of the model parameters, can be neglected. Using a flat prior, the MCMC simulation is implemented as follows:

def prior(popt):
    ''' Flat prior: constant log-probability. '''
    return 0

def log_fn(popt, X, sgn = 1):
    ''' Log posterior (up to the evidence). '''
    l = llh(popt, X, sgn)
    p = prior(popt)

    return l + p

import emcee

# mcmc setup
ndim = X.shape[-1] - 1 # only features
nwalkers = 100
n_steps = 1000
n_burnin = 200

# initialize the walkers in a small ball around the MLE solution
noise = np.random.normal(size = (nwalkers, ndim)) / 100
initial_state = np.tile(res.x, (nwalkers, 1)) * (1 + noise)

# initialize sampler with the log posterior
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob_fn = log_fn, args = [X, 1])

# burn-in-phase
state = sampler.run_mcmc(initial_state, n_burnin, progress = True)
sampler.reset()

# mcmc
sampler.run_mcmc(state, n_steps, progress = True)
chain = sampler.get_chain(flat = True)

The walkers are randomly initialized at the positions initial_state, i.e. at small perturbations of the MLE solution. The noise is necessary so that the positions of the walkers are linearly independent of one another. In order to let the walkers explore the parameter space, a burn-in phase of 200 steps is performed before the actual simulation.
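Before interpreting the results, it is worth running a quick sanity check on the sampling, for example via the mean acceptance fraction and the integrated autocorrelation time. A minimal sketch using the emcee API (the quoted acceptance range is a common rule of thumb, not part of the original analysis):

# mean acceptance fraction across all walkers (roughly 0.2 - 0.5 is usually considered healthy)
print('mean acceptance fraction:', np.mean(sampler.acceptance_fraction))

# integrated autocorrelation time per parameter; the chain should be much longer than this
try:
    print('autocorrelation time:', sampler.get_autocorr_time())
except Exception as err:
    # emcee raises an error if the chain is too short for a reliable estimate
    print('autocorrelation time not reliable yet:', err)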

After drawing samples from the posterior it is now possible to histogram the samples projected into the model parameter subspace given by \bf{b}.
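One simple way to produce such histograms is matplotlib (a minimal sketch, assuming matplotlib is installed; the parameter index 4, i.e. b_5, is chosen arbitrarily; the corner package is a popular alternative for visualizing all pairwise projections at once):

import matplotlib.pyplot as plt

# posterior histogram for one parameter, overlaid with the MLE and its one-sigma interval
idx = 4
plt.hist(chain[:, idx], bins = 50, density = True, color = 'C0', alpha = 0.7)
plt.axvline(res.x[idx], color = 'C1', label = 'MLE')
plt.axvspan(res.x[idx] - errors[idx], res.x[idx] + errors[idx], color = 'C1', alpha = 0.3)
plt.xlabel('b_%d' % (idx + 1))
plt.legend()
plt.show()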

Figure 1: Maximum Likelihood Estimates (orange) with one-sigma confidence interval (shaded orange). Distribution of drawn samples using MCMC simulation (blue). Four arbitrarily selected parameters are illustrated.

As can be seen in Figure 1, the estimated credible and confidence intervals show a remarkable agreement. This behavior was to be expected, since no prior belief was incorporated into the posterior: with a uniform prior, all model parameters are initially considered equally likely. The differences between frequentist and Bayesian reasoning are fundamental; a good starting point is [4]. In frequentist (or classical) inference, probabilities are interpreted as long-run frequencies, while in Bayesian inference they are interpreted as degrees of belief to be examined and updated. From a Bayesian perspective, a parameter is considered a random variable with an underlying distribution. Moreover, probabilities can be assigned to non-repeatable events, which is not meaningful from a classical point of view.

Figure 2: Trace plot (left) after burn-in phase and posterior distribution (right) for parameter b_5.
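A trace plot like the one in Figure 2 can be reproduced from the unflattened chain (a minimal sketch, reusing the matplotlib import from the sketch above; the parameter index is again arbitrary):

# trace plot: one line per walker over the sampling steps
trace = sampler.get_chain() # shape (n_steps, nwalkers, ndim)
plt.plot(trace[:, :, 4], color = 'C0', alpha = 0.1)
plt.xlabel('step')
plt.ylabel('b_5')
plt.show()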

The estimated parameters given by the median of the posterior and by the MLE are shown in Table 1. To compare the estimated uncertainties, Table 2 shows the one-sigma confidence intervals from the MLE (denoted as sigma) and the averaged credible interval q = 0.5\,(q_{84} - q_{16}) from the MCMC simulation (denoted as q), where q_{16} and q_{84} are the 16th and 84th percentiles of the posterior samples.
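Both tables can be computed directly from the flattened chain and the optimizer output (a minimal sketch using the variables defined above):

# posterior medians and averaged one-sigma credible intervals from the flat chain
q16, q50, q84 = np.percentile(chain, [16, 50, 84], axis = 0)

b_mcmc = q50               # point estimates (posterior medians), cf. Table 1
q_mcmc = 0.5 * (q84 - q16) # averaged credible interval, cf. Table 2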

Table 1: Estimated parameters using MLE and MCMC simulation.

          b_1         b_2         b_3        b_4       b_5       b_6       b_7
MLE   -0.148914   -0.060551   -0.2001     0.624384  0.204921  -4.04795  0.473678
MCMC  -0.150952   -0.0609509  -0.200182   0.6247    0.204633  -3.99894  0.425098

Table 2: Standard deviations obtained from the MLE and averaged credible intervals from the MCMC simulation.

           b_1         b_2         b_3        b_4        b_5        b_6       b_7
sigma   0.0270297   0.0184332   0.0263032  0.0175305  0.0156746  0.390784  0.390283
q       0.0276888   0.0185238   0.0259986  0.017285   0.0145814  0.643795  0.680995

The parameter estimates obtained with both approaches are similar. However, the credible intervals (q) for the parameters b_6 and b_7 are considerably wider than the associated one-sigma confidence intervals (sigma).

If you are interested in the entire code and would like to reproduce the results, please send us an e-mail.

Resources

[1] Foreman-Mackey, D. et al. “emcee: The MCMC Hammer.” In: Publications of the Astronomical Society of the Pacific 125.925 (Mar. 2013), pp. 306–312.

[2] Goodman, J. and Weare, J. “Ensemble samplers with affine invariance.” In: Commun. Appl. Math. Comput. Sci. 5.1 (2010), pp. 65–80.

[3] Shalizi, C. “Advanced Methods of Data Analysis.” (2012), pp. 223-228.

URL: https://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch12.pdf

[4] Wasserman, L. and Liu, H. “Statistical Machine Learning.” (2015), pp. 299-351.

URL: http://www.stat.cmu.edu/~larry/=sml/Bayes.pdf
