In this blog article, we’ll be using an example to compare two approaches to estimating the parameters of a logistic regression. On the one hand, the parameters are estimated by using Maximum Likelihood Estimation (MLE). On the other, a Markov-Chain-Monte-Carlo (MCMC) simulation. In particular, the resulting confidence and credible intervals are compared.
The dataset used contains information on credit card clients from Taiwan. In addition to information on default payments, demographic details of the clients are also available. A total set of 7 features was selected. The goal is to estimate the parameters of a logistic function which is utilized to model the probability of default:
The log-Likelihood of observing the data given model parameters is defined by:
Where is a binary variable that indicates whether the credit card user is defaulted () or not (). To stabilize the solution, a regularization term with strength is added to the likelihood. The bias introduced due to the regularization will not be investigated further in this article. However, it should be noted at this point that the regularization can lead to an unwanted correlation between the model parameters. This influence can be investigated quantitatively to estimate a suitable regularization parameter.
def logit(X, popt): ''' Logistic Regression ''' return (np.exp(np.dot(X, popt)) / (1 + np.exp(np.dot(X, popt)))) def llh(popt, X, sgn = 1, tau = 1): ''' Negative log-Likelihood. ''' y = X[:,-1] # default 0/1 # use sgn = 1 if maximizing likelihood and sgn = -1 if minizming llh = np.sum((1 - y) * np.log(1 - logit(X[:,:-1], popt)) + y * np.log(logit(X[:,:-1], popt))) \ - tau * np.linalg.norm(popt) return sgn * llh
In practice, the negative log likelihood is minimized (
sgn = -1 ):
# initial guess p0 = np.ones(X.shape[-1] - 1) / 100 # standard deviations obtained from inverted hessian matrix res = minimize(llh, p0, method = 'bfgs', args = (X, -1, 1), tol = 1e-3) errors = np.sqrt(np.diag(res.hess_inv)) print(res.success) >> True
The covariance matrix of the estimated parameters is obtained by inverting the Hessian matrix. Another approach to parameter estimation, besides the frequentistic approach just described, is the MCMC method to generate samples from the posterior given a likelihood , prior and evidence .
The posterior probability distribution is defined by Bayes’ theorem:
The most famous algorithm to draw samples from the posterior distribution is the Metropolis-Hastings-Algorithm. Here a random walker moves across the model parameter space and the proposed positions are accepted or denied, under consideration of the detailed balance condition, with the posterior probability given the parameter value, i.e. the walkers’ position. The algorithm used in this article uses an ensemble of random walkers and is implemented in the Python library
emcee . Since the absolute probabilities are not required, but only the probabilities relative to each other, the evidence which is not a function of the model parameters is neglected. Using a flat prior, the MCMC simulation is implemented as follows:
def prior(popt): ''' Prior ''' return 0 def log_fn(popt, X, sgn = 1): ''' log Posterior ''' l = llh(popt, X, sgn) p = prior(popt) return l + p # mcmc setup ndim = X.shape[-1] - 1 # only features nwalkers = 100 n_steps = 1000 n_burnin = 200 noise = np.abs(np.random.normal(size = (nwalkers, ndim))) / 100 initial_state = np.repeat(res.x, nwalkers).reshape(nwalkers, ndim) * noise # initialize sampler sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob_fn = llh, args = [X, 1]) # burn-in-phase state = sampler.run_mcmc(initial_state, n_burnin, progress = True) sampler.reset() # mcmc sampler.run_mcmc(state, n_steps, progress = True) chain = sampler.get_chain(flat = True)
The walkers are randomly initialized at positions
initial_state. The introduction of a noise to the
initial state is necessary so that the positions of the walkers are linearly independent of each other. In order to let the walkers explore the parameter space, a burn-in phase of 200 steps is performed before the actual simulation.
After drawing samples from the posterior it is now possible to histogram the samples projected into the model parameter subspace given by .
As can be seen in Figure 1, the estimated credible and confidence intervals show a remarkable agreement. This behavior was to be expected, since no prior belief was incorporated into the posterior by choosing a uniform prior. In other words, in the initial belief all model parameters are equally likely. The differences between frequentist and Bayesian reasoning are fundamental, a start might be . In frequentist (or classical) inference, probabilities are seen as a long run frequencies while in Bayesian inference probabilities are interpreted as degrees of belief which should be examined and analyzed. From a Bayesian perspective, a parameter is considered a random variable with an underlying distribution. Also, probabilities are assigned to non-repeatable events which is not meaningful from a classical point of view.
The estimated parameters given by the median of the posterior and the MLE are shown in Table 1. To compare the estimated uncertainties, Table 2 shows the one-sigma (denoted as
sigma) confidence intervals from the MLE and the averaged credible interval (denoted as
q) from the MCMC simulation.
The obtained results of the estimated parameters are similar. However, the credible interval (
q) for the parameters and is considerably wider than the associated one-sigma confidence interval (
If you are interested in the entire code and would like to reproduce the results, please send us an e-mail.
Resources Foreman-Mackey, D. et al. “emcee: The MCMC Hammer.” In: Publications
of the Astronomical Society of the Pacific 125.925 (Mar. 2013), pp. 306–312.  Goodman, J. and Weare, J. “Ensemble samplers with affine invariance.” In: Commun. Appl. Math. Comput. Sci. 5.1 (2010), pp. 65–80.  Shalizi, C. “Advanced Methods of Data Analysis.” (2012), pp. 223-228.
URL: https://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch12.pdf Wassermann, L. and Liu, H. “Statistical Machine Learning.” (2015), pp. 299-351.