## Friday, September 14, 2012

### Poisson Distribution --> Regression

I'll be using the Poisson Regression Model (PRM) for my dissertation research.  This model is flexible, robust, and fairly well-behaved even when the data isn't count data.  The regression model follows from the Poisson distribution, so let's start there.

In the most general case, the Poisson process may be time-heterogeneous, thus requiring the inclusion of a time parameter, $t$, but let's assume the time intervals are constant such that the expression of the Poisson distribution omits the time parameter.  Let the Poisson distribution be generically expressed as:
$\Pr(y \vert \mu) = \frac{\exp(-\mu)\mu^y}{y!} \quad \text{for } y = 0, 1, 2, \ldots$
where $\mu$ is the conditional mean, which in the regression setting depends on the characteristics of subject $i$.  This parameter may also be regarded as the rate parameter, and since it must be positive (as dictated by the Poisson distribution) we get
$\mu_i = E(y_i \vert {\boldsymbol{x_i}}) = \exp({\boldsymbol{x_i \beta}})$

which is then used to define the PRM:
$\Pr(y_i \vert {\boldsymbol{x_i}}) = \frac{\exp(-\mu_i)\, \mu_i^{y_i}}{y_i!}$
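
To make this concrete, here is a minimal Python sketch that evaluates the Poisson probability mass function directly from the formula above (the rate $\mu = 2$ is an arbitrary value chosen for illustration):

```python
import math

def poisson_pmf(y, mu):
    """Pr(y | mu) = exp(-mu) * mu**y / y!"""
    return math.exp(-mu) * mu**y / math.factorial(y)

# Probabilities of the first few counts for mu = 2;
# they sum toward 1 as more counts are included.
probs = [poisson_pmf(y, 2.0) for y in range(10)]
```

In the PRM, the same function is simply evaluated with a subject-specific rate $\mu_i = \exp({\boldsymbol{x_i \beta}})$.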

Since estimation of the $\beta$ coefficients is what is eventually sought (via maximum likelihood), the likelihood and log likelihood functions need to be derived.  The likelihood function is the product of the individual conditional probability distributions, understood as a function of the parameters (Winkelmann 2008).  Most generically, the likelihood function is denoted as $L$ and can be written as
$L = L(\beta; y_1, \ldots, y_n, x_1, \ldots, x_n)$
such that $\hat \beta$ is the value that maximizes the above (via the log likelihood).  The likelihood function of the PRM is
$L({\boldsymbol{\beta \vert y, X}}) = \prod_{i=1}^N \Pr(y_i \vert \mu_i) = \prod_{i=1}^N \frac{\exp(-\mu_i)\, \mu_i^{y_i}}{y_i!}$

The log likelihood is obtained by taking the natural log of the likelihood:
$\ell({\boldsymbol{\beta \vert y, X}}) = \sum_{i=1}^N \left[ -\mu_i + y_i \ln(\mu_i) - \ln(y_i!) \right]$

Since $\mu_i$ is defined as above, substitute then simplify:
$\ell({\boldsymbol{\beta \vert y, X}}) = \sum_{i=1}^N \left[ -\exp({\boldsymbol{x_i \beta}}) + y_i \ln(\exp({\boldsymbol{x_i \beta}})) - \ln(y_i!) \right]$
$\ell({\boldsymbol{\beta \vert y, X}}) = \sum_{i=1}^N \left[ -\exp({\boldsymbol{x_i \beta}}) + y_i({\boldsymbol{x_i \beta}}) - \ln(y_i!) \right]$
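
As a quick sanity check on this simplification, the closed-form sum can be compared against the log of the product of individual Poisson probabilities.  The counts and linear predictors below are made-up values for illustration; `math.lgamma(y + 1)` computes $\ln(y!)$:

```python
import math

ys = [1, 0, 2, 3, 1]              # hypothetical observed counts y_i
xbs = [0.2, -0.5, 0.4, 0.9, 0.0]  # hypothetical linear predictors x_i * beta

# Closed-form log likelihood: sum_i [ -exp(x_i b) + y_i (x_i b) - ln(y_i!) ]
ll_closed = sum(-math.exp(xb) + y * xb - math.lgamma(y + 1)
                for y, xb in zip(ys, xbs))

# The same quantity obtained directly as the sum of log Poisson probabilities
def poisson_pmf(y, mu):
    return math.exp(-mu) * mu**y / math.factorial(y)

ll_direct = sum(math.log(poisson_pmf(y, math.exp(xb)))
                for y, xb in zip(ys, xbs))
```

The two numbers agree to machine precision, confirming the algebra above.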

The maximum likelihood estimator (MLE) follows from taking the first derivative of the log likelihood with respect to $\beta$, which returns a vector of values known as the gradient or score vector, and setting it equal to zero.  This isn't the only condition for the MLE, however: the Hessian matrix (the second derivative of the log likelihood with respect to $\beta$) must also exist and be negative definite for all values of $\beta$ and $x$.  The score vector:
$\frac{\partial \ell(\beta; y, x)}{\partial \beta} = \sum_{i=1}^N \left[ -\exp(x_i \beta) x_i + y_i x_i \right] = \sum_{i=1}^N x_i \left[ y_i - \exp(x_i \beta) \right]$

The Hessian is the second derivative and is thus:
$\frac{\partial^2 \ell(\beta; y, x)}{\partial \beta \, \partial \beta'} = -\sum_{i=1}^N \exp(x_i \beta)\, x_i x_i'$
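
These derivatives can be verified numerically.  The sketch below, for the scalar-covariate case with made-up data, checks the analytic score and Hessian against finite-difference approximations of the log likelihood:

```python
import math

def loglik(beta, ys, xs):
    # ell(beta) = sum_i [ -exp(x_i beta) + y_i (x_i beta) - ln(y_i!) ]
    return sum(-math.exp(x * beta) + y * x * beta - math.lgamma(y + 1)
               for y, x in zip(ys, xs))

def score(beta, ys, xs):
    # sum_i x_i [ y_i - exp(x_i beta) ]
    return sum(x * (y - math.exp(x * beta)) for y, x in zip(ys, xs))

def hessian(beta, xs):
    # -sum_i exp(x_i beta) x_i^2  (scalar case of -sum_i exp(x_i beta) x_i x_i')
    return -sum(math.exp(x * beta) * x * x for x in xs)

# Hypothetical data and an arbitrary evaluation point
ys = [1, 0, 2, 3, 1]
xs = [0.5, -1.0, 1.0, 1.5, 0.0]
b = 0.3

# Central finite differences of the log likelihood
h = 1e-5
num_score = (loglik(b + h, ys, xs) - loglik(b - h, ys, xs)) / (2 * h)
h2 = 1e-4
num_hess = (loglik(b + h2, ys, xs) - 2 * loglik(b, ys, xs)
            + loglik(b - h2, ys, xs)) / h2**2
```

Note that the Hessian is negative for any $\beta$ whenever some $x_i \neq 0$, which is the scalar version of the negative-definiteness condition above.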

Unlike the MLE for the binary logit model, the MLE for the PRM follows from a system of nonlinear equations (the score vector set to zero) that must be solved by an iterative algorithm, one of the most common being Newton-Raphson.  The details are beyond the scope of this post (and my research) since, fortunately, modern statistical software can easily compute the estimates.
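
That said, for the simplest case (a single covariate, no intercept) the Newton-Raphson iteration is short enough to sketch in a few lines of Python.  This is an illustrative toy, not a substitute for real software, and the data are made up:

```python
import math

def fit_poisson(ys, xs, beta=0.0, tol=1e-10, max_iter=50):
    """Newton-Raphson for a one-covariate PRM with mu_i = exp(x_i * beta)."""
    for _ in range(max_iter):
        mu = [math.exp(x * beta) for x in xs]
        score = sum(x * (y - m) for x, y, m in zip(xs, ys, mu))  # gradient
        hess = -sum(m * x * x for x, m in zip(xs, mu))           # 2nd derivative
        step = score / hess
        beta = beta - step      # beta_new = beta - H^{-1} * score
        if abs(step) < tol:
            break
    return beta

beta_hat = fit_poisson([1, 0, 2, 3, 1], [0.5, -1.0, 1.0, 1.5, 0.0])
```

At convergence the score evaluated at `beta_hat` is numerically zero, which is exactly the first-order condition stated above.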

Now that the PRM has been introduced, it can be made more robust by replacing the variance estimator with a Huber/White/sandwich estimator (in a forthcoming post).