## Tuesday, September 25, 2012

### Going the Distance: Position in the Pack

On Saturday, September 22nd, I ran the Wörthersee Trail Maniak 57k ultramarathon (the race report is here) and, as with the previous ultra I ran, I decided to see how my time stacked up against the rest of the field.  This exercise doesn't have anything to do with the pursuit of my Ph.D., but it does give me an excuse to program in Stata and generate a nifty graph.

As I stated in my race report, I felt like I was haulin' ass for most of this race, but given my time goal (sub-seven hours for ~35 miles) I'd have to be moving.  And I was.  At least I thought I was, until it seemed like the entire field was pulling away from me, one runner at a time.  My mediocre (relatively speaking) performance was confirmed once I examined the distribution of finishing times and my position therein.  The Stata code is pasted below, with the graph following.
```stata
capture log close
log using worth57k_graph, replace

// program:  worth57k_graph.do
// task:  graph distribution of finish times w/ my time highlighted
// project:  drivel
// author:    cjt
// born on date:  20120925

// #0
// program setup

version 11.2
clear all
macro drop _all
set more off

// #1
// insheet CSV
insheet using "C:\Documents and Settings\cjt\Desktop\Worthersee57k\Worthersee57kResultsTrnsfr.csv"

// #2
// convert string time variables to numeric time variables
foreach var of varlist velden pyramiden finish {
    gen double `var'_temp = clock(`var', "hms")
    drop `var'
    rename `var'_temp `var'
    format `var' %tcHH:MM:SS
}
*end;

// #3
// identify quartiles, median, and mean for plotting on histogram
quietly su finish, det

// #4
// plot the finish times via a -histogram- along w/ lines denoting the 25th percentile,
//    median, mean, 75th percentile, as well as my time
hist finish, freq xtick(16200000(1800000)37800000) xlabel(18000000(3600000)36000000) ///
    addplot(pci 0 27112000 38 27112000, lwidth(medthick) || pci 0 `r(p25)' 40 `r(p25)' ///
    0 `r(p50)' 40 `r(p50)' 0 `r(p75)' 40 `r(p75)', lpattern(shortdash) || pci 0 `r(mean)' 40 `r(mean)', ///
    lpattern(longdash)) text(15 `r(p25)' "25th Percentile", orientation(vertical) place(nw)) ///
    text(15 `r(p50)' "Median", orientation(vertical) place(nw)) text(15 `r(mean)' "Mean", ///
    orientation(vertical) place(ne)) text(15 `r(p75)' "75th Percentile", orientation(vertical) place(ne)) ///
    text(39 27112000 "Me", place(c)) text(0.5 27112000 "07:31:52", orientation(vertical) place(nw)) ///
    scheme(s1color) legend(off) xtitle("Finish Time (Hours Elapsed)") ytitle("Frequency") ///
    title("Worthersee Trail Maniak 57k") subtitle("Distribution of Finish Times")

// #5
// -export- graph
gr export worth57k_graph.png, replace

log close
exit
```

The resulting graph:

As can be seen from the graph, I wasn't imagining the field pulling away from me.  The distribution of times is essentially normal --- the mean and median are nearly identical --- with my time approximately 30 minutes slower than that of the statistical middle-of-the-packer.  Ordinarily, I don't give much thought or attention to my relative position among the finishing times, but with this race I couldn't resist, simply because it was obvious while running that either (a) I was running slower than it felt or (b) the field as a whole was running faster than I'm accustomed to.  As is evident from the graph, it was probably a combination of both.

## Monday, September 24, 2012

### Wörthersee Trail Maniak 57k: Don't Walk --- Run

I felt like I was haulin' ass the entire time.  The ascents, the flats, and the descents:  haulin' ass.  It wasn't until mile 20 that the gradient became steep enough to relegate most runners to hiking.  Up until that point, however, the ascents were short enough that most people --- me included --- managed to run/jog up them.  By the time I made it to Velden (approximately half-way), my hip flexors were tight, my knees were sore, and my spirits were flagging.
 We're off!  The start of what felt like haulin' ass for 35 miles.

Prior to the race, my goal was a sub-seven-hour finish, but by the time I made it to Velden just over 3h:15m had elapsed, so I knew any chance of breaking seven hours would require an even or negative split, and even that assumed the elevation gain/loss on the return would be similar to what we experienced heading out.  That, of course, wasn't the case: I knew the longest climb awaited (~1,000 ft over three miles or so) and that the return leg was longer than the outbound one (the race circumnavigated a lake).  I wasn't yet ready to completely surrender my goal, however, so I soldiered on.

I was a little lackadaisical about jockeying for position before the race started --- I started near the rear --- but once we were off and running I was passing more than being passed.  I felt good, although I was concerned about heading out too fast, too early, then falling apart near the end.  I kept doing the math over and over in my head:  "Twelve-minute miles equals five miles per hour; ~35 miles at five miles per hour means approximately seven hours."  I didn't think I'd have too much of a problem since I had, after all, run the 50k distance before in the 6-7 hour range, and those two races featured considerably more vertical gain.  This race was, of course, slightly longer, but I figured that it being (relatively) flatter would translate into (relatively) faster running.  My logic was flawless.  Or so I thought.

I spent the last 8-9 miles thinking about how the race had gone and my slower-than-expected finish time.  I clearly underestimated how much faster the field runs when the course features only a modest amount of gross elevation gain/loss (1,800 meters = ~5,900 feet) --- and Austrian and German ultra marathoners don't screw around.  They didn't piss around at the aid stations, there was very little chatter and small talk among runners, and, true to the stereotypes, they showed up and got down to business.  It was remarkable.  And quite humbling.  I felt like I ran a pretty fast race, but either my perception of "fast" is considerably slower than it was four years ago or the ultra runners here are just, well, faster.  Probably a combination of both.

When I finally crossed the finish line, I was 167th among 295 finishers, with a time of 7h:31m:52s.  I was expecting to finish much lower in the rankings (bottom quarter, at least), so I was pleasantly surprised by my placement.  The race organization was top-notch, the race schwag commendable, and the course varied, technical, steep, and fast enough to please virtually any ultra runner.  Just be prepared to haul ass.
 57k (~35 miles):  7h:31m:52s

## Thursday, September 20, 2012

### Robust Variance Estimate

When the variance-covariance structure of a model is mis-specified, it needs adjusting.  Mis-specification may arise because observations are clustered, because the variance is over- or under-dispersed relative to the model-specified variation, because the data don't come from a simple random sample, or because the distribution of $(x_i, \epsilon_i)$ is not independent and identically distributed (i.i.d.) (Lachin 2000, Stata 11 Reference Manual).  In this situation, a robust estimate of the variance must be obtained.  I'm not sure this variance estimate has an official, undisputed name, but its longest name is the Huber/White/sandwich estimate; in the interest of brevity, I'll use "robust variance estimate."

Under normal circumstances, i.e. when $E(y_i) = Var(y_i)$, the maximum likelihood estimate (MLE) of $\beta$ satisfies
$\sqrt{n}(\hat \beta_{ML} - \beta_0) \:\: \xrightarrow{d} \:\: N(0, I(\beta_0)^{-1})$,
where $I(\beta_0)$ is the Fisher information matrix, equal to minus the expected value of the Hessian matrix:
$I(\beta_0) = -E\Bigl[\frac{\partial^2 \ell(\beta; y_i, x_i)}{\partial \beta \partial \beta'}\Bigr]_{\beta_0}$
Per the previous post, the Hessian is the second derivative of the log likelihood with respect to $\beta$:
$\frac{\partial^2 \ell(\beta; y_i, x_i)}{\partial \beta \partial \beta'} = -\sum_{i=1}^N exp(x'_i \beta)x_i x'_i$
which is then negated and inverted per the properties of the MLE and Fisher information:
$I(\beta_0) = -E[H(\beta_0)] \: \rightarrow \: Var(\hat \beta) = [n I(\beta_0)]^{-1} = n^{-1} E[exp(x' \beta_0)xx']^{-1}$

When the variance is not equal to the mean --- under- or over-dispersion is present --- a pseudo-maximum likelihood (PML) estimate must be used instead.  The only difference between the PML variance matrix and the normal ML variance matrix is that an empirical estimate of the observed variation is sandwiched between two copies of the model-based variance of the estimates (Lachin 2000, Winkelmann 2008).  Symbolically, we can denote the asymptotic ML variance matrix as
$n^{-1} I^{-1} = n^{-1} E[exp(x' \beta_0)xx']^{-1}$
and the asymptotic PML variance matrix as
$n^{-1} I^{-1} J I^{-1} = n^{-1} E[exp(x' \beta_0) xx']^{-1} \: E[(y - exp(x'\beta_0))^2 xx'] \: E[exp(x'\beta_0)xx']^{-1}$

The distribution of the Poisson PMLE (assuming a large but finite sample), then, is:
$\hat \beta \; \sim \; N \: (\beta_0, \widehat{Var}(\hat \beta))$
where $\widehat{Var}(\hat \beta) = \Bigl[\sum_{i=1}^n x_i x'_i exp(x'_i \hat \beta)\Bigr]^{-1} \; \sum_{i=1}^n x_i x'_i \widehat{Var}(y_i \vert x_i) \; \Bigl[\sum_{i=1}^n x_i x'_i exp(x'_i \hat \beta)\Bigr]^{-1}$
The estimate of $\widehat{Var}(y_i \vert x_i)$ follows from the outer product of the score formula:
$\widehat{Var}(y_i \vert x_i) = (y_i - exp(x'_i \hat \beta))^2$ .

Estimating the Poisson variance matrix in this way leads to consistent estimation not only of the parameters but also of the standard errors.  In the presence of, for example, over-dispersion, the variance (and hence the standard errors) would otherwise be underestimated, artificially inflating the t-statistics and leading to spurious rejection of null hypotheses.  Adjusting the variance, as in a robust Poisson regression, corrects this problem such that valid inferences are obtained (Winkelmann 2008).
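To make the sandwich concrete, here's a small numerical sketch (in Python with NumPy; the data are simulated, not from any real study): fit a Poisson model to over-dispersed counts, then compare the model-based variance with the robust $I^{-1} J I^{-1}$ sandwich.  The robust standard errors should come out larger, exactly as the over-dispersion argument above predicts.

```python
import numpy as np

# Simulate over-dispersed counts: negative binomial with mean mu = exp(x'beta)
# and variance mu * (1 + mu/r), so the Poisson mean = variance assumption fails.
rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 0.3])
mu = np.exp(X @ beta_true)
r = 2.0
y = rng.negative_binomial(r, r / (r + mu))

# Poisson (pseudo-)MLE via Newton-Raphson
beta = np.zeros(2)
for _ in range(25):
    mu_hat = np.exp(X @ beta)
    score = X.T @ (y - mu_hat)
    hessian = -(X * mu_hat[:, None]).T @ X
    beta -= np.linalg.solve(hessian, score)

# Model-based variance: [sum_i x_i x_i' exp(x_i'b)]^{-1}
mu_hat = np.exp(X @ beta)
bread = np.linalg.inv((X * mu_hat[:, None]).T @ X)
# "Meat": sum_i (y_i - exp(x_i'b))^2 x_i x_i' (outer product of the scores)
meat = (X * ((y - mu_hat) ** 2)[:, None]).T @ X
V_model = bread
V_robust = bread @ meat @ bread          # the sandwich
se_model = np.sqrt(np.diag(V_model))
se_robust = np.sqrt(np.diag(V_robust))
```

Because the conditional mean is still correctly specified, the point estimates remain consistent; only the standard errors need the correction.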

## Friday, September 14, 2012

### Poisson Distribution --> Regression

I'll be using the Poisson Regression Model (PRM) for my dissertation research.  This model is flexible, robust, and fairly well-behaved even when the data aren't count data.  The regression model follows from the Poisson distribution, so let's start there.

In the most general case, the Poisson process may be time-heterogeneous, thus requiring the inclusion of a time parameter, $t$, but let's assume the time intervals are constant such that the expression of the Poisson distribution omits the time parameter.  Let the Poisson distribution be generically expressed as:
$Pr(y \vert \mu) = \frac{exp(-\mu)\mu^y}{y!} \quad \text{for } y=0,1,2,\ldots$
where $\mu$ is the conditional mean depending on the characteristics of subject $i$.  This parameter may also be regarded as the rate parameter and since it must be positive (as dictated by the Poisson distribution) we get
$\mu_i = E(y_i \vert {\boldsymbol{x_i}}) = exp({\boldsymbol{x_i \beta}})$

which is then used to define the PRM:
$Pr(y_i \vert {\boldsymbol{x_i}}) = \frac{{\displaystyle exp(-\mu_i) \mu_{i}^{y_i}}}{{\displaystyle y_i !}}$

Since estimation of the $\beta$ coefficients is what is eventually sought (via maximum likelihood), the likelihood and log likelihood functions need to be derived.  The likelihood function is the product of the individual conditional probability distributions, understood as a function of the parameters (Winkelmann 2008).  Most generically, the likelihood function is denoted $L$ and can be written as
$L = L(\beta;y_1, \ldots ,y_n, x_1, \ldots , x_n)$
such that $\hat \beta$ is the maximization of the above (via the log likelihood).  The likelihood function of the PRM is
$L({\boldsymbol{\beta \vert y, X}}) = \prod_{i=1}^N Pr(y_i \vert \mu_i) = \prod_{i=1}^N \frac{exp(-\mu_i) \mu_i^{y_i}}{y_i !}$

The log likelihood is obtained by taking the natural log of the likelihood:
$\ell({\boldsymbol{\beta \vert y, X}}) = \sum_{i=1}^N \bigl[-\mu_i + y_i \ln(\mu_i) - \ln(y_i !)\bigr]$

Since $\mu_i$ is defined as above, substitute then simplify:
$\ell({\boldsymbol{\beta \vert y, X}}) = \sum_{i=1}^N \bigl[-exp({\boldsymbol{x_i \beta}}) + y_i \ln (exp({\boldsymbol{x_i \beta}})) - \ln (y_i !)\bigr]$
$\ell({\boldsymbol{\beta \vert y, X}}) = \sum_{i=1}^N \bigl[-exp({\boldsymbol{x_i \beta}}) + y_i({\boldsymbol{x_i \beta}}) - \ln (y_i !)\bigr]$

The maximum likelihood estimator (MLE) follows from taking the first derivative of the log likelihood with respect to $\beta$, returning a vector of values known as the gradient vector or score vector.  This isn't the only condition for the MLE, however: the Hessian matrix (the second derivative of the log likelihood with respect to $\beta$) must also exist and be negative definite for all values of $\beta$ and $x$.  The score vector:
$\frac{\partial \ell (\beta; y, x)}{\partial \beta} = \sum_{i=1}^N [-exp(x_i \beta)x_i + y_i x_i] = \sum_{i=1}^N x_i[y_i - exp(x_i \beta)]$

The Hessian is the second derivative and is thus:
$\frac{\partial^2 \ell(\beta; y, x)}{\partial \beta \partial \beta'} = - \sum_{i=1}^N exp(x_i \beta)x_i x'_i$

Unlike the MLE for the binary logit model, the MLE for the PRM follows from a system of nonlinear equations (setting the score vector to zero) that must be solved via some sort of iterative algorithm, one of the most common being Newton-Raphson.  The details of such algorithms are beyond the scope of this post (and my research) since, fortunately, modern statistical software can easily estimate the MLEs.
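As a sanity check on the derivation, here's a brief sketch (Python with NumPy, simulated data; not the software I'll actually use) that verifies the score and Hessian formulas above against numerical derivatives of the log likelihood, then runs Newton-Raphson to find the MLE.  The $\ln(y_i!)$ term is dropped since it doesn't depend on $\beta$.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # rows are x_i'
beta_true = np.array([0.2, 0.4])
y = rng.poisson(np.exp(X @ beta_true))

def loglik(b):
    # sum_i [-exp(x_i'b) + y_i * (x_i'b)]   (ln y_i! omitted: constant in b)
    eta = X @ b
    return np.sum(-np.exp(eta) + y * eta)

def score(b):
    # sum_i x_i [y_i - exp(x_i'b)]
    return X.T @ (y - np.exp(X @ b))

def hessian(b):
    # -sum_i exp(x_i'b) x_i x_i'
    mu = np.exp(X @ b)
    return -(X * mu[:, None]).T @ X

# 1) check the analytic score against central finite differences
b0 = np.array([0.1, -0.2])
h = 1e-5
num_grad = np.array([
    (loglik(b0 + h * e) - loglik(b0 - h * e)) / (2 * h)
    for e in np.eye(2)
])

# 2) Newton-Raphson: b <- b - H^{-1} g, iterated to convergence
b = np.zeros(2)
for _ in range(25):
    b -= np.linalg.solve(hessian(b), score(b))
```

At the converged `b`, the score is (numerically) zero and the Hessian is negative definite, i.e. both MLE conditions from the derivation hold.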

Now that the PRM has been introduced, it can be made more robust by changing the variance to a Huber/White/sandwich estimator (in a forthcoming post).

## Wednesday, September 12, 2012

### Copenhagen Half-Marathon

 Partly sunny day in Copenhagen
As with the previous half-marathon Lisa and I ran together, I proposed trying to run sub-two hours.  And like last time, she didn't answer yes or no.  We just ran.  But unlike last time, we broke two hours...and not by just a few seconds.  Lisa ran a solid 1:58:22 and I followed behind with 1:58:23.  She ran a great race, especially since in the week leading up to it she was only able to run one day.  Perhaps the minimal miles, along with the full spread of sushi the night before (gorge on pasta?  Ha!), were the key.  As for me, I felt good and solid the entire time, but since this wasn't a peak race for me --- the Worthersee 57k on September 22nd is --- I didn't want to overdo it.  The course featured a lot of turns, bottlenecks, and some cobblestone; I liked it, but Lisa, not so much.  (It must not have been too objectionable, though, because she broke her half-marathon personal best!)  All in all, the race was well-organized, the course scenic, and the support spectacular.

## Tuesday, September 4, 2012

### Confounders: Control in Design vs. Adjust in Analysis

In my lecture and mid-term review notes from the advanced epidemiology course I took in Spring 2009, I scribbled "control in design; adjust in analysis" with respect to how confounding variables are dealt with depending on the phase of the study in which they are addressed.  In short, if a hypothesized confounder is dealt with in the design phase, it is "controlled" for, whereas if it is dealt with during the statistical analysis phase, it is "adjusted" for.  This distinction might be more of a semantic issue, but it is, nevertheless, one that exists, and it affects how confounding variables are handled in a study.  My introductory epidemiology textbook, "Essentials of Epidemiology in Public Health" by Aschengrau & Seage, defines confounding as the "mixing of effects between an exposure, an outcome, and a third extraneous variable known as a confounder" (p. 282) and notes that it can be addressed in the design phase of the study, during the statistical analysis phase, or a combination of the two.  Confounding is a non-trivial issue in epidemiological research and, as such, a great deal of energy has been spent explaining what it is, how it affects associations between exposure and disease, how it is identified, and how it ought to be addressed when present.  Any attempt to discuss all of the aforementioned in this blog post would be both foolhardy and arrogant --- there are numerous texts available that discuss confounding at length --- so I'll touch only on the distinction highlighted above.
 Reproduced from "Epidemiology:  Beyond the Basics" (Szklo & Nieto), p. 155

Control or adjustment of a confounder requires that the analyst know (or at least suspect) which variables might present as confounders.  Some disease-exposure associations have well-known confounders (e.g. age or gender), and in those situations steps can be taken in the design phase of the research to mitigate the effect.  The three primary means of controlling for confounding variables in the design stage are (1) randomization, (2) restriction, and (3) matching.

Randomization is the defining characteristic of the 'gold standard' study --- a randomized controlled trial (RCT) --- and ensures that, on average, known and unknown confounders will be balanced between the two treatment groups.  Randomization, however, is costly and often unrealistic or unethical in the context of epidemiological studies so this is rarely the go-to strategy for dealing with confounding.

Restriction narrows the variability of the subjects in the study by, for example, enrolling only females (eliminating the confounding effect of gender) or only subjects aged 21-25 years (eliminating age as a confounder).  The major drawback to this approach is the loss of generalizability: if only females were analyzed, then any inferences drawn from the study apply only to females.

Matching entails the selection of subjects into the study such that potential confounders are distributed evenly between the two groups (i.e. exposed and unexposed groups).  Matching can occur either on an individual level or at a categorical level (frequency matching) but regardless of the type, the goal is to have the confounder equally distributed within both the exposed and unexposed groups.  The major drawback to matching is that the matching variable cannot be analyzed statistically, i.e. the association between the confounder (matching variable) and the outcome can no longer be estimated.

The three primary means of handling confounding in the analysis phase are (1) standardization, (2) stratification, and (3) multivariate methods.  All of these approaches are employed after the study design phase and data collection.  The first strategy, standardization, is usually one of the first principles taught in an elementary epidemiology course and is of two flavors:  direct and indirect.  Standardization is often used to control for standard demographics (e.g. race, gender, or age) and recasts, for example, the mortality rates between two groups as if they had the same race/gender/age distribution.
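A toy numerical example may help illustrate direct standardization (Python; every count and rate below is invented purely for illustration): two towns with very different age structures are recast against the same standard population, and most of the "crude" mortality gap disappears.

```python
import numpy as np

# Hypothetical age-specific mortality rates (per 1,000 person-years) and
# population counts for two towns, plus a shared standard population.
rate_a = np.array([1.0, 5.0, 40.0])    # town A: rates for <40, 40-64, 65+
rate_b = np.array([1.5, 6.0, 45.0])    # town B: slightly worse in every stratum
pop_a = np.array([6000, 3000, 1000])   # town A skews young
pop_b = np.array([2000, 3000, 5000])   # town B skews old
std = np.array([4000, 3000, 3000])     # standard population

# Crude rates: each town weighted by its own age structure
crude_a = rate_a @ pop_a / pop_a.sum()
crude_b = rate_b @ pop_b / pop_b.sum()

# Directly standardized rates: both towns weighted by the SAME standard population
adj_a = rate_a @ std / std.sum()
adj_b = rate_b @ std / std.sum()
```

The crude rates (6.1 vs 24.6 per 1,000) suggest town B is four times worse off, but the standardized rates (13.9 vs 15.9) reveal that most of the gap is age structure, not a genuinely higher stratum-specific risk.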

Stratification is a bit more complicated, but it is, essentially, the separation of results into discrete groups.  Once the groups (strata) are created according to the suspected confounder, the measure of association (e.g. the odds ratio) is estimated for each stratum and then compared to the overall measure of effect.  If the stratum-specific odds ratios are close to each other but there is an appreciable difference between them and the overall odds ratio, then confounding is likely to exist.  Stratum-specific odds ratios should be reported in this instance, whereas if there is no appreciable difference between the stratum-specific odds ratios and the overall odds ratio, then a single summary measure of effect can be reported (the statistically weighted Mantel-Haenszel estimate).  There are, of course, exceptions to this simplistic guideline, and the situation when working with real data can be considerably more complex and nuanced --- confounding can be positive, negative, or qualitative --- but the general thrust of the strategy remains: stratum-specific estimates compared to an overall estimate.  The main disadvantage of stratified analysis is that it isn't really viable when multiple confounders exist, since not every stratum will be populated with study subjects; estimation of odds ratios, relative risks, etc. in this situation leads to unstable and unreliable results.  The alternative --- and the one that seems to be the predominant method given the power of modern computing --- is to conduct a multivariate analysis.
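The stratified workflow just described can be sketched in a few lines (Python; the 2x2 tables are invented purely for illustration): compute stratum-specific odds ratios, the crude (collapsed) odds ratio, and the Mantel-Haenszel weighted summary.  Here both stratum-specific ORs are 1.0 while the crude OR is inflated, which is exactly the confounding signature.

```python
# Hypothetical stratified 2x2 tables: (a, b, c, d) = (exposed cases,
# exposed controls, unexposed cases, unexposed controls), one tuple per
# stratum of the suspected confounder.
strata = [
    (80, 20, 40, 10),   # confounder present
    (10, 40, 20, 80),   # confounder absent
]

def odds_ratio(a, b, c, d):
    return (a * d) / (b * c)

# Stratum-specific odds ratios
or_strata = [odds_ratio(*t) for t in strata]

# Crude (collapsed) odds ratio: ignore the strata and pool the counts
a, b, c, d = (sum(col) for col in zip(*strata))
or_crude = odds_ratio(a, b, c, d)

# Mantel-Haenszel weighted summary odds ratio:
#   sum_k (a_k d_k / n_k) / sum_k (b_k c_k / n_k)
num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
or_mh = num / den
```

Both stratum-specific ORs come out to 1.0 and the Mantel-Haenszel summary agrees, yet the crude OR is 2.25: an appreciable gap between the stratum-specific and overall estimates, so the stratum-specific (or summary) estimates are the ones to report.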

In a multivariate analysis (the brunt of my dissertation analysis relies on multivariate analysis), a statistical model is created (e.g. ordinary least squares regression, Poisson regression, logistic regression) wherein the associations between disease & exposure and disease & confounders are estimated simultaneously.  Inserting all the variables into the model in this way allows various confounding variables to be controlled for at once.  As powerful as multivariate analysis (i.e. regression modeling) is, it shouldn't be applied willy-nilly.  The power and ease with which statistical models can be built is astonishing, and unless the objective is to dredge the data and explore for associations (exploratory data analysis), the model building should be purposeful, thoughtful, and should follow from the study hypotheses.