Wednesday, December 21, 2011

Progression-Free Survival

A close friend of mine -- an insanely productive medical clinician -- asked me to assist him and a colleague on an analysis of survival time where the endpoint was "Progression-Free Survival" (PFS).  In my last job as a biostatistician for a contract research organization (CRO) specializing in the conduct and execution of NIH-funded clinical trials, I first encountered this endpoint -- as well as others that differed in small but minor ways -- but just took for granted what the endpoints meant since they had already been defined and programmed.  Now that I was in the position of explicitly defining and programming the endpoint for their analysis, I figured it would be prudent to research how exactly PFS is defined according to experts who specialize in these things.

A simple PubMed search of "progression-free survival" returned hundreds of hits -- many studies use PFS as an endpoint -- so I narrowed the search a bit by adding "definition" to the search terms.  Considerably fewer articles were identified and among the ones I found, there was a reference to a 2007 article published in the Journal of Clinical Oncology titled "Revised Response Criteria for Malignant Lymphoma" (2007 Feb 10;25(5):579-86) by Cheson, Pfistner, Juweid, et al..  This article clearly defines progression-free survival, overall survival (OS), event-free survival, as well as others, with a view toward making endpoint definitions more uniform in lymphoma studies.  (Click here to access the full article in a new window.)

The authors summarize and describe eight endpoints with two considered primary endpoints -- overall survival and progression-free survival -- with the remaining six considered secondary endpoints:  event-free survival, time to progression, disease-free survival, response duration, lymphoma-specific survival, and time to next treatment.  Overall survival is the most clear-cut and least ambiguous:  time from enrollment/entry/randomization to death from any cause.  Progression-free survival is slightly more complicated than OS in that in addition to death from any cause, disease progression also constitutes an event/endpoint.  The authors explain:  "PFS reflects tumor growth, and therefore is interpretable earlier than the end point of overall survival.  In addition, PFS is not confounded by the administration of subsequent therapy."

Among the secondary endpoints, event-free survival (EFS) -- also referred to as "time to treatment failure" -- and disease-free survival (DFS) are the two I've encountered.  EFS is calculated as the time from study entry/enrollment/randomization to any treatment failure, i.e. disease progression or discontinuation of treatment for any reason (disease progression, toxicity, patient request, new treatment initiation, or death).  According to the authors, this endpoint is generally frowned upon by regulators since it combines efficacy, toxicity, and patient withdrawal.  DFS, unlike the other endpoints mentioned, starts to measure survival time from the occurrence of a disease-free state or the attainment of complete remission and terminates with either disease recurrence, death resulting from the lymphoma, or onset of acute toxicity of treatment.  This endpoint is susceptible to bias from the correct/incorrect attribution of death and whether death is censored or not.

The following table was copied-and-pasted from the article and presents a concise definition of all eight endpoints.  

In order to analyze survival data in Stata, you need to first declare the data as "survival time" with the -stset- command.  This command can be used for both single-record and multiple record survival data and requires, at a minimum, a variable containing the observation time.  If you have censored data, a variable indicating whether the observation was censored or not also must be included.  I first calculated the survival time as the difference between the patient's diagnosis date and either the event date, death date, or last contact date (whichever occurred first).  Subjects who did not experience the event or died were censored.  The censor variable assumed two values -- zero (censored) or one -- and was a composite of the event (failure) and death variables   Finally, the data was set up for survival analysis by way of -stset-.  Snippets of code follow:

* **survival-time calculated in days
gen ftime_days = min(fail_dt, dthdt, lastctdt) - sj_dx_date
 label var ftime_days "Survival Time (days)"

* **generation of censor variable
gen fail_death = 1 if fail==1 | survstatus==1
 replace fail_death = 0 if missing(fail_death)
 label var fail_death "Failure/Death Indicator"
* **-stset- the data using follow-up time (days)
stset ftime_days, failure(fail_death==1)

Wednesday, November 30, 2011

Budapest Half

2011 hasn't been my best year of running.  The Vienna Marathon started out promising enough but once the short-sightedness of a decision to wear an ill-fitting minimalist shoe for the race became much more real (and painful), it then felt more like a death march to the end.  Then the onset of acute tendonitis in my lower left calf sidelined me for the latter half of June, all of July, and the first half of August resulting in my withdrawal from the Mountainman Ultra 80k in Lucerne, Switzerland in early August.  And then, in early September I ran the Budapest Half Marathon with the last-minute plan to run sub-2:00.  I've run a half marathon under two hours before and have a marathon PR of 3:39 so I was hopeful that I'd cross the finish line in under two hours even though my training had been pretty casual and eschewed any speed work.  But 'twas not to be:  I finished in 2:00:56.  I suppose I shouldn't have been too disappointed since it was unseasonably hot, the first water station was empty by the time the mid-packers arrived, and my training was lackadaisical but, alas, I was.  Since Budapest, I've been running 3-5 days per week with no goal other than to get out of the apartment, maintain some semblance of fitness, and continue to rehab my left calf.  I've started researching ultras for 2012 and have a couple in mind although nothing is definitive.  I'll update this blog once I decide on what race -- or races -- I'll run.  In the meantime, here's to a better year of running in 2012!

Tuesday, November 29, 2011

Optimistic About Beer

In the spirit of prudence -- I'm not sure what boundaries exist with respect to disclosure of my research topic to a larger, non-academic audience -- I haven't described exactly what my research topic is.  Hell, I haven't even given a general description.  And aside from the title of this post -- "Optimism" -- not much else will be disclosed.  That being said, I'm not feeling terribly optimistic lately about the progress of my research.  A timeline I charted a few months ago had my committee reviewing a first draft of my proposal at this point -- I have yet to even assemble and write a first draft.   The problem isn't the actual writing, per se, but the absence of a detailed and executable statistical analysis plan.  In an ideal situation, the statistical analysis plan would follow naturally from the research question/objective/hypothesis and data structure (cohort, case/control, RCT, cross-sectional, survey, etc.), but alas, an ideal is just that:  ideal and rarely realized.  In my situation, I know where I need to go -- Google Maps is cued up and ready to go -- but I can't find my car.  I'm not even sure I'm searching for it in the right parking lot.  But I continue to search, albeit with the occasional distraction:  the most recent being the writing of a Stata program that created a series of variables each containing a random shuffling of the numbers [1,6] with no two variables sharing the same sequence.  This exercise was for the hosting of a beer tasting/testing party by my wife and me where the sequence of beers each person would taste/test would differ and be randomly generated (per the Stata random number generator).  In a previous blog posting, I described my approach to verifying that no two variables shared the same number sequence; what follows is the entirety of the .do file I used to create and output the randomly shuffled number sequences.

capture log close _all
log using BeerTestRandomNumbers, name(log1) replace

// program:
// task:  create list of beers for each person w/ list randomized for each person
// project:  Team Clisa Beer Testing/Tasting Party
// author:    cjt
// born on date:  20111024

// #0
// program setup

version 11.2
clear all
macro drop _all
set more off

// #1
// declare number of observations (beers to taste)
set obs 6

// #2
// set randomization seed to ensure reproducibility
set seed 20111119

// #3
// create format label for numbers...
label define beerf 1 "Ottakringer" 2 "Gosser" 3 "Stiegl" 4 "Puntigamer" ///
5 "Schwechater" 6 "Zipfer"

// #4
// create 28 variables containing numbers [1,6] randomly shuffled
foreach num of numlist 1/28 {
  gen int seq`num' = _n
  label values seq`num' beerf
  gen rand`num' = runiform()
  sort rand`num'
  drop rand`num'

// #5
// generate 27! variables for pairwise comparisons to verify that randomization // order isn't the same between any two variables. 
// !!all randomization orders are different!!
forvalues i = 1(1)27 {
 forvalues j = `i'(1)27 {
  display "-------------------------------------------------"
  display "Variables being compared are seq`i' and seq`++j'"
  gen var`i'_`j' = 1 if seq`i' == seq`j'
  quietly sum var`i'_`j' if `i' != `j'
  assert `r(sum)' < 6
  drop var`i'*

// #6
// label the variables w/ guest names (note that this code was taken from a Stata Journal
// article ("Speaking Stata:  Problems With Lists"), SJ3,2.
forvalues i = 1/28 {
  local v : word `i' of `vars'
  rename seq`i' `v'

codebook, compact

// #7
// -list- beer tasting key for each person
capture log BeerKeys close
log using BeerKeys, name(log2) replace
foreach var of varlist z_Todd-z_Unknown2 {
  list `var', sep(6)

log close _all

Monday, November 14, 2011

-forvalues- nested in -forvalues-

While not working on the statistical methods section of my proposal -- I'm banging my head against my desk trying to figure out how to assess predictive value in the absence of a gold standard -- I was playing around with Stata -forvalues- loops in a program I wrote that isn't even remotely related to my dissertation work.  Nevertheless, I was pleased when I managed to get a nested -forvalues- loop working properly (even if the time spent working on it was a couple hours more than expected).  

The problem:  I created 27 variables where each variable contained a random ordering of the number sequence [1, 6] and I wanted to compare the first variable with the 26 following, the second variable with 25 following, the third variable with the 24 following, and so on, to verify that no two variables contained equivalent number orderings.  I wanted, essentially, 26! (factorial) variables to be created with no two variable pairs being repeated.  For example, once the comparison of variable one and variable two was accomplished with the creation of the new variable, var1_2, it wasn't necessary to compare them again in the other direction, i.e. creation of var2_1.  So what I needed was an outer -forvalues- loop that looped from 1 to 26 and an inner -forvalues- loop that looped from  2 to 27.  After much conceptualizing, tweaking, and experimenting, I eventually found success with this:

forvalues i = 1(1)26 {
 forvalues j = `i'(1)26 {
  display "-------------------------------------------------"
  display "Variables being compared are seq`i' and seq`++j'"
  gen var`i'_`j' = 1 if seq`i' == seq`j'
  quietly sum var`i'_`j' if `i' != `j'
  assert `r(sum)' < 6
  drop var`i'*
Solution/Explanation:  In the first iteration of the outer -forvalues- loop, i is set to one and j is initially set to one, but as soon as the looping begins, the j index is incremented by one in the -display- statement.  The ++j syntax tells Stata to increment the index before evaluating the value thus changing the value from one to two.  Now when the comparison variable is generated, the macros evaluate to i=1 and j=2, thus creating var1_2 for the already-existing variables, seq1 and seq2.  After the first iteration of the inner loop is completed, the j index is again incremented from two to three -- the outer index is still one -- thus generating another variable, var1_3, comparing seq1 to seq3.  After the inner loop completes -- generation of 26 variables later -- the outer loop (index i) is incremented from one to two and the inner loop (index j) is reset to start at two but as soon as the looping begins, this value is incremented from two to three by way of the ++j macro call.  This results in the generation of var2_3, var2_4, and so on up thru the 27th variable, var2_27.  This tandem looping continues up thru the 26th variable when only one comparator variable is created:  var26_27

This was a novel programming exercise for me for two reasons:  (1) I haven't had a lot of experience with -forvalues- loops in Stata (none with nesting them!); and (2) the incrementation of a macro value when calling the macro is pretty damn powerful (++j) and a technique I plan to add to my Stata toolbox.

Tuesday, November 8, 2011

Diagnostic and Screening Test Validity

Sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) constitute the four primary mechanisms for assessing the validity of a diagnostic or screening test.  You see the words "sensitivity" and "specificity" bandied about in the medical literature and discussed (albeit briefly) in some epidemiology or biostatistics textbooks, but I had yet to encounter as concise, well-written, and elegantly explained description of these diagnostic tools as the one in chapter eight of Trisha Greenhalgh's "How to Read a Paper:  The Basics of Evidence-Based Medicine".  (I had been considering a blog post on this topic for quite some time but hadn't gotten around to it -- big surprise -- until a recent email exchange with my adviser re: my difficulty in developing my statistical analysis plan and her subsequent clarification that my analysis should include a "predictive value" component.  The "predictive value" I eventually settle on may have little to do with sensitivity, specificity, PPV, and NPV, nevertheless, these concepts form the foundation that the aforementioned will draw upon.  This blog post borrows heavily from Greenhalgh's text.) 

The introduction of sensitivity and specificity can range from the use of conditional probability statements to the drawing up of a "jury verdict versus true criminal status" 2x2 table.  I learned sensitivity/specificity both ways and found that they complemented each other and enhanced my understanding of a concept that arises in the epi and medical literature with tremendous frequency.  In the "jury verdict versus true criminal status" method, a table is drawn up such that all possible outcomes are presented in the four cells of a 2x2 table.  In an ideal world, all murderers would be rightly convicted and those innocent would be rightly acquitted.  But the ideal world is rarely realized so we compute statistics to summarize the quality of a test, establish benchmarks, and either choose to ignore or use a test depending on its validity.  In this example, the sensitivity is the proportion of murderers that were convicted -- a/(a + c) -- whereas the specificity is the proportion of non-murderers acquitted -- d/(b + d).  The PPV is the probability that someone convicted of murder actually did it and the NPV is the probability that a person acquitted is actually innocent. 

True Criminal Status
Murderer (a + c)
Not murderer (b + d)
Jury Verdict
Guilty (a + b)
rightly convicted (a)
wrongly convicted (b)
Innocent (c + d)
wrongly acquitted (c)
rightly acquitted (d)

More formally, the definitions -- along with their probability statements and mathematical calculations assuming a 2x2 table configuration -- are presented below:

Test Primary Name
Alternative Name
Central Question the Test Answers
Conditional Probability Statement*
Test Formula**
True Positive Rate
How good is test at identifying those with the condition?
P(T+|D+) =

P(T+ ∩ D+)
a/(a + c)
True Negative Rate
How good is test at excluding those without the condition?
P(T-|D-) =

P(T- ∩ D-)
d/(b + d)
Positive Predictive Value (PPV)
Post-test Probability of Positive Test
What is probability of having condition if test is positive?
P(D+|T+) =

P(D+ ∩ T+)
a/(a + b)
Negative Predictive Value (NPV)
Post-test Probability of Negative Test
What is probability of not having condition if test is negative?
P(D-|T-) =

P(D- ∩ T-)
d/(c + d)
* P denotes probability, T denotes test, D denotes disease, and the + and – indicate positivity or negativity.
** The letters a, b, c, & d correspond to the four cells of a 2x2 table where a is the upper-left, b is the upper-right, c is the lower-left, and d is the lower-right.

Symbolically, the data can (and ought) to be presented by way of a 2x2 table:

Reference Criterion/Condition/Disease
Diseased (a + c)
Not Diseased (b +d)
Test Result
Positive (a + b)
True Positive (a)
False Positive (b)
Negative (c + d)
False Negative (c)
True Negative (d)

Now consider the example presented by Greenhalgh to illustrate the calculation and interpretation of sensitivity, specificity, PPV, and NPV.  The data are presented below with the calculations following:

Gold Standard Glucose Test (2h OGTT)
(6 + 21 = 27)
Not Diseased
(7 + 966 = 973)
Glucose Test Result
Glucose Present (6+7=13)
Glucose Absent
(21 + 966 = 987)

Sensitivity:  6/27 = 22.2%
Specificity:  966/973 = 99.3%
Positive Predictive Value (PPV):  6/13 = 46.2%
Negative Predictive Value (NPV):  966/987 = 97.9%

In this scenario, the sensitivity is lousy but the specificity is quite good.  That is, the test captures only about a fifth of those that are actually diseased whereas it identifies nearly all of those that are actually disease-free.  The PPV is the probability the person is actually diseased given that they have glucose present -- a value this low would warrant a second or follow-up test -- whereas the NPV, although not 100%, indicates that the probability of not being diseased, considering the negative test result, is quite high. 

One final point -- and another crucial distinction Greenhalgh makes between sensitivity/specificity and PPV/NPV that is sometimes glossed over in other texts is this:

"As a rule of thumb, the sensitivity or specificity tells you about the test in general, whereas the predictive value tells you about what a particular test result means for the patient in front of you.  Hence, sensitivity and specificity are generally used more by epidemiologists and public health specialists whose day-to-day work involves making decisions about populations" (pp. 103, emphasis in original text). 

Monday, August 29, 2011

Mountainman Ultra 80k: DNS

DNS:  Did Not Start.  As bummed as I was to have to withdraw from the race, I figured it was for the best since running with a less-than-healthy calf and Achilles heel probably wouldn't have been the wisest decision.  Lisa and I had, however, already purchased plane tickets to Switzerland so we scrapped our weekend plans for Lucerne (venue for the race) and instead went to Zermatt.  You can find the posting for that trip here.  Let's hope I can stay healthy enough to run an ultra over here in Europe at some point in the next six-nine months...

Wednesday, August 24, 2011

"I think you'll find it's a bit more complicated than that"

I just finished reading Ben Goldacre's "Bad Science:  Quacks, Hacks, and Big Pharma Flacks".  I liked it.  Actually, I really liked it.  What I liked most about it, though, was how you could sense how pissed off Goldacre is without the book devolving into a tired and lengthy tirade.  Goldacre is a medical doctor and, perhaps more importantly, a scientist that values data and data-driven evidence above all else.  He has no patience for anecdote, sham studies, and pseudo-scientific professions, namely homeopaths and nutritionists.  And he rails against the mass media and their inability (unwillingness?) to accurately report science stories resulting in a further "dumbing down" of the media and, most disturbingly, of the masses. 

I hesitate to reduce the book to a single sentence -- a theme -- but if forced to, a sentence on page 52 would suffice:  "Transparency and detail are everything in science."  So, so true.  But what of the execution of this mandate?  The use and presentation of statistics is a natural place to focus one's efforts, although the process starts long before the graphs are generated and the p-values are calculated:  "Overall, doing research robustly and fairly...simply requires that you think before you start" (pp. 53).  As for the statistics, Goldacre discusses statistical sleights of hand employed by mainstream medicine as well as devotes an entire chapter ("Bad Stats") to discuss how statistics are misused and misunderstood.  In his discussion on how mainstream medicine -- big pharma -- tries to distort data and results, he proffers the following tricks used by the pharmaceutical industry:

1.  Study the drug/device in winners.  That is, avoid enrolling people on lots of drugs or with lots of complications -- the "no-hopers" -- since they are less likely than a younger, healthier subject to exhibit any improvement. 
2.  Compare the drug against a useless control.  In this scenario, the drug companies intentionally avoid comparing their drug to the standard treatment (another drug on market) since placebo-controlled trials are more likely to yield unambiguously positive results.
3.  When comparing to another drug, administer the competing drug incorrectly or in a different dose.  This trick seems especially underhanded since only those with an intimate knowledge of the drug being manipulated would be able to identify the manipulation.  Nevertheless, doing this is intended to increase the incidence of side effects of the competing drug and, thus, make the experimental drug look much better.
4.  Avoid collecting data about side effects or collect them in such a way that key side effects are down-played or ignored altogether.
5.  Employ a "surrogate outcome"  rather than a real-world outcome.  For example, designate reduced cholesterol rather than cardiac death as the endpoint.  Reaching this endpoint is easier, cheaper, and faster to achieve.
6.  Bury the disappointing or negative data in your trial in the text of the paper.  And certainly don't highlight it by graphing it or mentioning it in the abstract or conclusion.  
7.  Don't publish or postpone publishing negative data after a long delay.  Perhaps in the meantime, another study returning less negative results (maybe even positive results?) -- and one likely conducted by the same investigators -- might supersede the negative results?  (Although Goldacre doesn't mention it in this section -- he does elsewhere -- sometimes the non-publication of negative results isn't altogether the investigators fault:  virtually all journals have been shown to exhibit "publication bias":  the bias towards publishing only studies with positive, statistically significant results.) 

And the more statistical-related tricks:

8.  Ignore the protocol entirely.  Rather than stick to the statistical analysis plan outlined in the protocol, run as many statistical tests as possible and report any statistically significant association, especially if it supports (even tangentially) your thesis.
9.  Play with the baseline.  "Adjust" for baseline values depending on which group is doing better at the beginning of the study:  adjust if the placebo group is better off and don't adjust if the treatment group is better off.
10.  Ignore dropouts:  These folks tend to fare worse in trials so don't aggressively follow them up for the endpoint or include them in analysis.
11.  Clean up the data:  Selectively include or exclude data based on how the data points affect the drug's performance.
12.  The best of...:  This refers to use of flexible interim analysis, that is, premature stopping of the trial if the results statistically favor the experimental drug or extension of the trial by a few months in hopes that the nearly significant results become significant.
13.  Torture the data.  Conduct all sorts of sub-group analyses in hopes that something, anything, pops out as statistically significant.
14.  Try every button on the computer:  Run every statistical test you can think of irrespective of whether it is appropriate or not.  You're bound to hit on something significant!

In the "Bad Stats" chapter, Goldacre briefly discusses the idea of "statistically significant" and since this is a central feature of research, it's worth repeating Goldacre's definition here:  "'statistically significant' [is] just a way of expressing the likelihood that the result you got was attributable merely to chance" (pp. 194).  He goes on:  "The standard cutoff point for statistical significance is a p-value of 0.05, which is just another way of saying, 'If I did this experiment a hundred times, I'd expect a spurious positive result on five occasions, just by chance'" (pp. 194-5).  

Science and data-driven decision making are, unfortunately, under full-on assault in the world (especially in the United States?) thus making this book as timely and necessary as ever.

Wednesday, August 17, 2011

Pre-Proposal Template, Final

On to development and writing of the proposal!  (That doesn't seem to justify any sense of elation, but given how long it took me to conceive of this topic -- waaaaaay longer than it should have? -- I might as well soak it up.  Little victories!)

Anyway, in a previous post I provided a template for a pre-proposal:  a four to six page document outlining the proposed research topic, why it's important, and how the research question will be answered.  Since that time, I had to pare the pre-proposal down into a two page document (three including a reference page) for circulation to the network of clinicians that own the data and although not as comprehensive as the first pre-proposal, I think this version constitutes a good "final".  (Now let's hope the network approves the research topic.)
Here's to the permanent archival of my pre-proposal and with it, a more focused development and assembly of the actual proposal...

Thursday, August 11, 2011

I hope I don't get chased...

...because I'm not sure I could run from them.

I've been injured before -- what athlete hasn't? -- but none of my prior injuries sidelined me like the one I'm currently working through.  In early June I went out for a run with my sister-in-law and we were running at a decent clip -- faster than I would usually run -- but by no means so fast that we still couldn't chat with each other.  During the run my Achilles area on my left leg, as well as my left calf, felt a bit tender but not so much so that it was uncomfortable.  I thought nothing of it.  That is, until I tried to run again a couple of days later and was unable to.  I didn't even make it to the end of the block.  I returned to my apartment building -- dejected and confused -- but determined to get a workout in out so I went down to the basement gym and proceeded to row 4,000 meters or so.  Turns out -- so obvious in retrospect -- that substituting a rowing workout for a running workout when you have tenderness in your lower leg area isn't the smartest idea.  I stopped running altogether but continued to row for the next week or so with the mild discomfort subsiding as I rowed.  I thought things were looking up...until I went to go see my wife's chiropractor in Arlington, VA (NOVA Pain & Rehab) when we returned to the DC area for a week.  Following the consult, he diagnosed me with an acute case of tendonitis in my lower left leg and recommended that I avoid running altogether for at least two weeks, ice my lower leg twice daily, tape the calf, foam roll my IT band & calf, and start strengthening the muscles on the lateral side of my leg.  And if I chose to ride my bicycle in the meantime, I was to avoid any hill-climbing.  The prognosis wasn't what I was hoping for, but it could have been worse.  And in the grand scheme of things -- insofar as athletic injuries are concerned -- I think I've been pretty fortunate over the years.  What I was most bummed about, though, was that I'd have to withdraw from an ultramarathon I was slated to run in mid-August.  This was going to be my first European ultra and I was quite excited to run it, but alas, there's always next year...

This thing gets a lot of love from me

As for what was responsible for my injury -- I can't say with 100% conviction what it was -- but I suspect it may have been from months of residual stress due to minimalist running.  My wife, however, thinks it may have been from the 2-3 times per week rowing workout.  Either way, I've since returned to some running and feel relatively decent, although I'm running in my trail shoes (Montrail Mountain Masochists) rather than my Vibram Five Fingers.  I'm not sure whether I'll return to running pavement in the Vibrams full-time -- maybe once a week to maintain foot and ankle strength -- but I'm not sure my stride has become efficient enough to handle near 100% minimalist running. 

During the down-time I reacquainted myself with my rode bike and although she left me saddle-sore after nearly every ride for the first couple of weeks, I've now regained my tolerance and quite enjoy our time together as the miles roll by.  Here's to continuing to cross-train and complement my running with a regular road ride...

Riding along the bike path bisecting the Danube Island

Monday, August 8, 2011


A friend (and former class mate) and me exchanged emails a few days ago after I had recommended a couple of R books on her Fb page (she was soliciting the help of someone with access to SAS) and during the email exchange, I offered to run her code in SAS so as to corroborate her output in R.  (Disclaimer:  I may own a handful of R books but I'm not, by any means, nearly as proficient in R as in SAS.)  She sent me a snippet of her code in R and she was trying to run a bivariate logistic regression where the independent variable (IV) was categorical with a half-dozen categories.  Although you can obtain a single estimate (e.g. odds ratio) for the IV, this rarely makes much sense unless the IV is ordinal and the effect on the dependent variable (DV) is thought to be linear.  Unlike SAS, it isn't immediately obvious how to modify the general linear model command (glm) in R to obtain odds ratios for k-1 levels of a categorical variable, relative to level k.  Turns out, you don't modify the glm command:  you declare the variable to be a "factor" variable prior to running the command.  In SAS, the LOGISTIC procedure allows you declare any categorical variables as "class" variables before the MODEL command with the option to explicitly specify which one of the levels will be the referent.  (In R, you need to "re-level" the categorical variable with the relevel() command.)  This seems fairly intuitive in retrospect, but given that I had never conducted a logistic regression in R and haven't been using the program much lately, I had to browse through a couple of my R books -- "SAS and R:  Data Management, Statistical Analysis, and Graphics" (Kleinman & Horton) and "Modern Applied Statistics with S" (Venables & Ripley) -- to get where I needed to go.  The dataset I used to compare outputs in R and SAS was referenced in Venables and Ripley's text but originally came from Hosmer and Lemeshow's "Applied Logistic Regression".  The dataset contains n=189 observations and sought to identify risk factors leading to low birth weight babies.  Since my friend was most keen on estimating odds ratios for a categorical variable, I decided to run the model presented on page 32 of Hosmer & Lemeshow where the IVs were race and subject (mother) weight (this also allowed me to merry up my results with those contained in the textbook).  The code for each program follows:


# ## read in low birth weight data (Hosmer & Lemeshow, 1989)

# ##verify dataset contents

# ##change categorical independent variable to factor variable
birthwt$race <- factor(birthwt$race,
birthwt$race <- as.factor(birthwt$race)
# ##Logistic model
birthwt.glm <- glm(low ~ lwt + race, family=binomial,

# ##exponentiate coefficients and return confidence intervals
exp(confint(birthwt.glm, level=.95))

# !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! #
# if want to change reference category of race, then use -relevel()- and re-run GLM
birthwt$race <- relevel(birthwt$race, "other")  

And the SAS code:
* **verify data contents
proc contents data=birthwt position; run;

* **format race variable;
proc format;
  value racef 1='White' 2='Black' 3='Other';
proc freq data=birthwt; tables race; format race racef.; run;

* **Logistic Model;
proc logistic data=birthwt descending;
  class race (param=REF ref='White');
  model low = lwt race / risklimits;
  format race racef.;

The coefficients and odds ratios (exponentiated coefficients) returned in the two programs were identical, although the confidence intervals on the odds ratios exhibited negligible differences.

Friday, August 5, 2011

Macro: May I Have This Date?

In my last job I wrote a SAS program that read in 150+ quotes (mostly snarky, oxymoronic, and anti-religious that were all of my choosing), randomly selected one, then printed it at the bottom of a daily email (batch verification) sent to the project biostatisticians.  (I'm not sure how I justified doing this when there were more pressing tasks to be completed.  Oh wait -- I remember -- a little levity was needed to temper the dysfunction infecting the project.)  

At any rate, the random selection of the daily quote in my SAS program required just a couple of steps.  First, I assigned a 'random' number from the uniform distribution to each quote with the 'seed' being the current date.  (If you don't specify a seed, I think SAS may default to using the current date, but in an effort to minimize confusion, I chose to be explicit.)  I then sorted the quotes in ascending order according to the randomly generated numbers with the first one being selected for output to the daily email.  I wanted to duplicate this process in Stata (probably should have been working on my proposal...oh well) but instead of outputting the quote to an automatically generated email, I sought to write a program (.ado file) that would display a randomly selected quote into the output window with the calling of my user-written command, -quote-.  Seemed straightforward enough, until I had to assign the current date to a macro in Stata...

In both SAS and Stata, dates are numbers in that the value assigned to the date is the number of days lapsed (or preceding) January 1, 1960.  This means that unless you format the date variables (or macros) you create to display a date format (e.g. "08/05/2011", "20110805", "August 5, 2011", etc.), you'll simply get the number of days since 01/01/1960 (e.g. 18,844).  This makes working with dates relatively simple and intuitive, assuming that the system date is stored as number.  In SAS, you can assign the current date to a variable or macro by invoking the system-stored current date via the today() function (per SAS documentation:  "a function that returns a SAS date value corresponding to the date on which the SAS program is initiated").  In Stata, however, the system date value is accessed by invoking either the global macro, $S_DATE, or the system value for the current date [c(current_date)].  If I were interested in simply printing the current date on, say, an updated daily graph, I could just assign the system date to a macro, let's call it cdate, and be done with it since it defaults to printing the actual date ("5 Aug 2011").  But since I wanted the date to be a number (for seeding purposes), I needed the value to be the numeric date (i.e. the number of days since 01/01/1960).  This required a quick search of the Statalist Archives to find out if any other users have dealt with the issue and if so, how.  Fortunately for me, others had and the workaround was rather simple:  remove the spaces from the date value then declare the value a Stata date formatted as day-month-year ("DMY").  (I didn't find exactly what I needed on the listserve, although I was able to tweak the suggestions proffered to get what I needed.)  I then assigned this date value to a Stata macro.  The code I used to assign a random number from the uniform distribution for each quote for both Stata and SAS follows:

local cdate = date(subinstr("$S_DATE" , " " , "" , .), "DMY")
* set seed to current date...
set seed `cdate'
* assign random number from uniform distribution
gen xselect = runiform()

*Generate random num for each quote using the uniform distrib.;
data RandomNumbers;
  do i=1 to &NumQuotes;

In the SAS program, I had to generate the random numbers in a separate dataset with the number of observations equaling the number of quotes (&NumQuotes) then merge said dataset with the dataset containing the quotes.  In Stata this step wasn't necessary -- I was able to create a new variable, xselect, containing randomly generated numbers from the uniform distribution directly into the dataset containing the quotes.  In both the Stata and SAS programs, I then sorted by the variable containing the randomly generated numbers -- xselect in Stata and r in SAS -- and chose the first quote (i.e. the quote with the lowest random number value).  

After identifying the first quote in each program, I assigned the quote to a macro then output the quote to either an automatically-generated email (SAS) or the output window (Stata).  With my departure from my previous job, I'm obviously no longer subjecting myself to a barrage of daily emails, although the snark, oxymoronica, or anti-religious quotes are just one short command away in Stata...

. quote
Quote of the day: 
    Maybe this world is just another planet's hell. (Aldous Huxley)

Wednesday, July 13, 2011

Workflow and Stata

I just finished reading J. Scott Long's "The Workflow of Data Analysis Using Stata" and this book is, in short, well-written and a great resource.  I bought this book because I wanted to approach and conduct my dissertation research much more systematically -- I didn't want to get into the situation of having a sprawling directory structure, multiple "analysis" datasets, and so many .do files I couldn't possibly reproduce my results.  I have yet to embark on any meaningful statistical analysis for my dissertation -- that follows submission of my proposal and IRB approval -- although I did assemble the directory structure, mapped out variable and value labels for key variables, and implemented a standard .do file structure using the advice proffered in this book. 

Prior to getting into the nitty-gritty, Long outlines the four steps in the workflow process:  data cleansing, analysis, results presentation, and file protection.  The first step -- data cleansing -- is, essentially, the crux of the book and constitutes the lion's share of the text.  In this first step -- it implicitly includes data preparation -- Long discusses how to plan, organize, and document your analysis, the mechanics of writing a smart and readable .do file, use of macros and loops to increase efficiency and reduce errors, and, of course, variable creation, naming, and labeling.  The section on assembling a directory structure is especially helpful -- Long provides example directories for simple projects, large one-person projects, and collaborative projects.  Long also provides the reader with a couple of templates for .do files:  one simple and one complex (I tried to figure out how to create something like this a couple of years ago but was unsuccessful).  The template I've started using in all of my analyses is the "complex" one and, per Long's direction, I load the template using the keystroke Alt+1 -- I set up the script using AutoHotKey -- although I haven't figured out how to have my name and date automatically loaded into the script (I manually add my name, the program name, and date after I load the template to the .do file).  An example of the script follows:

capture log close
log using _name_, replace

// program:
// task:
// project:
// author:    _who_
// born on date:  _date_

// #0
// program setup

version 11.2
clear all
macro drop _all
set memory 75m
set more off

// #1
// describe task 1

<other tasks here>

log close

Pretty nifty, eh?  Yeah, I thought so, too.   In addition to the .do file template, Long also suggests how .do files should be named so that all of your analysis can be run from a master .do file in a logical and reproducible way (incorporation of numbers into the .do file name such that the proper sequence of .do files is maintained).

Long concludes his text with chapters on presenting results and file protection, the former of which I've struggled with in terms of efficiency and the latter of which has occurred to me from time to time.  He motivates the section on presenting results with a quote from "The Art of Scientific Writing" by Ebel, Bliefort, and Russey:  "Underlying all of natural science is a rather remarkable understanding, albeit one that attracts relatively little attention:  Everything measured, detected, invented, or arrived at theoretically in the name of science must, as soon as possible, be made public -- complete with all the details."  Long then goes on to briefly discuss tables, spreadsheets, graphs, and outputting of results from estimation commands.  This section is a good jumping off point for further research into how best to output and present results from your statistical analysis.  The section on file protection usually gets short shrift in the grand scheme of data analysis, but Long manages to impart how important this step is -- and one I need to implement on a regular basis. 

If you are data analyst, statistician, researcher, or even a professor -- I suspect you'll benefit from this book (I know I will) -- especially since Long writes from the perspective of all four and, thus, draws from both his good and bad experiences.

Tuesday, June 21, 2011

Popularity Contest

Stata, SAS, or R?  Not quite a life-or-death decision, but one most data analysts encounter at some point in their professional/academic lives.  Most data analysts profess some sort of allegiance to a particular program -- we aren't quite as fanatical as Red Sox or Yankee fans, however -- although most of us have strong views about the strengths and weaknesses of the programs we use as well as the ones we don't use.

My first encounter with a data analysis software was in a "Statistics for Scientists and Engineers" course over ten years ago.  The course included a lab where we had to use SAS to input & manipulate data, generate summary & descriptive statistics, and output graphs.  Although the tasks and the material were relatively straightforward, all I remember is being so confused.  I just didn't get it.  I couldn't understand how you could just declare a variable in a DATA step and be done with it.  To say I wasn't a natural would be a gross understatement.  My next experience using SAS (undergraduate Econometrics course), fortunately, was much more encouraging, as were all the experiences following, even going so far as to take (and pass) the Base SAS Competency Exam shortly after finishing my masters.

My experience with Stata, however, couldn't have been more different.  Shortly after starting my masters, a classmate recruited me to help her TA an "Introduction to Biostatistics" course in the Master of Public Health (MPH) program where our primary responsibility was to teach the lab component of the course using Stata.  Maybe it was my exposure to SAS two years prior and the chance to develop a "programming sense", but Stata was more intuitive.  I got it.  And I enjoyed teaching it to others.  I even subscribed to the Stata listserv, bought a Stata mug, and eventually gave a presentation at a Stata User Group meeting about how the course I TA'd incorporated Stata into the curriculum.  I'm planning to use Stata as my primary data management and statistical analysis program for my dissertation. 

I don't consider myself an expert in either SAS or Stata -- I think you'd have to spend the better part of your working adult life programming in either/both language(s) to make that claim -- and am even less so with R.  I first encountered R as a masters student -- I even bought a textbook about how to conduct regression using R -- but was so enamored with Stata that I didn't develop much skill using it.  Fast forward several years and R emerges as one of the most used, discussed, and capable data management/analysis programs on the market.  Everyone seems to be using it -- academics and pharmaceutical companies alike -- with little indication that its growth is about to slow.  And it certainly helps that R is freely available for download with its capabilities continuing to grow everyday.  I started using R in earnest -- mostly estimation of cumulative incidence between treatment groups and output of accompanying graphs -- while working as a biostatistician for a contract research organization and although my current use has scaled back from an already low level, I will continue to use it when needed.

So which data analysis software is most popular?  My experience using SAS, Stata, and R has been driven by both happenstance and circumstance -- a history likely shared by many data analysts -- thus making it difficult to unequivocally determine which software is most popular.  This hasn't stopped Robert Muenchen, a statistical consultant with ~29 years of experience, from trying to find out, however.  In his periodically updated analysis of which software is most popular, accessible here, he presents various ways of measuring popularity and market share.  Although no one software comes out on top with respect to every metric, R appears to be chattered about the most on the web (figure 1), Stata appears to be emerging as a formidable software program in academic circles (figure 2), and SAS (as well as SPSS) appear most often among the software skills that employers are seeking (figure 3).  All three figures and their captions are taken from the website.
Figure 1.  Plot of listserv discussion traffic by year (through 12/31/2010).
Figure 2. Impact of data analysis software on academic publications as measured by hits on Google Scholar.
Figure 3. Number of jobs listing each software package in its requirements on June 27, 2010. The maximum they will display is 1,000.
And, finally, the point?  Well, no one program appears to trump all the others.  So the best course of action?  Accumulate experience with all three, but master one (or two).

Thursday, June 9, 2011

Pre-Proposal Template, v.1

The research/dissertation phase of a doctoral program can be very peculiar.  Some programs provide a well-worn map of the research/dissertation process whereas others, well, not so much.  Perhaps in the latter this is intentional and meant to emulate the research process in 'the real world'?  Or maybe this is the graduate-level version of having to walk uphill to school both ways?  I don't know.  But either way, my program is of the latter flavor and although I appreciate the lessons I'm learning about development of a research topic, I also appreciate efficiency, mentorship, and guided-but-restrained direction.  As I write this, I realize that maybe my views are terribly misguided and naive, but at this point, I'm unable to completely sever the idea that the process can't be improved.  And thus the topic of this post.  While trying to identify a research topic and distill it into a workable thesis I iterated through a couple of prospectuses (research idea #1), then a couple of concept papers (research idea #2), and now a pre-proposal (still research idea #2 -- yay!).  Up until a few of months ago, I was creating and circulating these documents under the impression that they weren't really required (the documents helped to shape and guide my thinking).  Turns out I was wrong; apparently all that is required is a one page description of the topic.  But even this appears to not be 100% accurate, considering what a fellow student assembled and submitted to the faculty.  Assuming that past student experience is more reliable that present prescription, a 3-6 page double-spaced pre-proposal appears to be what is really sought.  Embedded is the format I'm currently using (my pre-proposal has yet to be accepted and until it is, I'll delay posting the format of the final version).

Tuesday, May 31, 2011


Epidemiology is literally defined as "the study of that which falls upon the common people" -- a definition which follows from the parsing out of the word into its Latin components:  epi (on or upon), demos (the common people), and logy (study).  The definition has evolved over the last century or so and in my first Introductory Epidemiology textbook, the following is offered:  "The study of the distribution and determinants of disease frequency in human populations and the application of this study to control health problems" (Aschengrau & Seage, 2003, Essentials of Epidemiology in Public Health).  In my intermediate textbook, the authors spare the reader an exposition of the term and instead summarize epidemiology as "the study of the distribution and determinants of health-related states or events in specified populations and the application of this study to control health problems" (Szklo & Nieto, 2007, Epidemiology:  Beyond the Basics).  And in my most advanced epidemiology textbook, a definition isn't even proffered until the third chapter:  "epidemiology [is] the study of the distribution of health-related states and events in populations" (Rothman, Greenland, & Lash, 2008, Modern Epidemiology).  Perhaps the most interesting characterization of epidemiology that I've come upon, however, is the one offered by Elizabeth Pisani, an HIV/AIDS epidemiologist and author of The Wisdom of Whores:  Bureaucrats, Brothels, and the Business of AIDS.  She summed up epidemiology as "sex and drugs" since, at least in her field of study, these are the two primary mechanisms by which HIV is transmitted.  Hers is technically not a definition -- more like a flippant answer to a question about what she does for a living -- but rather a view shaped by years of working in the gritty underworld of prostitution and drug injection. 

Epidemiology can further be further classified as descriptive or analytic.  In the former, the primary goal is to describe the epidemiologic trifecta:  person, place, and time.  The epidemiologist engaged in this activity wants to know which population a disease (health-related state) is affecting, where the disease is taking hold, and when it occurred.   Equipped with this information, an epidemiologist can identify high-risk populations and generate causal hypotheses, the testing of which falls under the purview of analytic epidemiology.  When articles in the popular press state that factor X is associated with premature death, factor Y contributes to higher virility, or factor Z leads to improved well-being among the newly married, an epidemiologist was testing a hypothesis about whether a risk factor is associated with a health outcome.  Note that epidemiology and the statistical techniques underlying the discipline have become so pervasive that a health outcome of interest may be beneficial and the risk factor(s) may not, in fact, be risky.  Modeling survival (longevity) among elderly persons -- rather than death -- and whether living longer is associated with "positive affect" is but one example of having both a positive health outcome and positive risk factor (the June 2011 issue of The American Journal of Epidemiology features an article exploring the aforementioned relationship).  Most epidemiologic research, including my own, is of the analytic variety. 

Friday, May 27, 2011

I asked for it

In the fall of 2006 I enrolled in a Ph.D. program at The George Washington University in Washington, DC.  I began my journey toward a Ph.D. as a student on the biostatistics track but the longer I remained on that track, the more obvious it became that it wasn't the best fit.  So I petitioned for a transfer to the epidemiology track, made the switch, then spent the 2008 academic year taking a few courses specific to the epi track that I hadn't already taken, preparing for my written comprehensive exam and, finally, taking (and passing) the exam in late summer 2009.  With the substantive coursework behind me, I began feeling out a faculty member for the possible appointment as head of my dissertation committee and, more importantly, tentatively identifying a potential research topic & question suitable for dissertation research.  I was expecting this process to follow somewhat quickly after the comprehensive exam, but alas, I checked-out mentally for a semester, got married the next semester, then moved abroad a few months later.  As of late-May 2011, my research question and study aims are still taking shape -- development of the research proposal can't begin in earnest until a topic and objective are identified -- although I'm getting close.  At least that is what I keep telling myself.

These ramblings are many things:  a forum to vent, a place to map out methodology concepts, share programming code, and, above all, chronicle the masochism that is the conduct of dissertation-level research.

Tuesday, April 19, 2011

Vienna City Marathon

I'm retiring from running.  At least I think I am.  OK, that's not likely, but I certainly felt like it after this race. 

My wife and I ran this race together and although she ran a solid race -- she finished strong and would have finished stronger if she hadn't waited for me -- I can't say the same for me.  It just felt like an off-day (not that I was expecting to compete with the front-runners!) and one that got worse the longer I was out there.  It's not my intent to rationalize, but I think my experimentation with a couple of minimalist shoes may have contributed to my shoddy day.  Several months ago I purchased a pair of Vibram Five Fingers and have been pleased with them insofar as my runs were 10-12 miles or less.  For my longer runs on the pavement, however, I returned to running in a pair of Nike Air Equalons and although these are good shoes that have served me well in the past, I think my evolving running stride and form have outgrown the shoe (knees would start to ache after 15 or so miles).  So I decided to purchase a pair of Nike Free Run+ shoes since they provide the cushion that the Vibrams don't without the support and bells-and-whistles that the Equalons do.  The sizing of the shoes, however, isn't one-to-one relative to the Equalons -- you need to size up a bit to accommodate the flattening (extension) of your foot -- and I, unfortunately, didn't appreciate this until a third the way through the marathon (I rec'd the shoes the week leading up to the marathon).  I'm still a believer in the advantages of minimalist running and am unlikely to return to using heavily fortified cushion/stability shoes, nevertheless, it's obvious that I need to size up my pair of Nike Free Run+ shoes or find another minimalist shoe suitable for long(er) asphalt runs. 

Aside from the shoe/feet problems, the race couldn't have been better:  it was a brilliant sunny day, the course was scenic, the crowd support (400,000 spectators!) was incredible, and race day coincided with the nine month anniversary of our (me and my wife's) move to Vienna.  This is a race definitely worth running either as a novice or veteran and one I hope to run again, albeit with happy feet. 
Near the start line (photo taken from

Wednesday, March 30, 2011


Spring and summer are my two favorite seasons.  Especially for running.  Although winter in Vienna is tolerable, it can quickly start to feel like a long, gray, and dreary slog so when the weather warms, the sun sets later, and the grayish sky turns an inviting blue, my runs become that much more enjoyable.  Yesterday, for instance, the sky was clear and the mercury hovered around 60F when I headed out.  It was beautiful.  And I wasn't the only one who thought as much:  several other runners, walkers, cyclists, and those otherwise enjoying the weather were out along the Danube Canal en mass.  Having come from Washington, DC I'm used to high-traffic running and cycling paths -- DC is ranked as one of the best running cities in the United States -- but in Vienna those same paths seem more widely used and by a wider spectrum of people.  Older folks commandeer the benches and soak up the sunshine during the day, graffiti taggers spray paint then spray paint again the walls along the path, and, of course, runners, bicycle commuters & roadies, as well as casual & speed walkers use the path from sunup to sundown.  I recently read somewhere (can't remember the exact source) that half of Vienna is open/green space.  If so, I think the city has hit on something brilliant because, it seems, what gets built gets used ("if you build it, they will come?").

Thursday, March 10, 2011


There are some unbelievably talented ultra runners out there:  Karl Meltzer, Geoff Roes, Anton Krupicka, Kilian Jornet, Dakota Jones, and Scott Jurek, among others.  All of these stars have either a blog or personal website where they chronicle their training, write race reports, and discuss issues important and relevant to them.  And if those of us who follow them are lucky, they'll also post videos featuring their running heroics.  I very much value the written word, but sometimes a picture, or even better -- a video -- inspires where paper doesn't and animates where paper can't.  I've embedded two such videos below.  The first features Anton Krupicka running in the winter and the second provides a glimpse into how Killian Jornet trains.  Damn good stuff.

Tuesday, February 22, 2011

Love Triangle

I love DC.  But I'm also enamored with Vienna, Austria.  (I've only been living in Austria for eight months whereas I lived in DC for 5.5 years.  I don't want to rush things.)  When I lived in the district -- I also lived in Arlington for the first 18 months of my time in the DC area -- I took up residence in Van Ness and Cleveland Park, both located in Northwest and mere minutes from Rock Creek Park (RCP).  Most people that descend into RCP run and cycle on the paved trail adjacent to the road bisecting the park -- perhaps they are unaware of the extensive trail network or they just don't care to use it? -- whereas if I ran anywhere near RCP, I invariably ran part of my route on some bona fide trail in the park.  The trail system in RCP is so extensive and includes so many branches that both short jaunts and longer runs are easily accommodated.  Sadly, as my days in DC started to dwindle, the reality of not living up the road from RCP really started to set in.  This bittersweet feeling, however, didn't last long when I considered -- really considered -- the life, the experiences, and the trail running (in some of the most beautiful country on earth) waiting for me overseas.  Below are a few pictures of my sultry DC mistress followed by a few of the mistress whose company, also, I cannot resist.  

The closest access point to RCP from our apartment in Cleveland Park.
Trail heading into RCP
The Valley Trail traverses a significant portion of RCP.  This sign is near the Maryland border.
Heading up a ridge.
The wooded area surrounding Vienna -- the "Vienna Woods" -- however, are not to be outdone:
The trail heading up toward Kahlenberg.  No snow at this elevation...
...Although just a bit higher, a dusting of snow awaits.
There are several small vineyards in the Vienna Woods.  This is one.  Although foggy, you can make out a few office buildings in the distance.