## Monday, August 29, 2011

### Mountainman Ultra 80k: DNS

DNS:  Did Not Start.  As bummed as I was to have to withdraw from the race, I figured it was for the best since running with a less-than-healthy calf and Achilles tendon probably wouldn't have been the wisest decision.  Lisa and I had, however, already purchased plane tickets to Switzerland so we scrapped our weekend plans for Lucerne (venue for the race) and instead went to Zermatt.  You can find the posting for that trip here.  Let's hope I can stay healthy enough to run an ultra over here in Europe at some point in the next six to nine months...

## Wednesday, August 24, 2011

### "I think you'll find it's a bit more complicated than that"

I just finished reading Ben Goldacre's "Bad Science:  Quacks, Hacks, and Big Pharma Flacks".  I liked it.  Actually, I really liked it.  What I liked most about it, though, was how you could sense how pissed off Goldacre is without the book devolving into a tired and lengthy tirade.  Goldacre is a medical doctor and, perhaps more importantly, a scientist who values data and data-driven evidence above all else.  He has no patience for anecdote, sham studies, and pseudo-scientific professions, namely homeopaths and nutritionists.  And he rails against the mass media and their inability (unwillingness?) to accurately report science stories, resulting in a further "dumbing down" of the media and, most disturbingly, of the masses.

I hesitate to reduce the book to a single sentence -- a theme -- but if forced to, a sentence on page 52 would suffice:  "Transparency and detail are everything in science."  So, so true.  But what of the execution of this mandate?  The use and presentation of statistics is a natural place to focus one's efforts, although the process starts long before the graphs are generated and the p-values are calculated:  "Overall, doing research robustly and fairly...simply requires that you think before you start" (p. 53).  As for the statistics, Goldacre discusses statistical sleights of hand employed by mainstream medicine and devotes an entire chapter ("Bad Stats") to how statistics are misused and misunderstood.  In his discussion of how mainstream medicine -- big pharma -- tries to distort data and results, he proffers the following tricks used by the pharmaceutical industry:

1.  Study the drug/device in winners.  That is, avoid enrolling people on lots of drugs or with lots of complications -- the "no-hopers" -- since they are less likely than a younger, healthier subject to exhibit any improvement.
2.  Compare the drug against a useless control.  In this scenario, the drug companies intentionally avoid comparing their drug to the standard treatment (another drug on the market) since placebo-controlled trials are more likely to yield unambiguously positive results.
3.  When comparing to another drug, administer the competing drug incorrectly or in a different dose.  This trick seems especially underhanded since only those with an intimate knowledge of the drug being manipulated would be able to identify the manipulation.  Nevertheless, doing this is intended to increase the incidence of side effects of the competing drug and, thus, make the experimental drug look much better.
4.  Avoid collecting data about side effects or collect them in such a way that key side effects are down-played or ignored altogether.
5.  Employ a "surrogate outcome"  rather than a real-world outcome.  For example, designate reduced cholesterol rather than cardiac death as the endpoint.  Reaching this endpoint is easier, cheaper, and faster to achieve.
6.  Bury the disappointing or negative data in your trial in the text of the paper.  And certainly don't highlight it by graphing it or mentioning it in the abstract or conclusion.
7.  Don't publish negative data, or publish it only after a long delay.  Perhaps in the meantime, another study returning less negative results (maybe even positive results?) -- and one likely conducted by the same investigators -- might supersede the negative results?  (Although Goldacre doesn't mention it in this section -- he does elsewhere -- sometimes the non-publication of negative results isn't altogether the investigators' fault:  virtually all journals have been shown to exhibit "publication bias", the bias towards publishing only studies with positive, statistically significant results.)

And the more statistical-related tricks:

8.  Ignore the protocol entirely.  Rather than stick to the statistical analysis plan outlined in the protocol, run as many statistical tests as possible and report any statistically significant association, especially if it supports (even tangentially) your thesis.
9.  Play with the baseline.  "Adjust" for baseline values depending on which group is doing better at the beginning of the study:  adjust if the placebo group is better off and don't adjust if the treatment group is better off.
10.  Ignore dropouts:  These folks tend to fare worse in trials so don't aggressively follow them up for the endpoint or include them in analysis.
11.  Clean up the data:  Selectively include or exclude data based on how the data points affect the drug's performance.
12.  The best of...:  This refers to use of flexible interim analysis, that is, premature stopping of the trial if the results statistically favor the experimental drug or extension of the trial by a few months in hopes that the nearly significant results become significant.
13.  Torture the data.  Conduct all sorts of sub-group analyses in hopes that something, anything, pops out as statistically significant.
14.  Try every button on the computer:  Run every statistical test you can think of irrespective of whether it is appropriate or not.  You're bound to hit on something significant!

In the "Bad Stats" chapter, Goldacre briefly discusses the idea of "statistically significant" and since this is a central feature of research, it's worth repeating Goldacre's definition here:  "'statistically significant' [is] just a way of expressing the likelihood that the result you got was attributable merely to chance" (p. 194).  He goes on:  "The standard cutoff point for statistical significance is a p-value of 0.05, which is just another way of saying, 'If I did this experiment a hundred times, I'd expect a spurious positive result on five occasions, just by chance'" (pp. 194-5).
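To make the "five occasions in a hundred" idea concrete, and to show why tricks 13 and 14 work, here is a minimal simulation sketch in R.  It is my own illustration rather than anything from the book:  the group size, the number of replications, the seed, and the function names are all arbitrary assumptions, and the second calculation assumes the tests are independent.

```r
set.seed(2011)  # arbitrary seed, only so the sketch is reproducible

# One "experiment": compare two groups drawn from the SAME distribution,
# so any statistically significant difference is spurious by construction.
one_p_value <- function(n_per_group = 30) {
  t.test(rnorm(n_per_group), rnorm(n_per_group))$p.value
}

# Repeat the null experiment many times and count the spurious "hits"
p_values <- replicate(2000, one_p_value())
false_positive_rate <- mean(p_values < 0.05)  # should land near 0.05
false_positive_rate

# Chance of at least one spurious hit when k independent tests
# are run on data with no real effect
prob_any_hit <- function(k, alpha = 0.05) 1 - (1 - alpha)^k
prob_any_hit(c(1, 5, 20))
```

With twenty independent looks at null data, the chance of at least one "significant" result is roughly 64%, which is why unplanned sub-group analyses and trying every button on the computer are so effective at manufacturing findings.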

Science and data-driven decision making are, unfortunately, under full-on assault in the world (especially in the United States?) thus making this book as timely and necessary as ever.

## Wednesday, August 17, 2011

### Pre-Proposal Template, Final

On to development and writing of the proposal!  (That doesn't seem to justify any sense of elation, but given how long it took me to conceive of this topic -- waaaaaay longer than it should have? -- I might as well soak it up.  Little victories!)

Anyway, in a previous post I provided a template for a pre-proposal:  a four- to six-page document outlining the proposed research topic, why it's important, and how the research question will be answered.  Since that time, I had to pare the pre-proposal down into a two-page document (three including a reference page) for circulation to the network of clinicians that own the data and, although not as comprehensive as the first pre-proposal, I think this version constitutes a good "final".  (Now let's hope the network approves the research topic.)

Here's to the permanent archival of my pre-proposal and with it, a more focused development and assembly of the actual proposal...

## Thursday, August 11, 2011

### I hope I don't get chased...

...because I'm not sure I could run from them.

I've been injured before -- what athlete hasn't? -- but none of my prior injuries sidelined me like the one I'm currently working through.  In early June I went out for a run with my sister-in-law and we were running at a decent clip -- faster than I would usually run -- but by no means so fast that we couldn't still chat with each other.  During the run my Achilles area on my left leg, as well as my left calf, felt a bit tender but not so much so that it was uncomfortable.  I thought nothing of it.  That is, until I tried to run again a couple of days later and was unable to.  I didn't even make it to the end of the block.  I returned to my apartment building -- dejected and confused -- but determined to get a workout in, so I went down to the basement gym and proceeded to row 4,000 meters or so.  Turns out -- so obvious in retrospect -- that substituting a rowing workout for a running workout when you have tenderness in your lower leg area isn't the smartest idea.  I stopped running altogether but continued to row for the next week or so with the mild discomfort subsiding as I rowed.

I thought things were looking up...until I went to go see my wife's chiropractor in Arlington, VA (NOVA Pain & Rehab) when we returned to the DC area for a week.  Following the consult, he diagnosed me with an acute case of tendonitis in my lower left leg and recommended that I avoid running altogether for at least two weeks, ice my lower leg twice daily, tape the calf, foam roll my IT band & calf, and start strengthening the muscles on the lateral side of my leg.  And if I chose to ride my bicycle in the meantime, I was to avoid any hill-climbing.  The prognosis wasn't what I was hoping for, but it could have been worse.  And in the grand scheme of things -- insofar as athletic injuries are concerned -- I think I've been pretty fortunate over the years.  What I was most bummed about, though, was that I'd have to withdraw from an ultramarathon I was slated to run in mid-August.  This was going to be my first European ultra and I was quite excited to run it, but alas, there's always next year...

 This thing gets a lot of love from me

As for what was responsible for my injury, I can't say with 100% conviction, but I suspect it may have been months of residual stress due to minimalist running.  My wife, however, thinks it may have been the 2-3 times per week rowing workout.  Either way, I've since returned to some running and feel relatively decent, although I'm running in my trail shoes (Montrail Mountain Masochists) rather than my Vibram Five Fingers.  I don't know whether I'll return to running pavement in the Vibrams full-time -- maybe once a week to maintain foot and ankle strength -- since I'm not sure my stride has become efficient enough to handle near 100% minimalist running.

During the down-time I reacquainted myself with my road bike and although she left me saddle-sore after nearly every ride for the first couple of weeks, I've now regained my tolerance and quite enjoy our time together as the miles roll by.  Here's to continuing to cross-train and complement my running with a regular road ride...

 Riding along the bike path bisecting the Danube Island

## Monday, August 8, 2011

### -glm- and PROC LOGISTIC

A friend (and former classmate) and I exchanged emails a few days ago after I had recommended a couple of R books on her Fb page (she was soliciting the help of someone with access to SAS) and during the email exchange, I offered to run her code in SAS so as to corroborate her output in R.  (Disclaimer:  I may own a handful of R books but I'm not, by any means, nearly as proficient in R as in SAS.)  She sent me a snippet of her code in R and she was trying to run a bivariate logistic regression where the independent variable (IV) was categorical with a half-dozen categories.  Although you can obtain a single estimate (e.g. odds ratio) for the IV, this rarely makes much sense unless the IV is ordinal and the effect on the dependent variable (DV) is thought to be linear.  Unlike in SAS, it isn't immediately obvious in R how to modify the generalized linear model command (glm) to obtain odds ratios for k-1 levels of a categorical variable, relative to level k.  Turns out, you don't modify the glm command:  you declare the variable to be a "factor" variable prior to running the command.  In SAS, the LOGISTIC procedure allows you to declare any categorical variables as "class" variables before the MODEL statement with the option to explicitly specify which one of the levels will be the referent.  (In R, you need to "re-level" the categorical variable with the relevel() function.)  This seems fairly intuitive in retrospect, but given that I had never conducted a logistic regression in R and haven't been using the program much lately, I had to browse through a couple of my R books -- "SAS and R:  Data Management, Statistical Analysis, and Graphics" (Kleinman & Horton) and "Modern Applied Statistics with S" (Venables & Ripley) -- to get where I needed to go.

The dataset I used to compare outputs in R and SAS was referenced in Venables and Ripley's text but originally came from Hosmer and Lemeshow's "Applied Logistic Regression".  It contains n=189 observations from a study that sought to identify risk factors for low birth weight babies.  Since my friend was most keen on estimating odds ratios for a categorical variable, I decided to run the model presented on page 32 of Hosmer & Lemeshow where the IVs were race and subject (mother) weight (this also allowed me to marry up my results with those contained in the textbook).  The code for each program follows:

R
library(MASS)

# read in low birth weight data (Hosmer & Lemeshow, 1989), shipped with MASS
data(birthwt)

# verify dataset contents
str(birthwt)

# change categorical independent variable to a factor variable
# (race is coded 1 = white, 2 = black, 3 = other)
birthwt$race <- factor(birthwt$race,
labels=c("white","black","other"))
table(birthwt$race)

# logistic model
birthwt.glm <- glm(low ~ lwt + race, family=binomial, data=birthwt)
summary(birthwt.glm)

# exponentiate coefficients and return confidence intervals
exp(birthwt.glm$coefficients)
exp(confint(birthwt.glm, level=.95))

# to change the reference category of race, use relevel() and re-run the GLM
birthwt$race <- relevel(birthwt$race, ref="other")
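As a rough sketch of what re-leveling does to the estimates (my own illustration, not part of the original email exchange), the snippet below fits the same model with two different referents.  The object names fit_white, fit_other, and bw_other are made up for this example.

```r
library(MASS)  # provides the birthwt data

data(birthwt)
birthwt$race <- factor(birthwt$race, labels = c("white", "black", "other"))

# By default the first factor level ("white" here) is the referent
fit_white <- glm(low ~ lwt + race, family = binomial, data = birthwt)

# Same model with "other" as the referent; work on a copy so birthwt is untouched
bw_other <- transform(birthwt, race = relevel(race, ref = "other"))
fit_other <- glm(low ~ lwt + race, family = binomial, data = bw_other)

# Odds ratios relative to each referent
exp(coef(fit_white))
exp(coef(fit_other))

# "other vs. white" and "white vs. other" are reciprocals, so this is ~1
exp(coef(fit_white)["raceother"]) * exp(coef(fit_other)["racewhite"])
```

Re-leveling only changes which comparisons are reported; the fitted model, and the estimate for lwt, stay the same.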

And the SAS code:
* **verify data contents;
proc contents data=birthwt position; run;

* **format race variable;
proc format;
value racef 1='White' 2='Black' 3='Other';
run;
proc freq data=birthwt; tables race; format race racef.; run;

* **Logistic Model;
proc logistic data=birthwt descending;
class race (param=REF ref='White');
model low = lwt race / risklimits;
format race racef.;
run;

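The coefficients and odds ratios (exponentiated coefficients) returned by the two programs were identical, although the confidence intervals on the odds ratios exhibited negligible differences.  The likely explanation is that confint() computes profile-likelihood intervals for a glm fit, whereas the RISKLIMITS option in PROC LOGISTIC requests Wald intervals.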
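One way to probe the small differences in the confidence intervals, assuming they come from the interval method rather than from the fit itself, is to compute both kinds of interval in R.  This is a sketch I did not run against the SAS output:  confint() on a glm gives profile-likelihood limits, while confint.default() gives Wald limits (estimate plus or minus a normal quantile times the standard error).

```r
library(MASS)  # provides the birthwt data

data(birthwt)
birthwt$race <- factor(birthwt$race, labels = c("white", "black", "other"))
fit <- glm(low ~ lwt + race, family = binomial, data = birthwt)

# Profile-likelihood intervals, exponentiated to the odds-ratio scale
or_profile <- exp(confint(fit, level = 0.95))

# Wald intervals, built on the log-odds scale and then exponentiated
or_wald <- exp(confint.default(fit, level = 0.95))

# Side by side: odds ratio, Wald limits, profile-likelihood limits
round(cbind(OR = exp(coef(fit)), or_wald, or_profile), 3)
```

If the Wald limits line up with the SAS output, the discrepancy is down to the interval method and not to the model.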

## Friday, August 5, 2011

### Macro: May I Have This Date?

In my last job I wrote a SAS program that read in 150+ quotes (mostly snarky, oxymoronic, and anti-religious, all of my choosing), randomly selected one, then printed it at the bottom of a daily email (batch verification) sent to the project biostatisticians.  (I'm not sure how I justified doing this when there were more pressing tasks to be completed.  Oh wait -- I remember -- a little levity was needed to temper the dysfunction infecting the project.)

At any rate, the random selection of the daily quote in my SAS program required just a couple of steps.  First, I assigned a 'random' number from the uniform distribution to each quote with the 'seed' being the current date.  (If you pass a seed of zero, SAS initializes the random number stream from the system clock rather than the date, so in an effort to minimize confusion, I chose to be explicit.)  I then sorted the quotes in ascending order according to the randomly generated numbers with the first one being selected for output to the daily email.  I wanted to duplicate this process in Stata (probably should have been working on my proposal...oh well) but instead of outputting the quote to an automatically generated email, I sought to write a program (.ado file) that would display a randomly selected quote in the output window with the calling of my user-written command, -quote-.  Seemed straightforward enough, until I had to assign the current date to a macro in Stata...

In both SAS and Stata, dates are numbers in that the value assigned to the date is the number of days elapsed since (or preceding) January 1, 1960.  This means that unless you format the date variables (or macros) you create to display a date format (e.g. "08/05/2011", "20110805", "August 5, 2011", etc.), you'll simply get the number of days since 01/01/1960 (e.g. 18,844).  This makes working with dates relatively simple and intuitive, assuming that the system date is stored as a number.  In SAS, you can assign the current date to a variable or macro by invoking the system-stored current date via the today() function (per SAS documentation:  "a function that returns a SAS date value corresponding to the date on which the SAS program is initiated").  In Stata, however, the system date value is accessed by invoking either the global macro, $S_DATE, or the system value for the current date [c(current_date)], and both return a string rather than a number.

If I were interested in simply printing the current date on, say, an updated daily graph, I could just assign the system date to a macro, let's call it cdate, and be done with it since it defaults to printing the actual date ("5 Aug 2011").  But since I wanted the date to be a number (for seeding purposes), I needed the value to be the numeric date (i.e. the number of days since 01/01/1960).  This required a quick search of the Statalist Archives to find out if any other users had dealt with the issue and if so, how.  Fortunately for me, others had and the workaround was rather simple:  remove the spaces from the date value then declare the value a Stata date formatted as day-month-year ("DMY").  (I didn't find exactly what I needed on the listserv, although I was able to tweak the suggestions proffered to get what I needed.)  I then assigned this date value to a Stata macro.  The code I used to assign a random number from the uniform distribution to each quote in both Stata and SAS follows:

Stata
local cdate = date(subinstr("$S_DATE" , " " , "" , .), "DMY")
* set seed to current date...
set seed `cdate'
* assign random number from uniform distribution
gen xselect = runiform()

SAS
*Generate random num for each quote using the uniform distrib.;
data RandomNumbers;
do i=1 to &NumQuotes;
r=ranuni(today());
output;
end;
run;

In the SAS program, I had to generate the random numbers in a separate dataset with the number of observations equaling the number of quotes (&NumQuotes) then merge said dataset with the dataset containing the quotes.  In Stata this step wasn't necessary -- I was able to create a new variable, xselect, containing randomly generated numbers from the uniform distribution directly into the dataset containing the quotes.  In both the Stata and SAS programs, I then sorted by the variable containing the randomly generated numbers -- xselect in Stata and r in SAS -- and chose the first quote (i.e. the quote with the lowest random number value).
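For comparison, the same select-by-seeded-random-number logic can be sketched in R.  This is not the code I ran, and the pieces are assumptions for illustration:  the quotes vector is a stand-in (two entries are placeholders), and days_since_1960() and pick_quote() are made-up helper names.  R counts dates from 1970-01-01, so the 1960 epoch used by SAS and Stata has to be subtracted explicitly.

```r
# Stand-in for the quote dataset; the last two entries are placeholders
quotes <- c(
  "Maybe this world is just another planet's hell. (Aldous Huxley)",
  "Placeholder quote two",
  "Placeholder quote three"
)

# Number of days since 1960-01-01, the epoch SAS and Stata share
days_since_1960 <- function(d) {
  as.integer(as.numeric(as.Date(d)) - as.numeric(as.Date("1960-01-01")))
}

# Seed with the numeric date, draw one uniform number per quote,
# and return the quote with the smallest draw.
# Note: set.seed() resets R's global random number stream.
pick_quote <- function(quotes, on_date = Sys.Date()) {
  set.seed(days_since_1960(on_date))  # same date -> same seed -> same quote
  xselect <- runif(length(quotes))
  quotes[order(xselect)][1]
}

days_since_1960("2011-08-05")  # 18844, matching the example above
cat("Quote of the day:\n", pick_quote(quotes, "2011-08-05"), "\n")
```

Because the seed depends only on the date, every run on a given day returns the same quote, and a new one shows up the next day (barring a repeat by chance).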

After identifying the first quote in each program, I assigned the quote to a macro then output the quote to either an automatically-generated email (SAS) or the output window (Stata).  With my departure from my previous job, I'm obviously no longer subjecting myself to a barrage of daily emails, although the snark, oxymoronica, or anti-religious quotes are just one short command away in Stata...

. quote
Quote of the day:
Maybe this world is just another planet's hell. (Aldous Huxley)