Monday, August 8, 2011


A friend (and former class mate) and me exchanged emails a few days ago after I had recommended a couple of R books on her Fb page (she was soliciting the help of someone with access to SAS) and during the email exchange, I offered to run her code in SAS so as to corroborate her output in R.  (Disclaimer:  I may own a handful of R books but I'm not, by any means, nearly as proficient in R as in SAS.)  She sent me a snippet of her code in R and she was trying to run a bivariate logistic regression where the independent variable (IV) was categorical with a half-dozen categories.  Although you can obtain a single estimate (e.g. odds ratio) for the IV, this rarely makes much sense unless the IV is ordinal and the effect on the dependent variable (DV) is thought to be linear.  Unlike SAS, it isn't immediately obvious how to modify the general linear model command (glm) in R to obtain odds ratios for k-1 levels of a categorical variable, relative to level k.  Turns out, you don't modify the glm command:  you declare the variable to be a "factor" variable prior to running the command.  In SAS, the LOGISTIC procedure allows you declare any categorical variables as "class" variables before the MODEL command with the option to explicitly specify which one of the levels will be the referent.  (In R, you need to "re-level" the categorical variable with the relevel() command.)  This seems fairly intuitive in retrospect, but given that I had never conducted a logistic regression in R and haven't been using the program much lately, I had to browse through a couple of my R books -- "SAS and R:  Data Management, Statistical Analysis, and Graphics" (Kleinman & Horton) and "Modern Applied Statistics with S" (Venables & Ripley) -- to get where I needed to go.  The dataset I used to compare outputs in R and SAS was referenced in Venables and Ripley's text but originally came from Hosmer and Lemeshow's "Applied Logistic Regression".  The dataset contains n=189 observations and sought to identify risk factors leading to low birth weight babies.  Since my friend was most keen on estimating odds ratios for a categorical variable, I decided to run the model presented on page 32 of Hosmer & Lemeshow where the IVs were race and subject (mother) weight (this also allowed me to merry up my results with those contained in the textbook).  The code for each program follows:


# ## read in low birth weight data (Hosmer & Lemeshow, 1989)

# ##verify dataset contents

# ##change categorical independent variable to factor variable
birthwt$race <- factor(birthwt$race,
birthwt$race <- as.factor(birthwt$race)
# ##Logistic model
birthwt.glm <- glm(low ~ lwt + race, family=binomial,

# ##exponentiate coefficients and return confidence intervals
exp(confint(birthwt.glm, level=.95))

# !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! #
# if want to change reference category of race, then use -relevel()- and re-run GLM
birthwt$race <- relevel(birthwt$race, "other")  

And the SAS code:
* **verify data contents
proc contents data=birthwt position; run;

* **format race variable;
proc format;
  value racef 1='White' 2='Black' 3='Other';
proc freq data=birthwt; tables race; format race racef.; run;

* **Logistic Model;
proc logistic data=birthwt descending;
  class race (param=REF ref='White');
  model low = lwt race / risklimits;
  format race racef.;

The coefficients and odds ratios (exponentiated coefficients) returned in the two programs were identical, although the confidence intervals on the odds ratios exhibited negligible differences.

No comments:

Post a Comment