Tuesday, June 21, 2011

Popularity Contest

Stata, SAS, or R?  Not quite a life-or-death decision, but one most data analysts encounter at some point in their professional or academic lives.  Most of us profess some sort of allegiance to a particular program.  We aren't quite as fanatical as Red Sox or Yankee fans, but most of us have strong views about the strengths and weaknesses of the programs we use as well as the ones we don't.

My first encounter with data analysis software was in a "Statistics for Scientists and Engineers" course over ten years ago.  The course included a lab where we had to use SAS to input and manipulate data, generate summary and descriptive statistics, and output graphs.  Although the tasks and the material were relatively straightforward, all I remember is being confused.  I just didn't get it.  I couldn't understand how you could just declare a variable in a DATA step and be done with it.  To say I wasn't a natural would be a gross understatement.  My next experience using SAS (an undergraduate Econometrics course), fortunately, was much more encouraging, as were all the experiences that followed.  I even went so far as to take (and pass) the Base SAS Competency Exam shortly after finishing my master's.

My experience with Stata, however, couldn't have been more different.  Shortly after starting my master's, a classmate recruited me to help her TA an "Introduction to Biostatistics" course in the Master of Public Health (MPH) program, where our primary responsibility was to teach the lab component of the course using Stata.  Maybe it was my exposure to SAS two years prior and the chance to develop a "programming sense", but Stata was more intuitive.  I got it.  And I enjoyed teaching it to others.  I even subscribed to the Stata listserv, bought a Stata mug, and eventually gave a presentation at a Stata User Group meeting about how the course I TA'd incorporated Stata into the curriculum.  I'm planning to use Stata as my primary data management and statistical analysis program for my dissertation.

I don't consider myself an expert in either SAS or Stata -- I think you'd have to spend the better part of your working adult life programming in one or both languages to make that claim -- and I am even less of one with R.  I first encountered R as a master's student -- I even bought a textbook about how to conduct regression using R -- but was so enamored with Stata that I didn't develop much skill using it.  Fast forward several years and R has emerged as one of the most used, discussed, and capable data management and analysis programs available.  Everyone seems to be using it -- academics and pharmaceutical companies alike -- with little indication that its growth is about to slow.  And it certainly helps that R is freely available for download, with its capabilities continuing to grow every day.  I started using R in earnest -- mostly estimation of cumulative incidence between treatment groups and output of accompanying graphs -- while working as a biostatistician for a contract research organization.  Although my current use has scaled back from an already low level, I will continue to use it when needed.

So which data analysis software is most popular?  My experience using SAS, Stata, and R has been driven by both happenstance and circumstance -- a history likely shared by many data analysts -- so I can't say unequivocally from that experience which software is most popular.  This hasn't stopped Robert Muenchen, a statistical consultant with ~29 years of experience, from trying to find out, however.  In his periodically updated analysis of which software is most popular, available on his r4stats.com website, he presents various ways of measuring popularity and market share.  Although no one software package comes out on top with respect to every metric, R appears to generate the most chatter on the web (figure 1), Stata appears to be emerging as a formidable program in academic circles (figure 2), and SAS and SPSS appear most often among the software skills that employers are seeking (figure 3).  All three figures and their captions are taken from the r4stats.com website.
Figure 1.  Plot of listserv discussion traffic by year (through 12/31/2010).

Figure 2.  Impact of data analysis software on academic publications as measured by hits on Google Scholar.

Figure 3.  Number of jobs listing each software package in its requirements on June 27, 2010. The maximum they will display is 1,000.

And, finally, the point?  Well, no one program appears to trump all the others.  So the best course of action?  Accumulate experience with all three, but master one (or two).
