Tuesday, June 21, 2011

Popularity Contest

Stata, SAS, or R?  Not quite a life-or-death decision, but one most data analysts encounter at some point in their professional/academic lives.  Most data analysts profess some sort of allegiance to a particular program -- we aren't quite as fanatical as Red Sox or Yankee fans, however -- although most of us have strong views about the strengths and weaknesses of the programs we use as well as the ones we don't use.

My first encounter with data analysis software was in a "Statistics for Scientists and Engineers" course over ten years ago.  The course included a lab where we had to use SAS to input & manipulate data, generate summary & descriptive statistics, and output graphs.  Although the tasks and the material were relatively straightforward, all I remember is being so confused.  I just didn't get it.  I couldn't understand how you could just declare a variable in a DATA step and be done with it.  To say I wasn't a natural would be a gross understatement.  My next experience using SAS (undergraduate Econometrics course), fortunately, was much more encouraging, as were all the experiences that followed; I even went so far as to take (and pass) the Base SAS Competency Exam shortly after finishing my master's.

My experience with Stata, however, couldn't have been more different.  Shortly after starting my master's, a classmate recruited me to help her TA an "Introduction to Biostatistics" course in the Master of Public Health (MPH) program where our primary responsibility was to teach the lab component of the course using Stata.  Maybe it was my exposure to SAS two years prior and the chance to develop a "programming sense", but Stata was more intuitive.  I got it.  And I enjoyed teaching it to others.  I even subscribed to the Stata listserv, bought a Stata mug, and eventually gave a presentation at a Stata User Group meeting about how the course I TA'd incorporated Stata into the curriculum.  I'm planning to use Stata as my primary data management and statistical analysis program for my dissertation.

I don't consider myself an expert in either SAS or Stata -- I think you'd have to spend the better part of your working adult life programming in either/both language(s) to make that claim -- and am even less so with R.  I first encountered R as a master's student -- I even bought a textbook about how to conduct regression using R -- but was so enamored with Stata that I didn't develop much skill using R.  Fast forward several years and R emerges as one of the most used, discussed, and capable data management/analysis programs on the market.  Everyone seems to be using it -- academics and pharmaceutical companies alike -- with little indication that its growth is about to slow.  And it certainly helps that R is freely available for download with its capabilities continuing to grow every day.  I started using R in earnest -- mostly estimation of cumulative incidence between treatment groups and output of accompanying graphs -- while working as a biostatistician for a contract research organization and although my current use has scaled back from an already low level, I will continue to use it when needed.

So which data analysis software is most popular?  My experience using SAS, Stata, and R has been driven by both happenstance and circumstance -- a history likely shared by many data analysts -- thus making it difficult to unequivocally determine which software is most popular.  This hasn't stopped Robert Muenchen, a statistical consultant with ~29 years of experience, from trying to find out, however.  In his periodically updated analysis of which software is most popular, accessible here, he presents various ways of measuring popularity and market share.  Although no single program comes out on top with respect to every metric, R appears to be chattered about the most on the web (figure 1), Stata appears to be emerging as a formidable software program in academic circles (figure 2), and SAS (as well as SPSS) appears most often among the software skills that employers are seeking (figure 3).  All three figures and their captions are taken from the r4stats.com website.
Figure 1.  Plot of listserv discussion traffic by year (through 12/31/2010).
Figure 2. Impact of data analysis software on academic publications as measured by hits on Google Scholar.
Figure 3. Number of jobs listing each software package in its requirements on June 27, 2010. The maximum they will display is 1,000.
And, finally, the point?  Well, no one program appears to trump all the others.  So the best course of action?  Accumulate experience with all three, but master one (or two).

Thursday, June 9, 2011

Pre-Proposal Template, v.1

The research/dissertation phase of a doctoral program can be very peculiar.  Some programs provide a well-worn map of the research/dissertation process whereas others, well, not so much.  Perhaps in the latter this is intentional and meant to emulate the research process in 'the real world'?  Or maybe this is the graduate-level version of having to walk uphill to school both ways?  I don't know.  But either way, my program is of the latter flavor and although I appreciate the lessons I'm learning about development of a research topic, I also appreciate efficiency, mentorship, and guided-but-restrained direction.  As I write this, I realize that maybe my views are terribly misguided and naive, but at this point, I'm unable to completely let go of the idea that the process can be improved.  And thus the topic of this post.  While trying to identify a research topic and distill it into a workable thesis, I iterated through a couple of prospectuses (research idea #1), then a couple of concept papers (research idea #2), and now a pre-proposal (still research idea #2 -- yay!).  Up until a few months ago, I was creating and circulating these documents under the impression that they weren't really required (the documents helped to shape and guide my thinking).  Turns out I was wrong; apparently all that is required is a one-page description of the topic.  But even this appears to not be 100% accurate, considering what a fellow student assembled and submitted to the faculty.  Assuming that past student experience is more reliable than present prescription, a 3-6 page double-spaced pre-proposal appears to be what is really sought.  Embedded is the format I'm currently using (my pre-proposal has yet to be accepted and until it is, I'll delay posting the format of the final version).