Wednesday, July 13, 2011

Workflow and Stata

I just finished reading J. Scott Long's "The Workflow of Data Analysis Using Stata" and this book is, in short, well-written and a great resource.  I bought it because I wanted to conduct my dissertation research much more systematically -- I didn't want to end up with a sprawling directory structure, multiple "analysis" datasets, and so many .do files that I couldn't possibly reproduce my results.  I have yet to embark on any meaningful statistical analysis for my dissertation -- that follows submission of my proposal and IRB approval -- but I have assembled the directory structure, mapped out variable and value labels for key variables, and implemented a standard .do file structure using the advice proffered in this book.

Before getting into the nitty-gritty, Long outlines the four steps in the workflow process:  data cleaning, analysis, results presentation, and file protection.  The first step -- data cleaning, which implicitly includes data preparation -- is essentially the crux of the book and takes up the lion's share of the text.  Here Long discusses how to plan, organize, and document your analysis; the mechanics of writing a smart and readable .do file; the use of macros and loops to increase efficiency and reduce errors; and, of course, variable creation, naming, and labeling.  The section on assembling a directory structure is especially helpful -- Long provides example directories for simple projects, large one-person projects, and collaborative projects.  Long also provides a couple of templates for .do files:  one simple and one complex (I tried to create something like this a couple of years ago but was unsuccessful).  The template I've started using in all of my analyses is the "complex" one and, per Long's direction, I load it with the keystroke Alt+1 -- I set up the script using AutoHotkey -- although I haven't figured out how to have my name and the date filled in automatically (I manually add my name, the program name, and the date after I load the template into the .do file).  The template follows:

capture log close
log using _name_, replace

// program:
// task:
// project:
// author:    _who_
// born on date:  _date_

// #0
// program setup

version 11.2
clear all
macro drop _all
set memory 75m
set more off

// #1
// describe task 1

// <other tasks here>

log close

Pretty nifty, eh?  Yeah, I thought so, too.  In addition to the .do file template, Long suggests a naming scheme for .do files:  build a number into each file name so that the files sort in the order they should run, which lets a master .do file run the entire analysis in a logical and reproducible way.
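To make the idea concrete, here is a minimal sketch of what such a master .do file might look like.  The three file names are hypothetical -- I made them up for illustration, they are not taken from Long's book -- and the sketch assumes those files sit in the current working directory.

```stata
// master.do -- a sketch only; the file names below are hypothetical
version 11.2
set more off

// The numeric part of each name keeps the files in run order,
// both in a directory listing and in this list.
do "diss01-clean.do"     // clean the raw data
do "diss02-labels.do"    // add variable and value labels
do "diss03-models.do"    // run the analysis
```

Running "do master.do" would then execute each step in sequence, and because each .do file built from the template opens and closes its own log, every step leaves its own record.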

Long concludes his text with chapters on presenting results and file protection; I've struggled to do the former efficiently, and the latter has occurred to me only from time to time.  He motivates the section on presenting results with a quote from "The Art of Scientific Writing" by Ebel, Bliefert, and Russey:  "Underlying all of natural science is a rather remarkable understanding, albeit one that attracts relatively little attention:  Everything measured, detected, invented, or arrived at theoretically in the name of science must, as soon as possible, be made public -- complete with all the details."  Long then goes on to briefly discuss tables, spreadsheets, graphs, and the output of results from estimation commands.  This section is a good jumping-off point for further research into how best to output and present results from your statistical analysis.  File protection usually gets short shrift in the grand scheme of data analysis, but Long manages to impart how important this step is -- and it is one I need to carry out on a regular basis.
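As one small illustration of outputting results from estimation commands, the sketch below uses only built-in Stata commands and the auto dataset that ships with Stata.  It is my own example, not one from the book, and the model names m1 and m2 are arbitrary.

```stata
// A sketch using Stata's built-in example data; not taken from Long's book
version 11.2
sysuse auto, clear

// Fit two nested models and store each set of estimates
regress price mpg
estimates store m1

regress price mpg weight
estimates store m2

// Show both models side by side, with standard errors, N, and R-squared
estimates table m1 m2, b(%9.3f) se(%9.3f) stats(N r2)
```

The side-by-side table is handy for checking models against each other before you worry about formatting anything for publication.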

If you are a data analyst, statistician, researcher, or even a professor, I suspect you'll benefit from this book (I know I will), especially since Long writes from the perspective of all four and thus draws on both his good and bad experiences.
