Wednesday, November 27, 2013

Create Fake Data: SAS vs. Stata

Both SAS and Stata offer plenty of ways to access fabricated (or publicly available) data shipped with the software (e.g., -sysuse- in Stata), but it isn't immediately obvious how to create fake data from scratch.  I'm not sure if this is because doing so is largely unnecessary given the availability of _actual_ data, but I figured it would be useful to know how to create a fictional dataset on the fly if, for instance, I wanted a break from the shipped datasets or none of them suited my needs.

In SAS, a single DATA step can generate several variables and then output them to a SAS dataset.  A cursory Google search turned up a quite informative paper from a SAS Users Group meeting by Andrew J. L. Cary on creating data in SAS.  I adapted some of his code and eventually arrived at the block below:

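data fiction (drop=seed);
  *set seed;
  seed = 20131126;
  array sitelist [5] $10 _temporary_ ('WashDC', 'Princeton', 'Chicago', 'Cambridge', 'Oakland');
  do id = 1 to 150;
    *site: probabilities .3, .4, .1, .1, with the remaining .1 going to the fifth site;
    site = sitelist[rantbl(seed,.3,.4,.1,.1)];
    *gender: 0 or 1 with equal probability;
    gender = int(ranuni(seed)+0.5);
    *weight: normal with mean 175 and sd 30;
    weight = 175 + rannor(seed)*30;
    output;
  end;
run;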


In Stata, a paper by Maarten Buis proved quite helpful in guiding my coding.  Unlike in SAS, where everything can be accomplished in a single DATA step, Stata requires several distinct steps, the first and most important of which is setting the number of observations in the _fake_ dataset.  Each variable is then created with its own -generate- command.

In both SAS and Stata, I set the seed so that within each program the results (e.g. frequencies, summary statistics) could be replicated if run at a later time.
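
To illustrate what setting the seed buys you in Stata, here is a minimal sketch that draws the same uniform numbers twice by resetting the seed in between.  The variable names draw1 and draw2 are made up for this illustration, and the snippet clears whatever data are in memory, so run it on its own rather than in the middle of other work.

```stata
* **illustration only: resetting the seed reproduces the same draws;
clear
set obs 5
set seed 20131126
gen draw1 = runiform()
set seed 20131126
gen draw2 = runiform()
* **stops with an error if any pair of draws differs;
assert draw1 == draw2
* **leave memory empty again;
clear
```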


// Task #1
* **create observations;
set obs 150


// Task #2
* **create variables via series of -gen- commands;
set seed 20131126

* **using uniform distribution for random draws (runiform() is the current name for uniform());
gen rand = runiform()

* **site-region (8 site-regions);
gen siteregion = cond(rand < .15, 1, ///
                 cond(rand < .30, 2, ///
                 cond(rand < .40, 3, ///
                 cond(rand < .55, 4, ///
                 cond(rand < .75, 5, ///
                 cond(rand < .90, 6, ///
                 cond(rand < .95, 7, ///
                 8)))))))
 
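* **site (5 sites);
* **redraw first so that site is not a deterministic function of siteregion;
replace rand = runiform()
gen site = cond(rand < .3, 1, ///
           cond(rand < .7, 2, ///
           cond(rand < .8, 3, ///
           cond(rand < .9, 4, ///
           5))))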
  
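* **gender (2 genders);
* **redraw first so that gender is not determined by the draw used above;
replace rand = runiform()
gen gender = rand < 0.5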

* **weight (continuous:  mean 175 and sd 30);
gen weight = rnormal(175,30)

 
// Task #3
* **assign value labels;
label define sitereg 1 "Pacific" 2 "Mountain" 3 "Mid-West" 4 "South" 5 "Mid-Atlantic" ///
  6 "NorthEast" 7 "North" 8 "West"
label values siteregion sitereg

label define site 1 "WashDC" 2 "Princeton" 3 "Chicago" 4 "Cambridge" 5 "Oakland"
label values site site
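
A quick way to confirm that the fake data look roughly as intended is to tabulate the categorical variables and summarize the continuous one.  This is only a sketch, and it assumes the dataset built above is still in memory; with 150 observations the frequencies and the mean will only approximate the target probabilities and the mean of 175.

```stata
* **check the simulated variables (assumes the dataset above is in memory);
tabulate siteregion
tabulate site
tabulate gender
summarize weight
```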
