Wednesday, November 27, 2013

Create Fake Data: SAS vs. Stata

There are a lot of resources in both SAS and Stata for accessing fabricated (or publicly available) data shipped with the software program (e.g. -sysuse- in Stata) but it isn't immediately obvious how to create fake data from scratch.  I'm not sure if this is because doing so is largely unnecessary due to the availability of _actual_ data but I figured it would be useful to know how to create a fictional dataset on the fly if, for instance, I wanted a break from using the program datasets or if none of them were suitable for my needs.  

In SAS, a single DATA step can generate several variables then output them to a SAS dataset.  A cursory Google search turned up a paper from a SAS Users Group Meeting by Andrew J. L. Cary discussing creation of data in SAS that was quite informative.  I adapted some of his code and eventually coded the block below:

data fiction (drop=i);
 *set seed;
 seed = 20131126;
 array sitelist [5] $10 _temporary_ ('WashDC', 'Princeton', 'Chicago', 'Cambridge', 'Oakland');
  do id = 1 to 150;
site = sitelist[rantbl(seed,.3,.4,.1,.1)];
   gender = int(ranuni(seed)+0.5);
weight = 175 + rannor(seed)*30;
output;
  end;
run;


In Stata, a paper by Maarten Buis proved quite helpful in guiding my coding.  Unlike in SAS where everything can be accomplished in a single discrete DATA step, Stata requires several distinct steps that begin with, most importantly, setting the number of observations in the _fake_ dataset.  Each of the variables are created with a series of -generate- statements.  

In both SAS and Stata, I set the seed so that within each program the results (e.g. frequencies, summary statistics) could be replicated if run at a later time.


// Task #1
* **create observations;
set obs 150


// Task #2
* **create variables via series of -gen- commands;
set seed 20131126

* **using uniform distribution for random draws;
gen rand = uniform()

* **site-region (8 site-regions);
gen siteregion = cond(rand < .15, 1, ///
                 cond(rand < .30, 2, ///
               cond(rand < .40, 3, ///
               cond(rand < .55, 4, ///
               cond(rand < .75, 5, ///
               cond(rand < .90, 6, ///
               cond(rand < .95, 7, ///
               8)))))))
 
* **site (5 sites);
gen site = cond(rand < .3, 1, ///
           cond(rand < .7, 2, ///
          cond(rand < .8, 3, ///
          cond(rand < .9, 4, ///
          5))))
  
* **gender (2 genders);
gen gender = rand < 0.5

* **weight (continuous:  mean 175 and sd 30);
gen weight = rnormal(175,30)

 
// Task #3
* **assign value labels;
label define sitereg 1 "Pacific" 2 "Mountain" 3 "Mid-West" 4 "South" 5 "Mid-Atlantic" ///
  6 "NorthEast" 7 "North" 8 "West"
label values siteregion sitereg

label define site 1 "WashDC" 2 "Princeton" 3 "Chicago" 4 "Cambridge" 5 "Oakland"
label values site site

Tuesday, November 12, 2013

Counting and Listing Duplicates

Identifying and listing duplicates is such a crucial data management task that you'd think there would be resources all over the web showing how to do it but, strangely, I didn't find that to necessarily be the case.   For a task I'm currently working on, I had to create a counter variable that counted the number of records with duplicate values for two ID variables and as an afterthought, I also wanted to list the duplicates in their entirety.  That is, I didn't want just the duplicate records output, I also wanted the "parent" record such that in the listing you could see the parent (counter = 1) as well as the duplicates (counter = 2...k).  Creating the counter variable was straightforward enough (the ATS UCLA stat resource has a good tutorial) but listing of the duplicates wasn't as obvious.  For that task I relied on Ron Cody's "Cody's Data Cleaning Techniques Using SAS Software".  Here is some generic code:

* **create counter;
proc sort data=dsin; by id1 id2; run;

data dsin;
 set dsin;
 by id1 id2;
 if first.id1  or first.id2 then counter=0;
  counter + 1;
run;

* **write duplicates to dataset then print;
proc sort data=dsin; by id1 id2; run;

data dsin_dups;
  set dsin;
  by id1 id2;
  if first.id2 AND last.id2 then delete;
run;

proc print data=dsin_dups noobs; 
  var counter id1 id2 tx_description;
  title "Duplicate Records";

run;

Friday, November 1, 2013

F5, F6, and F7 in SAS

I've started working in SAS much more now that I'm reporting to an office job everyday.  (Not that full-time work on a dissertation isn't a job, but I digress...)  Anyway, I don't like to switch from the keyboard to the mouse to click on a different window while working in SAS --- I usually Ctrl+Tab between the windows but if you have multiple programming (Editor) windows open or you don't want to tab through the Explorer or Results windows then this can get cumbersome.  A quick Google search of SAS shortcut keys took me to a SAS help page where I learned these gems:

F5:  Directly tabs to the programming (Editor) window(s)
F6:  Directly tabs to the Log window
F7:  Directly tabs to the Output window.

I suppose moving the hands from the keyboard to the mouse isn't that much work but if you're doing it dozens, hundreds even, of times a day, the effort (and annoyance) can add up.