Wednesday, June 25, 2014

Easily Count Missing Values in SAS

If counting the number of missing values among groups of variables is the task (something I had to do today), then using PROC FORMAT seems to be the easiest and most efficient method if the variables are character whereas if the variables are numeric, PROC MEANS is a good approach.  The code snippet below was borrowed from "Cody's Data Cleaning Techniques Using SAS Software" by Ron Cody.  

Character Variables:

* **format for missing vs. non-missing values;
proc format;
  value $misscnt
    ' ' = 'Missing'
    OTHER = 'Non-Missing';

proc freq data=dsin;
  tables _character_ / nocum missing;
  format _character_ $misscnt.;

And for numeric variables:

proc means data=dsin n nmiss;
  var _numeric_;

It really can't get much simpler than this.  

Monday, June 2, 2014

SAS, Excel, and Leading Zeros

Working with Excel can sometimes be (usually is?) a less-than-pleasant experience, especially if you have to interface with a statistical software program (e.g. SAS, Stata).  This interfacing usually begins with moving the data from Excel to SAS and although seemingly simple and straightforward, there are small (and often overlooked) considerations that really annoy.  The loss of leading zeros from Excel to SAS is one.  

A Google search of some combination of "SAS", "Excel", and "leading zeros" returns several hits but as I clicked through them, there wasn't any one that I found 100% satisfactory.  One option was to change the numeric values in the Excel spreadsheet to text but even this didn't seem to retain the leading zeros (even in Excel!).  Another option was to use DDE but this required an INPUT statement and knowing the cell range --- two annoying intermediate steps I'd rather not deal with.  I think PROC IMPORT might also be an option if the data is saved as CSV but this is of marginal use if using an XLSX file.  One method, though, seemed to have some promise --- the Excel LIBNAME engine --- but when I first tried it, the leading zeros were not retained.  Turns out, though, this problem is easily rectified by adding a FORMAT statement to the DATA step.  Note that the variables are numeric but if character variables are sought, a PUT statement can be easily substituted for the FORMAT statement.  The code is below:

libname RawValue EXCEL "<Pathname Here>\<FileName Here>.xlsx";
data dsnout; 
  set RawValue."Sheet1$"n;
  format var1 var2 vark Z5.0;
libname RawValue clear;