Tuesday, November 12, 2013

Counting and Listing Duplicates

Identifying and listing duplicates is such a crucial data management task that you'd think there would be resources all over the web showing how to do it but, strangely, I didn't find that to necessarily be the case.   For a task I'm currently working on, I had to create a counter variable that counted the number of records with duplicate values for two ID variables and as an afterthought, I also wanted to list the duplicates in their entirety.  That is, I didn't want just the duplicate records output, I also wanted the "parent" record such that in the listing you could see the parent (counter = 1) as well as the duplicates (counter = 2...k).  Creating the counter variable was straightforward enough (the ATS UCLA stat resource has a good tutorial) but listing of the duplicates wasn't as obvious.  For that task I relied on Ron Cody's "Cody's Data Cleaning Techniques Using SAS Software".  Here is some generic code:

* **create counter;
proc sort data=dsin; by id1 id2; run;

data dsin;
 set dsin;
 by id1 id2;
 if first.id1  or first.id2 then counter=0;
  counter + 1;
run;

* **write duplicates to dataset then print;
proc sort data=dsin; by id1 id2; run;

data dsin_dups;
  set dsin;
  by id1 id2;
  if first.id2 AND last.id2 then delete;
run;

proc print data=dsin_dups noobs; 
  var counter id1 id2 tx_description;
  title "Duplicate Records";

run;

No comments:

Post a Comment