"Last night I lay in bed looking up at the stars in the sky and I thought to myself, where the heck is the ceiling."
Here is a great SAS tip that requires a bit of background explanation, but it's worth it!
The tip is about how the data step works. When you read in a data set with the set statement like below, SAS processes every column and row of the input file. If the data set is long and wide it will take SAS a long time to finish!
Data Output;
Set Input;
Run;
If you were only interested in a few columns, you might you be tempted to do something like this:
Data Output (keep = Branch Line Date);
Set Input;
Run;
Which will create an output data set with just the 3 columns, Branch Line and Date. However! using this method means that SAS still has to read in every column and row from the Input data set. Instead it's far better to do this:
Data Output;
Set Input (keep = Branch Line Date);
Run;
Here we are telling SAS to only read in the three columns Branch Line and Date from the input data set. Every other column will be ignored so the process of reading in the Input data set will be far more efficient and quick!
But what if we only want a subset of Branches, Lines or Dates?
Say we just want to see the Lines in Branch 308, we could to the following:
data Output;
set Input (keep = Branch Line Date);
where Branch = 308;
run;
Once the Input data set has been processed, every record of the Branch column would then be re-checked to see if the branch was equal to 308, as demanded by the Where statement.
But we know that the Set statement reads in every column and row of the input data set, so why not check if the Branch is equal to 308 at the same time to prevent double work?
data Output;
set Input (keep = Branch Line Date where = (Branch = 308) );
run;
By using Where in the Set statement, you can drastically improve the efficiency of the data step.
Additionally, if you have multiple conditions on the data you want to bring in, make sure the first condition is most unlikely, the set statement will stop processing a record if a condition is not met and move on to the next one. The earlier you can stop processing a record you don't want, the less work SAS will have to do!
data Output;
set Input (keep = Branch Line Date where = ( Date = '24Dec2010'd and Branch = 308 ) );
run;
There are plenty of other things you can do in the Set statement. I'd recommend you do as much here as possible. Enjoy!
data Output;
set Input (Keep = Branch Line Date Rename=(Line = Product) where = (Branch = 308) );
run;
Good post! I know how to use those where=, drop=, keep= within data output; set input; run; but never realised it can speed up the data processing! Thanks.
ReplyDelete