Tuesday, November 4, 2014

Subset and Hash dimension in run time

In data warehousing, the dimension may be very big, e.g. PRODUCT. To hash the big dimension, we have to increase memory or decrease dimension size.
Starting from SAS 9.3, we can subset and hash big dimension at run time.
Of course it will affect performance if the hash load happens too often. However, the process is straightforward and easy to maintain.
Below is the example:
data big_dimension;
    do geo=1,2,3;
        prod="test";
        output;
    end;
run;

data fact;
    do geo=1,2,3,4,5;
    output;
    end;
run;

data fact_new;
    set fact;
    by geo;

    drop string rc;

    if 0 then set big_dimension (keep=geo prod);

    if first.geo then do;
        string = cats("data dim_subset / view=dim_subset; set big_dimension; where geo=",geo,";run;");
        rc =dosubl(string);
        declare hash h (dataset:"dim_subset");
        rc = h.definekey("geo");
        rc = h.definedata("prod");
        rc = h.definedone();
    end;

    call missing(prod);
    rc = h.find();

    if last.geo then do;
    end;
run;

Monday, September 15, 2014

How to read CSV using PROC IMPORT

I believe you have read a lot paper for how to read CSV file. Here I want to add some helpful tips:
#1.
We can control the TERMSTR, ENCODING and LRECL using fileref, which may affect how PROC IMPORT handle the file.
For example, we can read UNIX format file in Win SAS.
filename in "&incsvfile" termstr=lf encoding=utf8 lrecl=32767;
proc cimport in=in out=work.test dbms=csv; 
    getnames = yes;
    datarow = 2;
    guessingrows = 2147483647;
run;

#2.
CSV file may not be read correctly when null value is using two double-quote. To fix this, please set the macro variable EFI_NOQUOTED_DELIMITER as follows:
%let EFI_NOQUOTED_DELIMITER = yes;

#3.
I always add the option GUESSINGROWS with maximum value to make sure all data are read. The option does not exist when I blog first CSV tip on 2009. :)
guessingrows = 2147483647;

Thursday, September 11, 2014

Table lookup using FCMP hashing

There are many way to look up table: SQL join, Data hashing, Format, and etc.
Starting from SAS 9.3, we can use hashing technique in PROC FCMP. In this way, we can lookup table using function. Below is the sample code:
* Scenario: Replace PROD_HIER_SK with PROD_ID.;
proc fcmp outlib=work.funcs.prod;
 function prod_id(in_prod_hier_sk) $ 200;
  length prod_id $ 200;

  /* executable for only once */
  declare hash h(dataset:"di_dm.product_dm");
  rc = h.definekey("prod_hier_sk");
  rc = h.definedata("prod_id");
  rc = h.definedone();

  /* look up first and return if found */
  call missing(prod_id);
  prod_hier_sk = in_prod_hier_sk;
  rc = h.find();

  return(prod_id);
 endsub;
run;

options cmplib=(work.funcs);

data sample_fcmp_hashing (keep=prod_hier_sk prod_id parent_prod_hier_sk parent_prod_id);
 set di_dm.product_hier_assoc_dm;

 prod_id = prod_id(prod_hier_sk);
 parent_prod_id = prod_id(parent_prod_hier_sk);
run;


Reference:
Andrew Henrick, Donald Erdman, and Stacey Christian. 2013. “Hashing in PROC FCMP to Enhance Your Productivity”, Proceedings of the SAS Global Forum 2013 Conference
Sample 47224: Load a SAS data set into a Hash Object using PROC FCMP

BTW, interestingly, you can find more information from SAS paper than from SAS Help doc.
As an experienced SAS programmer, I always keep an eye on SAS Global Forum.

System options to introduce code

I don't want to change any existing code in the complicated SAS environment. Because the codes may be there for years and are working well, you have to explain or communicate a lot to make one little harmless change. It is smart to insert the code using system options when you kick off SAS batch.
#1. -append/-insert the program in autoexec system option
$ sas_batch.sh sas_program.sas -append autoexec sas_program_full_path

#2. -initstmt/-termstmt sytem option
$ sas_batch.sh sas_program.sas -initstmt '*sas codes here;'

Scenario:
All utility macros are stored and reused using macro autocall facility, when you want to test new macro (e.g. macro_new.sas), you can run the same work with new macro as follows:
$ sas_batch.sh sas_program.sas -append autoexec macro_new.sas
$ sas_batch.sh sas_program.sas -initstmt '%inc "macro_new.sas";'
In this way, we can test the new macro without change any existing codes.

Friday, June 6, 2014

SYSCC - bridge between SAS and OS

SYSCC automatic macro variable plays important role in the communication between SAS and the operating environment.

To use it efficiently, please keep in mind the points at below:
1. SYSCC will take only integer value, even for decimal value. Furthermore, it will prompt error when non-numeric value is assigned.
42   %let syscc=-1.3;
43   %put &=syscc;
SYSCC=-1
44   %let syscc=test;
ERROR: The value supplied for assignment to the numeric automatic macro variable SYSCC was out of range or could not be converted to a numeric value.

2. SYSCC code will be translated to a return code on SAS exit. For more information, please see host-specific features of the SAS language.
3. SYSCC always take bigger numeric value, which means it retains highest error code in the end.
4. When SYSCC has error code (>=5), it does NOT stop executing any subsequent SAS statements while ABORT statement does.

Thursday, May 15, 2014

How to execute an operating environment command from SAS session

The topic has been discussed in so many SAS papers. Below are my considerations:

1. X statement / X command
It is the most popular one. I always use it in interactive mode. However, I would suggest that we should not use it in batch since it is difficult to control task and get status.

2. %SYSEXEC macro statement
As you know, macro string quoting always a challenge to SAS programmer. If there is any special character in the command, it may take you much time for troubleshooting. To make the code clean and readable, I prefer not using %SYSEXEC in our production code.

3. SYSTEM function / CALL SYSTEM routine
It works good with DATA step.

4. FILENAME statement PIPE engine
It is convenient to get the output of command.
Tip: to get the return code of the command, we can use this: FILENAME CMD PIPE "command; echo $?";

5. SYSTASK statement
It is my favorite one. It is because you will have more control on the command: execute many tasks in parallel, list tasks, kill task, and get status easily. Furthermore, it is the only way to execute command asynchronously.

Wednesday, May 14, 2014

colon(:) operator modifier - Note 2

WHERE statement is so useful tool to subset data. However, colon(:) operator modifier can not be used in WHERE statement.
Currently I can find some workarounds at below:
1.
PROC SQL truncated string comparison operators such as EQT, GTT, and LET:
They are undocumented operators because I can not find them in SAS document. For more information, please read SUGI paper 056-2009.

Please note that they have to be used in WHERE statement of PROC SQL.
2.
LIKE operator:
WHERE name like 'J%'; * all name with leading J character;

3.
To encapsulate the operator in FCMP:
proc fcmp outlib=work.funcs.trial;
   function func_in(a $, b $);
      if a =: b then result=1;
      else result=0;

      return (result);
   endsub;
run;

options cmplib=work.funcs;

data test;
    set sashelp.class;
    if func_in(name, 'J');
run;

Although they can not really replace the colon(:) operator modifier, they are useful to open your mind.

colon(:) operator modifier - Note 1

When string with different length are compared, the colon(:) operator modifier will be very helpful since it truncate longer string before comparison.
However, what is the "length"? Is it the actual length of of a non-blank character string (from LENGTH function)? or the amount of memory that is allocated for the string (from LENGTHM function)?

To clarify the question, please see the sample:
data a;
    length a1 a4 $10;
    a1 = 't';
    a4 = 'test';
    
    flag_a1 = ifc(a1=:'te', 'Y', 'N');
    flag_a4 = ifc(a4=:'te', 'Y', 'N');

    put a1= flag_a1=:;
    put a4= flag_a4=;
run;

Output:
a1=t flag_a1=N  (Explanation: a1 is not truncated because the memory length is 10. the implicit string comparison is 't' = 'te'.)
a4=test flag_a4=Y


Conclusion: the colon(:) operator modifier will truncate the longer string or pad pad with blank the shorter one.