[R] SAS to R migration questions

Matthew Wilson matt at overlook.homelinux.net
Sat Sep 11 15:39:20 CEST 2004


I'd like to get away from SAS, but I don't really know R well enough at
this point to know if it would be good for this project.  I tried to
describe the essence of the project below without getting bogged down in
details.

It starts when I receive a flat data file.  There are lots of columns, but
the relevant ones are:

    custid  (customer ID number)
    saledt  (date of sale)
    salepx  (sale price)

Step 1:

I read this data into a SAS dataset.  Some of these flat files hold
several gigabytes of data.  SAS allows indexes to be created on columns,
which really speeds up queries.

I read the R import/export doc and it suggested using databases for
really big datasets.  I figured I'd probably use perl or python to read
the file and convert it to either an R .tab file or to load the data
into a SQL database for the big files (Postgres or MySQL, since I'm
trying to go 100% open source with this).
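
For files small enough to fit in memory, the conversion step could
perhaps be done in R itself.  The following is only a minimal sketch: it
assumes a tab-delimited file with a header row, dates in month/day/year
order, and prices written as "$75" with reversals as "($30)".  The
sample file and the parse_px helper are hypothetical, made up for
illustration.

```r
# Hypothetical sample file standing in for the real flat file
# (assumed to be tab-delimited with a header row).
path <- tempfile(fileext = ".tab")
writeLines(c(
  "custid\tsaledt\tsalepx",
  "111\t8/1/2004\t$75",
  "111\t9/1/2004\t$50",
  "112\t10/1/2004\t$30",
  "112\t10/1/2004\t($30)",
  "112\t10/1/2004\t$20"
), path)

# Read every column as character so nothing is guessed or mangled.
raw <- read.delim(path, colClasses = "character")

# parse_px is a hypothetical helper: "$75" -> 75, "($30)" -> -30.
parse_px <- function(x) {
  negative <- grepl("^\\(.*\\)$", x)          # parentheses mark a reversal
  value <- as.numeric(gsub("[$(),]", "", x))  # strip $, parentheses, commas
  ifelse(negative, -value, value)
}

d1 <- data.frame(
  custid = as.integer(raw$custid),
  saledt = as.Date(raw$saledt, format = "%m/%d/%Y"),  # assumed m/d/Y
  salepx = parse_px(raw$salepx)
)
```

For the multi-gigabyte files, loading into a SQL database as described
above and querying it from R is probably still the safer route.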

Step 2:

In the data, I'll usually find one row per sale, but occasionally, a
sale will be entered incorrectly at first, then later reversed, then a
third line will show the correct sale data:

    custid      saledt      salepx
    111         8/1/2004    $75
    111         9/1/2004    $50
    112         10/1/2004   $30
    112         10/1/2004   ($30)
    112         10/1/2004   $20

The fourth line reverses the third line by showing a negative charge for
the same customer ID and sale date, and the last line is the correct
line.  I want to compress all those adjustment and reversal lines out
of the data, so the outgoing data would look like this:

    custid      saledt      salepx
    111         8/1/2004    $75
    111         9/1/2004    $50
    112         10/1/2004   $20

In SAS, I use a proc summary step to accomplish this:

    proc summary data=d1 nway;    /* nway: keep only the custid*saledt rows */
        class custid saledt;
        var salepx;
        output out=d2 sum=;
    run;

This is where I need help:  how to do this step in R?
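
One possible approach, offered as a sketch rather than a definitive
answer, is base R's aggregate(), which sums a variable within each
combination of grouping columns.  The data frame below simply retypes
the five example rows, with the prices assumed to be already converted
to numbers and the dates to Date values.

```r
# The five example rows; ($30) is entered as -30.
d1 <- data.frame(
  custid = c(111, 111, 112, 112, 112),
  saledt = as.Date(c("2004-08-01", "2004-09-01", "2004-10-01",
                     "2004-10-01", "2004-10-01")),
  salepx = c(75, 50, 30, -30, 20)
)

# Sum salepx within each custid/saledt pair, so a wrong entry and its
# reversal cancel out and only the corrected amount remains.
d2 <- aggregate(salepx ~ custid + saledt, data = d1, FUN = sum)

# Put the rows in customer/date order for printing.
d2 <- d2[order(d2$custid, d2$saledt), ]
print(d2)
```

A sale that is reversed and never re-entered would remain as a row with
a total of 0; something like d2[d2$salepx != 0, ] would drop those if
they are not wanted.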

Step 3:

I print a list of the number of sales per customer ID, ranking the customer
IDs from most to least.  I use a SAS proc freq step for this:

    proc freq data=d2 order=freq;
        tables custid;
    run;

and the output would look like this:

    custid      freq
    111         2
    112         1

Again, I have no idea how to do step 3 in R.  
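
A rough equivalent in base R might be table() combined with sort().
This is a sketch under the assumption that d2 already holds one row per
sale, as in the compressed example above; the freq_df name is just
illustrative.

```r
# The compressed example data: one row per sale.
d2 <- data.frame(
  custid = c(111, 111, 112),
  saledt = as.Date(c("2004-08-01", "2004-09-01", "2004-10-01")),
  salepx = c(75, 50, 20)
)

# Count rows per customer ID, then rank from most to least sales.
freq <- sort(table(d2$custid), decreasing = TRUE)

# Turn the table into a data frame with the same column names as the
# SAS listing.
freq_df <- as.data.frame(freq, stringsAsFactors = FALSE)
names(freq_df) <- c("custid", "freq")
print(freq_df)
```

Unlike proc freq, this gives only the counts, not the percent or
cumulative columns.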

Thanks in advance!  All help is welcome.  Is this kind of work what R is
good at?

My public key:
gpg --recv-keys --keyserver www.mandrakesecure.net 0x8D10BFD5
