[R] R dataframe and looping help
arun
smartpink111 at yahoo.com
Tue Sep 3 05:29:26 CEST 2013
HI Satish,
colnames(Output)[4]<- colnames(dat2)[i]; #guess this line should be:
colnames(x1)[4]<- colnames(dat2)[i]
Regarding the warning, I used
read.table(..., stringsAsFactors=FALSE). In your case, you might need to either use that option while reading the data or convert the factor variables to character class.
Check:
str(Output)
I forgot about sorting the data. You can use either ?sort() or ?order
dat1New<-dat1[order(dat1$CustID,as.Date(dat1$TripDate,"%d-%b-%y"),dat1$Store),] #in the example data, it didn't change anything
dat2<- dat1New[,-c(1:3)]
str(dat1New)
'data.frame': 7 obs. of 7 variables:
$ CustID : int 1 1 1 1 2 2 2
$ TripDate: chr "2-Jan-12" "6-Jan-12" "9-Jan-12" "31-Mar-13" ... ##should be factor in your original dataset
$ Store : chr "a" "c" "a" "a" ... #####
$ Bread : int 2 0 3 3 0 3 3
$ Butter : int 0 3 3 0 3 3 0
$ Milk : int 2 3 0 0 3 0 0
$ Eggs : int 1 0 0 0 0 0 0
Suppose, I read the data with stringsAsFactors=TRUE (default is this option)
dat1<- read.table(text="
CustID TripDate Store Bread Butter Milk Eggs
1 2-Jan-12 a 2 0 2 1
1 6-Jan-12 c 0 3 3 0
1 9-Jan-12 a 3 3 0 0
1 31-Mar-13 a 3 0 0 0
2 31-Aug-12 a 0 3 3 0
2 24-Sep-12 a 3 3 0 0
2 25-Sep-12 b 3 0 0 0
",sep="",header=TRUE)
str(dat1)
'data.frame': 7 obs. of 7 variables:
$ CustID : int 1 1 1 1 2 2 2
$ TripDate: Factor w/ 7 levels "24-Sep-12","25-Sep-12",..: 3 6 7 5 4 1 2
$ Store : Factor w/ 3 levels "a","b","c": 1 3 1 1 1 1 2
$ Bread : int 2 0 3 3 0 3 3
$ Butter : int 0 3 3 0 3 3 0
$ Milk : int 2 3 0 0 3 0 0
$ Eggs : int 1 0 0 0 0 0 0
dat2<- dat1[,-c(1:3)]
res<- lapply(seq_len(ncol(dat2)),function(i) {x1<-cbind(dat1[,c(1:3)],dat2[,i]);colnames(x1)[4]<- colnames(dat2)[i];x2<-x1[x1[,4]!=0,];within(x2, {daysbetweentrips<-unlist(tapply(as.Date(x2$TripDate,"%d-%b-%y"),list(x2$CustID),function(x) c(NA,as.numeric(diff(x)))));previoustripstore<-ave(x2$Store,x2$CustID,FUN=function(x) c(NA,x[-length(x)]));Nexttripstore<- ave(x2$Store,x2$CustID,FUN=function(x) c(x[-1],NA))})})
Warning messages:
1: In `[<-.factor`(`*tmp*`, i, value = c(NA, 1L, 1L)) :
invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, i, value = c(NA, 1L)) :
invalid factor level, NA generated
3: In `[<-.factor`(`*tmp*`, i, value = c(1L, 1L, NA)) :
invalid factor level, NA generated
---------------------------------------------------
To convert to character class after reading the data:
dat1[]<-lapply(dat1,function(x) if(is.factor(x)) as.character(x) else x)
str(dat1)
#'data.frame': 7 obs. of 7 variables:
# $ CustID : int 1 1 1 1 2 2 2
# $ TripDate: chr "2-Jan-12" "6-Jan-12" "9-Jan-12" "31-Mar-13" ...
# $ Store : chr "a" "c" "a" "a" ...
# $ Bread : int 2 0 3 3 0 3 3
# $ Butter : int 0 3 3 0 3 3 0
# $ Milk : int 2 3 0 0 3 0 0
# $ Eggs : int 1 0 0 0 0 0 0
dat2<- dat1[,-c(1:3)]
res<- lapply(seq_len(ncol(dat2)),function(i) {x1<-cbind(dat1[,c(1:3)],dat2[,i]);colnames(x1)[4]<- colnames(dat2)[i];x2<-x1[x1[,4]!=0,];within(x2, {daysbetweentrips<-unlist(tapply(as.Date(x2$TripDate,"%d-%b-%y"),list(x2$CustID),function(x) c(NA,as.numeric(diff(x)))));previoustripstore<-ave(x2$Store,x2$CustID,FUN=function(x) c(NA,x[-length(x)]));Nexttripstore<- ave(x2$Store,x2$CustID,FUN=function(x) c(x[-1],NA))})}) #works
A.K.
Hi Arun-
Thanks for this...
I ran this code. without the days between trips... Can you please
confirm the paranthesis and code looks right.?. they do to me....
res<- lapply(seq_len(ncol(dat2)),function(i)
{
x1<-cbind(Output[,c(1:3)],dat2[,i]);
colnames(Output)[4]<- colnames(dat2)[i];
x2<-x1[x1[,4]!=0,];
previoustripstore<-ave(x2$store,x2$CUSTID,FUN=function(x) c(NA,x[-length(x)]));
Nexttripstore<- ave(x2$store,x2$CUSTID,FUN=function(x) c(x[-1],NA))
}
)
But i get an warning like this:In `[<-.factor`(`*tmp*`, i, value = c(NA, 3L, 3L, 3L, ... :
invalid factor level, NA generated
Wat might be wrong? Please help
Thanks,
Satish
----- Original Message -----
From: arun <smartpink111 at yahoo.com>
To: R help <r-help at r-project.org>
Cc:
Sent: Monday, September 2, 2013 5:01 PM
Subject: Re: R dataframe and looping help
HI,
You may try this:
dat1<- read.table(text="
CustID TripDate Store Bread Butter Milk Eggs
1 2-Jan-12 a 2 0 2 1
1 6-Jan-12 c 0 3 3 0
1 9-Jan-12 a 3 3 0 0
1 31-Mar-13 a 3 0 0 0
2 31-Aug-12 a 0 3 3 0
2 24-Sep-12 a 3 3 0 0
2 25-Sep-12 b 3 0 0 0
",sep="",header=TRUE,stringsAsFactors=FALSE)
dat2<- dat1[,-c(1:3)]
res<- lapply(seq_len(ncol(dat2)),function(i) {x1<-cbind(dat1[,c(1:3)],dat2[,i]);colnames(x1)[4]<- colnames(dat2)[i];x2<-x1[x1[,4]!=0,];within(x2, {daysbetweentrips<-unlist(tapply(as.Date(x2$TripDate,"%d-%b-%y"),list(x2$CustID),function(x) c(NA,as.numeric(diff(x)))));previoustripstore<-ave(x2$Store,x2$CustID,FUN=function(x) c(NA,x[-length(x)]));Nexttripstore<- ave(x2$Store,x2$CustID,FUN=function(x) c(x[-1],NA))})})
res
#[[1]]
# CustID TripDate Store Bread Nexttripstore previoustripstore daysbetweentrips
#1 1 2-Jan-12 a 2 a <NA> NA
#3 1 9-Jan-12 a 3 a a 7
#4 1 31-Mar-13 a 3 <NA> a 447
#6 2 24-Sep-12 a 3 b <NA> NA
#7 2 25-Sep-12 b 3 <NA> a 1
#[[2]]
# CustID TripDate Store Butter Nexttripstore previoustripstore
#2 1 6-Jan-12 c 3 a <NA>
#3 1 9-Jan-12 a 3 <NA> c
#5 2 31-Aug-12 a 3 a <NA>
#6 2 24-Sep-12 a 3 <NA> a
# daysbetweentrips
#2 NA
#3 3
#5 NA
#6 24
#[[3]]
# CustID TripDate Store Milk Nexttripstore previoustripstore daysbetweentrips
#1 1 2-Jan-12 a 2 c <NA> NA
#2 1 6-Jan-12 c 3 <NA> a 4
#5 2 31-Aug-12 a 3 <NA> <NA> NA
#[[4]]
# CustID TripDate Store Eggs Nexttripstore previoustripstore daysbetweentrips
#1 1 2-Jan-12 a 1 <NA> <NA> NA
A.K.
Hi, I have a very quick question.. I have a data which has sales per
category per trip of each customer at different store locations, like
below..(dataset1 frome xcel attachment) CustID TripDate Store Bread Butter Milk Eggs
1 2-Jan-12 a 2 0 2 1
1 6-Jan-12 c 0 3 3 0
1 9-Jan-12 a 3 3 0 0
1 31-Mar-13 a 3 0 0 0
2 31-Aug-12 a 0 3 3 0
2 24-Sep-12 a 3 3 0 0
2 25-Sep-12 b 3 0 0 0 Here i have shown 4 items and their sales per customer per trip at each
store... However, my data contains around 100 columns with item names..
All i need to do is following: 1. Create a separate dataframe for each item. That is, create 100
dataframs one for each item.. Within the dataframe for Butter, for
example, will be contained columns 1-3 and Butter column, specifically
filtered for rows where butter>0 in sales..(so rows 1,4,7 will be
dropped from this dataframe)..Likewise for all items...(sample output
for butter is: (dataset2) CustID TripDate Store Butter
1 6-Jan-12 c 3
1 9-Jan-12 a 3
2 31-Aug-12 a 3
2 24-Sep-12 a 3 2. In same loop, create new derived variables within each dataframe for
each item... like create a lag variable for TripDate, create lag
variable for storename in next trip, storename in previous trip etc...
and also # days between trips to each store for each customer...(an
example for Butter dataframe with new derived variables would be...)
Dataset needs to be sorted by CustID, TripDate, Store before creating
derived variables (dataset3)Book1.xlsx CustID TripDate Store Butter NextTripstore previoustripstore
daysbetweentrips
1 6-Jan-12 c 3 a - -
1 9-Jan-12 a 3 - c -
2 31-Aug-12 a 3 a - -
2 24-Sep-12 a 3 - a 24 Point of creating multiple item level dataframes is, i will use them
iteratively as i will perform some regression on these datasets, using
same set of variables each time
More information about the R-help
mailing list