[R] ave function

Wed Aug 21 04:27:38 CEST 2013

Hi,

I suspect that in your original dataset some TERM/INST_NUM combinations have no rows, so some elements of the list returned by split() are empty.

Clean<- structure(list(GRADE = c(1, 2, 3, 1.5, 1.75, 2, 0.5, 2, 3.5, 
3.5, 3.75, 4), TERM = c(9L, 9L, 9L, 8L, 8L, 8L, 9L, 9L, 9L, 8L, 
8L, 8L), INST_NUM = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 
1L, 1L)), .Names = c("GRADE", "TERM", "INST_NUM"), class = "data.frame", row.names = c(NA, 
-12L))

  lapply(split(Clean,list(Clean$TERM,Clean$INST_NUM)),function(x) shapiro.test(x$GRADE))
#$`8.1`

#    Shapiro-Wilk normality test
#
#data:  x$GRADE
#W = 1, p-value = 1
#

#$`9.1`
#
 #   Shapiro-Wilk normality test
#
#data:  x$GRADE
#W = 1, p-value = 1

-----------------------------------------------------
  sapply(split(Clean,list(Clean$TERM,Clean$INST_NUM)),function(x) shapiro.test(x$GRADE)$p.value)
#8.1 9.1 8.2 9.2 
 # 1   1   1   1 
with(Clean, aggregate(GRADE,list(TERM,INST_NUM),FUN=shapiro.test)) #shapiro.test() returns a list for each group, hence the warning below
#  Group.1 Group.2 x
#1       8       1 1
#2       9       1 1
#3       8       2 1
#4       9       2 1
#Warning message:
#In format.data.frame(x, digits = digits, na.encode = FALSE) :
 # corrupt data frame: columns will be truncated or padded with NAs
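
A short sketch of why aggregate() complains here, assuming the same toy data as Clean (rebuilt below so the snippet runs on its own). shapiro.test() returns an "htest" list with four elements rather than a single number, so aggregate() cannot lay the result out as an ordinary column. The values printed under x look like the W statistic (the first element), though I have not checked the internals. Extracting one number per group avoids the problem:

```r
# Same values as Clean above, rebuilt so this snippet is self-contained
Clean <- data.frame(
  GRADE    = c(1, 2, 3, 1.5, 1.75, 2, 0.5, 2, 3.5, 3.5, 3.75, 4),
  TERM     = rep(c(9L, 8L, 9L, 8L), each = 3),
  INST_NUM = rep(c(1L, 2L, 2L, 1L), each = 3)
)

res <- shapiro.test(Clean$GRADE[1:3])
class(res)  # "htest"
names(res)  # "statistic" "p.value" "method" "data.name"

# Return a single number per group so aggregate() gets a regular column
pvals <- with(Clean, aggregate(GRADE, list(TERM = TERM, INST_NUM = INST_NUM),
                               FUN = function(x) shapiro.test(x)$p.value))
pvals
```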

library(plyr)
ldply(dlply(Clean,.(TERM,INST_NUM), function(x) shapiro.test(x$GRADE)), summarize, pval=p.value)
#  TERM INST_NUM pval
#1    8        1    1
#2    8        2    1
#3    9        1    1
#4    9        2    1

Now, consider this example:

Clean1<- structure(list(GRADE = c(1, 2, 3, 1.5, 1.75, 2, 0.5, 2, 3.5, 
3.5, 3.75, 4, 4.5, 4.25, 4.32), TERM = c(9L, 9L, 9L, 8L, 8L, 
8L, 9L, 9L, 9L, 8L, 8L, 8L, 10L, 10L, 10L), INST_NUM = c(1L, 
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("GRADE", 
"TERM", "INST_NUM"), class = "data.frame", row.names = c(NA, 
-15L))
lapply(split(Clean1,list(Clean1$TERM,Clean1$INST_NUM)),function(x) shapiro.test(x$GRADE))
#Error in shapiro.test(x$GRADE) : sample size must be between 3 and 5000

 split(Clean1,list(Clean1$TERM,Clean1$INST_NUM))[[6]] ###0 rows
#[1] GRADE    TERM     INST_NUM
#<0 rows> (or 0-length row.names)
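
To find such empty (or too small) combinations without inspecting the list elements one by one, a cross-tabulation may help. This is only a sketch on the toy data; Clean1 is rebuilt here so the snippet runs on its own:

```r
# Same values as Clean1 above
Clean1 <- data.frame(
  GRADE    = c(1, 2, 3, 1.5, 1.75, 2, 0.5, 2, 3.5, 3.5, 3.75, 4, 4.5, 4.25, 4.32),
  TERM     = rep(c(9L, 8L, 9L, 8L, 10L), each = 3),
  INST_NUM = rep(c(1L, 2L, 2L, 1L, 1L), each = 3)
)

# Number of rows for every TERM x INST_NUM combination;
# a 0 marks a combination that split() returns as an empty data frame
counts <- with(Clean1, table(TERM, INST_NUM))
counts

# Combinations with fewer than 3 rows, which shapiro.test() would reject
small <- which(counts < 3, arr.ind = TRUE)
small
```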

lst1<-split(Clean1,list(Clean1$TERM,Clean1$INST_NUM))
lapply(lst1[sapply(lst1,nrow)>0], function(x) shapiro.test(x$GRADE))
#$`8.1`
#
 #   Shapiro-Wilk normality test
#
#data:  x$GRADE
#W = 1, p-value = 1
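
As an alternative to filtering afterwards, split() has a drop argument that discards combinations that do not occur in the data. A minimal sketch, with Clean1 rebuilt so the snippet is self-contained (lst2 and pv are just illustrative names):

```r
# Same values as Clean1 above
Clean1 <- data.frame(
  GRADE    = c(1, 2, 3, 1.5, 1.75, 2, 0.5, 2, 3.5, 3.5, 3.75, 4, 4.5, 4.25, 4.32),
  TERM     = rep(c(9L, 8L, 9L, 8L, 10L), each = 3),
  INST_NUM = rep(c(1L, 2L, 2L, 1L, 1L), each = 3)
)

# drop = TRUE removes TERM/INST_NUM combinations with no rows
lst2 <- split(Clean1, list(Clean1$TERM, Clean1$INST_NUM), drop = TRUE)
names(lst2)

# One p-value per remaining group
pv <- sapply(lst2, function(x) shapiro.test(x$GRADE)$p.value)
pv
```

Note that drop = TRUE only removes empty groups; groups with 1 or 2 rows would still make shapiro.test() fail.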

You could do this directly with:
 ldply(dlply(Clean1,.(TERM,INST_NUM), function(x) shapiro.test(x$GRADE)), summarize, pval=p.value)
#  TERM INST_NUM      pval
#1    8        1 1.0000000
#2    8        2 1.0000000
#3    9        1 1.0000000
#4    9        2 1.0000000
#5   10        1 0.5248807
 ldply(dlply(Clean1,.(TERM,INST_NUM), function(x) shapiro.test(x$GRADE)), summarize, pval=p.value,stat1=statistic)
#  TERM INST_NUM      pval     stat1
#1    8        1 1.0000000 1.0000000
#2    8        2 1.0000000 1.0000000
#3    9        1 1.0000000 1.0000000
#4    9        2 1.0000000 1.0000000
#5   10        1 0.5248807 0.9393788
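
If you would rather not load plyr, tapply() is another option. As far as I know it does not call the function on empty cells and fills them with NA instead, so no filtering is needed. A sketch on the same toy data (pm is just an illustrative name):

```r
# Same values as Clean1 above
Clean1 <- data.frame(
  GRADE    = c(1, 2, 3, 1.5, 1.75, 2, 0.5, 2, 3.5, 3.5, 3.75, 4, 4.5, 4.25, 4.32),
  TERM     = rep(c(9L, 8L, 9L, 8L, 10L), each = 3),
  INST_NUM = rep(c(1L, 2L, 2L, 1L, 1L), each = 3)
)

# Matrix of p-values: rows are TERM, columns are INST_NUM;
# the TERM 10 / INST_NUM 2 cell has no data and comes back as NA
pm <- with(Clean1, tapply(GRADE, list(TERM = TERM, INST_NUM = INST_NUM),
                          function(x) shapiro.test(x)$p.value))
pm
```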

#or
 with(Clean1, aggregate(GRADE,list(TERM,INST_NUM),FUN=function(x) shapiro.test(x)$p.value)) 
#  Group.1 Group.2         x
#1       8       1 1.0000000
#2       9       1 1.0000000
#3      10       1 0.5248807
#4       8       2 1.0000000
#5       9       2 1.0000000

#If you want both pvalue and statistic
with(Clean1, aggregate(GRADE,list(TERM,INST_NUM),FUN=function(x) {st <- shapiro.test(x); cbind(st$p.value,st$statistic)}) ) #run the test once per group
#  Group.1 Group.2       x.1       x.2
#1       8       1 1.0000000 1.0000000
#2       9       1 1.0000000 1.0000000
#3      10       1 0.5248807 0.9393788
#4       8       2 1.0000000 1.0000000
#5       9       2 1.0000000 1.0000000
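
Since the original question used ave(), here is a sketch of how the p-value could be attached to every row in the same style as the zGrade column. The helper safe_shapiro_p and the column name shapiro_p are my own illustrative names, not anything from your code. The helper returns NA instead of stopping when a group is outside the 3 to 5000 range or is constant, and interaction(..., drop = TRUE) keeps only the combinations that actually occur:

```r
# Same values as Clean1 above
Clean1 <- data.frame(
  GRADE    = c(1, 2, 3, 1.5, 1.75, 2, 0.5, 2, 3.5, 3.5, 3.75, 4, 4.5, 4.25, 4.32),
  TERM     = rep(c(9L, 8L, 9L, 8L, 10L), each = 3),
  INST_NUM = rep(c(1L, 2L, 2L, 1L, 1L), each = 3)
)

# Hypothetical helper: p-value of the Shapiro-Wilk test, or NA when
# shapiro.test() would refuse the input
safe_shapiro_p <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  if (n < 3 || n > 5000 || length(unique(x)) < 2) return(NA_real_)
  shapiro.test(x)$p.value
}

# ave() repeats the single p-value for every row of its group
Clean1$shapiro_p <- with(Clean1,
  ave(GRADE, interaction(TERM, INST_NUM, drop = TRUE), FUN = safe_shapiro_p))
Clean1
```

You can then inspect shapiro_p before deciding whether to scale GRADE within each group.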

Hope this helps.

A.K.

________________________________
From: Robert Lynch <robert.b.lynch at gmail.com>
To: arun <smartpink111 at yahoo.com> 
Cc: R help <r-help at r-project.org> 
Sent: Tuesday, August 20, 2013 8:00 PM
Subject: Re: [R] ave function

I tried 
> lapply(split(Clean,list(Clean$TERM,Clean$INST_NUM)),function(x) shapiro.test(x$GRADE))

 and I got

>Error in shapiro.test(x$GRADE.) : sample size must be between 3 and 5000

I also tried
with(Clean, aggregate(GRADE,list(TERM,INST_NUM),FUN=shapiro.test))

and got
  Group.1 Group.2         x
1   201001  689809 0.9546164
2   201201  689809 0.9521624
3   201301  689809 0.9106206
4   200701  994474 0.8862705
5   200710  994474 0.9176743
6   201203 1105752 0.9382688
.
.
.
72  201001 1759272 0.9291295
73  201101 1759272 0.9347072
74  201110 1897809 0.9395375
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
  corrupt data frame: columns will be truncated or padded with NAs

I am not sure how to interpret the output of the second.

Thanks!

On Tue, Aug 13, 2013 at 11:01 AM, arun <smartpink111 at yahoo.com> wrote:

Hi,
>You could try:
> lapply(split(Clean,list(Clean$TERM,Clean$INST_NUM)),function(x) shapiro.test(x$GRADE))
>A.K.
>
>----- Original Message -----
>From: Robert Lynch <robert.b.lynch at gmail.com>
>To: r-help at r-project.org
>Cc:
>Sent: Tuesday, August 13, 2013 1:46 PM
>Subject: [R] ave function
>
>I've written the following function
>CoursePrep <- function (Source, SaveName) {
>
>
>  Clean$TERM <- as.factor(Clean$TERM)
>
>  Clean$INST_NUM <- as.factor(Clean$INST_NUM)
>  Clean$zGrade <- with(Clean, ave(GRADE., list(TERM, INST_NUM), FUN =
>scale))
>  write.csv(Clean,paste(SaveName, "csv", sep ="."), row.names = FALSE)
>  return(Clean)
>}
>
>which is all well and good, but I want to throw a shapiro.test in before I
>normalize. That is, I don't really understand quite how I did (I got help)
>what I wanted to in the
>Clean$zGrade <- with(Clean, ave(GRADE., list(TERM, INST_NUM), FUN = scale))
>That code, for the whole of Clean, finds all sets of GRADE.'s that have the
>same INST_NUM and TERM, computes a mean, subtracts off the mean and divides
>by the standard deviation. For each one of those sets of grades, I would
>like to call shapiro.test() on the set, to see if it is normal *before* I
>assume it is.
>
>I know the naive
>with(Clean, shapiro.test( list(TERM, INST_NUM)))
>doesn't work. I also tried
>with(Clean, ave(GRADE., list(TERM, INST_NUM), FUN =
>function(x)shapiro.test(x)))
>
>which returns
>Error in shapiro.test(x) : sample size must be between 3 and 5000
>and I have checked that the sets selected are all of length between 3 and
>5000.
>using the following on my full data
>
>ClassSize <- with(Clean, ave(GRADE., list(TERM, INST_NUM), FUN =
>function(x)length(x)))
>> summary(ClassSize)
>   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>   22.0   198.0   241.0   244.4   279.0   466.0
>
>here is some sample data
>GRADE     TERM     INST_NUM
>1,              9,           1
>2,              9,           1
>3,              9,           1
>1.5,           8,           2
>1.75,         8,           2
>2,              8,          2
>0.5,           9,           2
>2,              9,          2
>3.5,           9,          2
>3.5,            8,         1
>3.75,          8,         1
>4,               8,          1
>
>and hopefully the code would test the following set of grades
>(1,2,3)(1.5,1.75,2)(0.5,2,3.5)(3.5,3.75,4)
>
>Thanks Robert
>