[R] r-data partitioning considering two variables (character and numeric)

Bert Gunter bgunter.4567 at gmail.com
Tue Aug 28 01:50:47 CEST 2018


Sorry, my bad -- careless reading: you need to do the partitioning within
genotype.
Something like:

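by(dataGenotype, dataGenotype$Genotype, function(x) {
   ## unique stand IDs within this Genotype
   u <- unique(x$stand_ID)
   ## TRUE for rows whose stand ID was sampled into the test half
   tst <- x$stand_ID %in% sample(u, floor(length(u)/2))
   list(test = x[tst, ], train = x[!tst, ])
})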


This will give a list with one component per Genotype; each component is
itself a list holding the test and train data frame subsets for that
Genotype, split by stand ID. These lists of data frames can then be
recombined into a single test and a single train data frame by, e.g., an
appropriate rbind() call.
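
The recombination step might look roughly like the sketch below. It is only
an illustration under some assumptions: the toy data frame keeps the column
names from the original post but the rows are made up, lapply(split(...)) is
swapped in for by() because it returns a plain named list, and set.seed() is
there only to make the split repeatable.

```r
# Toy data: column names follow the original post, rows are illustrative only.
dataGenotype <- data.frame(
  Genotype = c("H13", "H13", "H13", "H13", "H15", "H15", "H15", "U21", "U21"),
  stand_ID = c(7, 7, 18, 21, 9, 20, 32, 3, 67)
)

set.seed(1)  # assumption: only used so the random split is repeatable

# Split by Genotype, then assign whole stand IDs to test or train.
parts <- lapply(split(dataGenotype, dataGenotype$Genotype), function(x) {
  u <- unique(x$stand_ID)
  # Index into u rather than calling sample(u, ...) directly.
  tst <- x$stand_ID %in% u[sample.int(length(u), floor(length(u) / 2))]
  list(test = x[tst, ], train = x[!tst, ])
})

# Stack the per-Genotype pieces into one test and one train data frame.
test  <- do.call(rbind, lapply(parts, `[[`, "test"))
train <- do.call(rbind, lapply(parts, `[[`, "train"))
```

With this layout every stand ID ends up entirely in test or entirely in
train, and each Genotype contributes floor(n/2) of its n stand IDs to test.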


HOWEVER, note that you will need to modify this function to decide what to
do if/when there is only one ID in a Genotype, as Don MacQueen already
pointed out.
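
One possible way to handle the single-ID case is sketched below. The policy
is an assumption, not something settled in this thread: a Genotype with only
one stand ID is sent wholly to test or wholly to train by a coin flip. The
helper name pick_test_ids and the Genotype "X1" are hypothetical, made up
for the illustration.

```r
# Hypothetical helper: choose which stand IDs of one Genotype go to test.
# Assumed policy: a lone stand ID goes wholly to test or train at random.
# Other policies (always train, or drop the Genotype) may suit better.
pick_test_ids <- function(ids) {
  u <- unique(ids)
  if (length(u) == 1) {
    # Coin flip: either the single ID is a test ID, or no ID is.
    if (runif(1) < 0.5) u else u[0]
  } else {
    u[sample.int(length(u), floor(length(u) / 2))]
  }
}

# Toy data; "X1" is a made-up Genotype with a single stand ID.
dataGenotype <- data.frame(
  Genotype = c("H13", "H13", "H13", "X1", "X1"),
  stand_ID = c(7, 18, 21, 5, 5)
)

parts <- lapply(split(dataGenotype, dataGenotype$Genotype), function(x) {
  tst <- x$stand_ID %in% pick_test_ids(x$stand_ID)
  list(test = x[tst, ], train = x[!tst, ])
})
```

Either way, the rows of a single-ID Genotype stay together, so no stand is
split across test and train.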

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Mon, Aug 27, 2018 at 4:09 PM Bert Gunter <bgunter.4567 using gmail.com> wrote:

> Just partition the unique stand_IDs and select on them using %in%, say:
>
> id <- unique(dataGenotype$stand_ID)
> tst <- sample(id, floor(length(id)/2))
> wh <- dataGenotype$stand_ID %in% tst ## logical vector
> test <- dataGenotype[wh,]
> train <- dataGenotype[!wh,]
>
> There are a million variations on this theme, I'm sure.
>
> -- Bert
>
>
> On Mon, Aug 27, 2018 at 3:54 PM Ahmed Attia <ahmedatia80 using gmail.com> wrote:
>
>> I would like to partition the following dataset (dataGenotype) based
>> on two variables, Genotype and stand_ID. For example, for Genotype
>> H13, stand_ID 7 may go to training and stand_IDs 18 and
>> 21 may go to testing.
>>
>> Genotype  stand_ID  Inventory_date  stemC       mheight
>> H13       7         5/18/2006       1940.1075   11.33995
>> H13       7         11/1/2008       10898.9597  23.20395
>> H13       7         4/14/2009       12830.1284  23.77395
>> H13       18        11/3/2005       2726.42     13.4432
>> H13       18        6/30/2008       12226.1554  24.091967
>> H13       18        4/14/2009       14141.68    25.0922
>> H13       21        5/18/2006       4981.7158   15.7173
>> H13       21        4/14/2009       20327.0667  27.9155
>> H15       9         3/31/2006       3570.06     14.7898
>> H15       9         11/1/2008       15138.8383  26.2088
>> H15       9         4/14/2009       17035.4688  26.8778
>> H15       20        1/18/2005       3016.881    14.1886
>> H15       20        10/4/2006       8330.4688   20.19425
>> H15       20        6/30/2008       13576.5     25.4774
>> H15       32        2/1/2006        3426.2525   14.31815
>> U21       3         1/9/2006        3660.416    15.09925
>> U21       3         6/30/2008       13236.29    24.27634
>> U21       3         4/14/2009       16124.192   25.79562
>> U21       67        11/4/2005       2812.8425   13.60485
>> U21       67        4/14/2009       13468.455   24.6203
>>
>> The desired output is the following:
>>
>> A-training
>>
>> Genotype  stand_ID  Inventory_date  stemC       mheight
>> H13       7         5/18/2006       1940.1075   11.33995
>> H13       7         11/1/2008       10898.9597  23.20395
>> H13       7         4/14/2009       12830.1284  23.77395
>> H15       9         3/31/2006       3570.06     14.7898
>> H15       9         11/1/2008       15138.8383  26.2088
>> H15       9         4/14/2009       17035.4688  26.8778
>> U21       67        11/4/2005       2812.8425   13.60485
>> U21       67        4/14/2009       13468.455   24.6203
>>
>> B-testing
>>
>> Genotype  stand_ID  Inventory_date  stemC       mheight
>> H13       18        11/3/2005       2726.42     13.4432
>> H13       18        6/30/2008       12226.1554  24.091967
>> H13       18        4/14/2009       14141.68    25.0922
>> H13       21        5/18/2006       4981.7158   15.7173
>> H13       21        4/14/2009       20327.0667  27.9155
>> H15       20        1/18/2005       3016.881    14.1886
>> H15       20        10/4/2006       8330.4688   20.19425
>> H15       20        6/30/2008       13576.5     25.4774
>> H15       32        2/1/2006        3426.2525   14.31815
>> U21       3         1/9/2006        3660.416    15.09925
>> U21       3         6/30/2008       13236.29    24.27634
>> U21       3         4/14/2009       16124.192   25.79562
>>
>> I tried the following code:
>>
>> library(caret)
>> dataPartitioning <-
>> createDataPartition(dataGenotype$stand_ID,1,list=F,p=0.2)
>> train = dataGenotype[dataPartitioning,]
>> test = dataGenotype[-dataPartitioning,]
>>
>> Also tried
>>
>> createDataPartition(unique(dataGenotype$stand_ID),1,list=F,p=0.2)
>>
>> This did not produce the desired output: the data are partitioned within
>> each stand_ID. For example, one row of stand_ID 7 goes to training and
>> two rows of stand_ID 7 go to testing. How can I partition the data by
>> Genotype and stand_ID together?
>>
>>
>>
>> Ahmed Attia
>>
>>
>
