[R] sampling rows with values never sampled before

Tue Jun 23 10:04:26 CEST 2015

If df is the data.frame with values and you want nn samples, then this 
is a slightly different approach:

# example data.frame:
df = data.frame(a1 = sample(1:20, 50, replace = TRUE),
                a2 = sample(seq(0.1, 10, length.out = 30), 50, replace = TRUE),
                a3 = sample(seq(0.3, 20, length.out = 20), 50, replace = TRUE))
nn = 5             # number of samples wanted (example value)
nrow = dim(df)[1]  # 50
ncol = dim(df)[2]  # 3

# start by randomizing the order in your data.frame
randomOrder = sample(1:nrow, nrow, replace = FALSE)
dff = df[randomOrder,]

# find and remove all duplicates from all columns. With this you will
# only keep the first instance of any unique value:
rem = NULL
for (ic in 1:ncol) rem = c(rem, which(duplicated(dff[, ic])))
if (length(rem) > 0) dff = dff[-unique(rem),]

# Reduce to the length you need
if (dim(dff)[1] > nn)  res = dff[1:nn,] else res = dff
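
The same steps can also be wrapped in a function. This is only a minimal sketch: the name sample_unique_rows and its arguments are made up for illustration, and it assumes the input can be converted with as.data.frame().

```r
# Hypothetical helper wrapping the steps above; the name and arguments are
# illustrative and not taken from any package.
sample_unique_rows <- function(df, nn) {
  df <- as.data.frame(df)
  # shuffle the rows so that the "first instance" kept below is a random one
  dff <- df[sample(nrow(df)), , drop = FALSE]
  # flag a row if its value in any column already occurred in an earlier row
  dup <- Reduce(`|`, lapply(dff, duplicated))
  dff <- dff[!dup, , drop = FALSE]
  # return at most nn rows; fewer if not enough rows survive
  dff[seq_len(min(nn, nrow(dff))), , drop = FALSE]
}

set.seed(1)
df <- data.frame(a1 = sample(1:20, 50, replace = TRUE),
                 a2 = sample(seq(0.1, 10, length.out = 30), 50, replace = TRUE))
res <- sample_unique_rows(df, nn = 5)
```

Note that a row is dropped as soon as one of its values has appeared earlier in the shuffled order, even if that earlier row was itself dropped, so the result can contain fewer rows than the maximum that is theoretically possible.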

I am not sure how this scales if you have really big data, or whether
you could run into FAQ 7.31 (floating-point comparison) problems,
depending on how you fill your data.frame.
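
To illustrate the FAQ 7.31 point: duplicated() compares doubles exactly, so two values that print the same can still be treated as different. A small sketch follows; rounding to 8 decimals is an arbitrary choice for this example, not a general recommendation.

```r
x <- c(0.3, 0.1 * 3)      # both print as 0.3, but differ in the last bits
x[1] == x[2]              # FALSE
duplicated(x)             # the second value is not flagged as a duplicate
duplicated(round(x, 8))   # after rounding, the second value is flagged
```

If the columns are filled with computed values rather than typed-in constants, rounding them to a sensible precision before calling duplicated() may be one way around this.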

Cheers,
Jon

On 6/23/2015 12:13 AM, C W wrote:
> Hi Jean,
>
> Thanks!
>
> Daniel,
> Yes, you are absolutely right.  I want sampled vectors to be as different
> as possible.
>
> I added a little more to the earlier data set.
>        x1  x2  x3
>  [1,]   1 3.7 2.1
>  [2,]   2 3.7 5.3
>  [3,]   3 3.7 6.2
>  [4,]   4 3.7 8.9
>  [5,]   5 3.7 4.1
>  [6,]   1 2.9 2.1
>  [7,]   2 2.9 5.3
>  [8,]   3 2.9 6.2
>  [9,]   4 2.9 8.9
> [10,]   5 2.9 4.1
> [11,]   1 5.2 2.1
> [12,]   2 5.2 5.3
> [13,]   3 5.2 6.2
> [14,]   4 5.2 8.9
> [15,]   5 5.2 4.1
>
> If I sampled rows 1, 6, and 11, solving the system of equations would not
> be possible.  So, I am avoiding "similar vectors".
>
> Thanks,
>
> Mike
>
>
> On Mon, Jun 22, 2015 at 2:19 PM, Daniel Nordlund <djnordlund at frontier.com>
> wrote:
>
>> On 6/22/2015 9:42 AM, C W wrote:
>>
>>> Hello R list,
>>>
>>> I have a question about sampling unique coordinate values.
>>>
>>> Here's what my data looks like:
>>>
>>> dat <- cbind(x1 = rep(1:5, 3), x2 = rep(c(3.7, 2.9, 5.2), each=5))
>>> dat
>>>         x1  x2
>>>    [1,]  1 3.7
>>>    [2,]  2 3.7
>>>    [3,]  3 3.7
>>>    [4,]  4 3.7
>>>    [5,]  5 3.7
>>>    [6,]  1 2.9
>>>    [7,]  2 2.9
>>>    [8,]  3 2.9
>>>    [9,]  4 2.9
>>> [10,]  5 2.9
>>> [11,]  1 5.2
>>> [12,]  2 5.2
>>> [13,]  3 5.2
>>> [14,]  4 5.2
>>> [15,]  5 5.2
>>>
>>>
>>> If I sampled (1, 3.7), then, I don't want (1, 2.9) or (2, 3.7).
>>>
>>> I want to avoid either the first or second coordinate repeated.  It leads
>>> to undefined matrix inversion.
>>>
>>> I thought of using sample(), but I am not sure about applying it to a
>>> data frame.
>>>
>>> Thanks in advance,
>>>
>>> Mike
>>>
>>>
>> I am not sure you gave us enough information to solve your real world
>> problem.  But I have a few comments and a potential solution.
>>
>> 1. In your example the unique values in x1 are completely crossed with
>> the unique values in x2.
>> 2. Since you don't want duplicates of either number, the maximum
>> number of samples that you can take is the minimum number of unique values
>> in either vector, x1 or x2 (in this case x2 with 3 unique values).
>> 3. Sample without replacement from the smallest set of unique values first.
>> 4. Sample without replacement from the larger set second.
>>
>> x <- 1:5
>> xx <- c(3.7, 2.9, 5.2)
>> s2 <- sample(xx, 2, replace=FALSE)
>> s1 <- sample(x, 2, replace=FALSE)
>> samp <- cbind(s1, s2)
>> samp
>>       s1  s2
>> [1,]  5 3.7
>> [2,]  1 5.2
>> Your actual data is probably larger, and the unique values in each vector
>> may not be completely crossed, in which case the task is a little harder.
>> In that case, you could remove values from your data as you sample.  This
>> may not be efficient, but it will work.
>>
>> smpl <- function(dat, size){
>>    mysamp <- numeric(0)
>>    for(i in 1:size) {
>>      s <- dat[sample(nrow(dat),1),]
>>      mysamp <- rbind(mysamp,s, deparse.level=0)
>>      dat <- dat[!(dat[,1]==s[1] | dat[,2]==s[2]),]
>>      }
>>    mysamp
>> }
>>
>>
>> This is just an example of how you might approach your real world
>> problem.  There is no error checking, and for large samples it may not
>> scale well.
>>
>>
>> Hope this is helpful,
>>
>> Dan
>>
>> --
>> Daniel Nordlund
>> Bothell, WA USA
>>
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>

-- 
Jon Olav Skøien
Joint Research Centre - European Commission
Institute for Environment and Sustainability (IES)
Climate Risk Management Unit

Via Fermi 2749, TP 100-01,  I-21027 Ispra (VA), ITALY

jon.skoien at jrc.ec.europa.eu
Tel:  +39 0332 789205

Disclaimer: Views expressed in this email are those of the individual and do not necessarily represent official views of the European Commission.