[R] sample consecutive integers efficiently

Chris Oldmeadow c.oldmeadow at student.qut.edu.au
Fri Aug 29 00:35:40 CEST 2008



Charles C. Berry wrote:
> On Thu, 28 Aug 2008, Chris Oldmeadow wrote:
>
>> Hi all,
>>
>> I have some rough code to sample consecutive integers with length 
>> according to a vector of lengths
>>
>> #sample space (representing positions)
>> pos<-c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
>>
>> #sample lengths
>> lengths<-c(2,3,2)
>>
>> From these two vectors I need a vector of sampled positions.
>>
>>
>> the sampling is without replacement, making things tough as the 
>> sampled integers need to be consecutive. Im hoping somebody knows a 
>> faster way of doing it than I have. ATM its way to slow on large 
>> vectors.
>>
>>
>> samplePos<-function(l){
>>    start.pos<-sample(pos,1)
>>    end.pos<-start.pos+l-1
>>    posies<-start.pos:end.pos
>>    posies
>> }
>>
>> s.start<-c()
>>
>>
>> newPos<-function(a){
>>    rp<-samplePos(a)
>>    #test sampled range is consecutive, if not resample
>>    if (length(rp) != rp[a]+1 -rp[1]){rp<-samplePos(a)}
>>    pos<-setdiff(pos,rp)
>>    rp[1]
>> }
>>
>> newps<-c()
>> newps<-unlist(lapply(lengths,newPos))
>>
>> I think the bottleneck may be on the setdiff() function - the sample 
>> space is quite large so I dont think there would be too many rejections.
>
>
> The bottleneck is in the formulation of the sampling scheme.
>
> This is a simple combinatorics problem. There are 3360 possible values 
> ( prod(16:14) ) for the start positions of the three elements, and you 
> can form a bijection between 1:3360 and the individual samples. If the 
> number of possible sample is small enough, it would be most efficient 
> to sample from the corresponding integer vector and then translate it 
> to the corresponding sample. For larger values where the number of 
> possible samples become a challenge for 32-bit integer arithmetic, I 
> expect this approach would be preferred:
>
> Permute length ( pos ) - sum ( lengths ) + length( lengths ) distinct 
> (consecutively labelled) elements:
>
>     elz <- sample( length ( pos ) - sum ( lengths ) + length( lengths ) )
>
>
> Take the lengths of the original objects to be
>
>     z.lens <- rep( 1, length( elz ) )
>     z.lens[ seq(along = lengths ) ] <- lengths
>
> (i.e. objects longer than 1 appear first)
>
> Determine the start positions of the objects as if they were laid down 
> consecutively according to the permutation:
>
>     start <- head( cumsum( c(0, z.lens[ elz ]) ) + 1 , -1 )
>
> Find the start positions of just those with lengths greater than 1
>
>     gt.1 <- match( seq(along=lengths) , elz )
>
> Report the start positions
>
>     start[ gt.1 ]
>
> ---
>
> If length( pos ) is large, you can rewrite the above to simply sample 
> the positions (in the ordering) of the objects with lengths greater 
> than 1. You will have to revise the calculation of start and gt.1 in 
> that case.
>
> HTH,
>
> Chuck
>
>
>  >
>>
>>
>> Many thanks,
>> Chris
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide 
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
> Charles C. Berry                            (858) 534-2098
>                                             Dept of Family/Preventive 
> Medicine
> E mailto:cberry at tajo.ucsd.edu                UC San Diego
> http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 
> 92093-0901
>
>
>

Thanks! that worked perfectly.

Chris



More information about the R-help mailing list