[R] Is R the right choice for simulating first passage times of random walks?

R. Michael Weylandt michael.weylandt at gmail.com
Mon Aug 1 18:43:37 CEST 2011


I've only got a 20 minute layover, but three quick remarks:

1) Do a sanity check on your data size: if you want a million walks of a thousand steps, that already gets you to a billion integers to store. Even at a very low bound of one byte each, that's already 1 GB for the data, and you still have to process it all and run the OS. If you bump this to walks of length 10k, you are in big trouble.

Considered like that, it shouldn't surprise you that you are getting near memory limits. 

If you really do need such a large simulation and are willing to make the time/space tradeoff, it may be worth doing simulations in smaller batches (say 50-100) and aggregating the needed stats for analysis. Also, consider direct use of the rm() function for memory management. 
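Something like the following is what I have in mind -- an untested sketch of the batching idea, where all the names (batched.sim, n.batches, batch.size, walk.length) are just illustrative and the per-walk statistic (first non-negative epoch via which.max(), as discussed below) stands in for whatever you actually need:

```r
## Sketch: simulate in small batches, keep only the aggregated
## tabulation, and free each batch's data before simulating the next.
batched.sim <- function(n.batches = 20, batch.size = 50, walk.length = 1000) {
  counts <- integer(walk.length)
  for (b in 1:n.batches) {
    R <- matrix(sample(c(-1L, 1L), walk.length * batch.size, replace = TRUE),
                nrow = batch.size)
    R <- apply(R, 1, cumsum)             # note: walks end up in the columns
    firstHit <- apply(R >= 0, 2, which.max)  # first non-negative epoch per walk
    counts <- counts + tabulate(firstHit, walk.length)
    rm(R, firstHit); gc()                # release this batch before the next
  }
  counts
}
```

Only the running `counts` vector survives between batches, so peak memory is one batch's worth of walks rather than the whole simulation.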

2) If you know that a genuine hit in position 1 can't happen for your data, might this trick be easier than forcing the special case through extra logic inside the which.max() call?

X = which.max(...)
if (X[1] == 1) X = Inf # or whatever sentinel value you prefer
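To spell out the pitfall the guard above works around (a tiny illustration with made-up values):

```r
## which.max() of an all-FALSE logical vector returns 1, which is
## indistinguishable from a genuine hit in position 1.
y <- c(FALSE, FALSE, TRUE)
which.max(y)   # 3 -- position of the first TRUE
z <- c(FALSE, FALSE, FALSE)
which.max(z)   # 1 -- even though there is no TRUE at all

X <- which.max(z)
if (X[1] == 1 && !z[1]) X <- Inf   # reclassify "never hit" as Inf
```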

3) I don't have any texts at hand to confirm this, but isn't the expected value of the first hit time of a RW infinite? I think a hand-waving proof can be squeezed out of the optional stopping theorem with T = min(T_a, T_b) for a < 0 < b, letting a -> -Inf.

If I remember right, this suggests you are trying to calculate a CI for a distribution with no finite moments, a difficult task to say the least. 
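A quick and dirty numerical check of this claim (illustrative only, not a proof): simulate first return times to 0 with increasing cutoffs and look at the truncated sample mean. If the true mean were finite, it would stabilize; for the simple symmetric walk it tends to keep growing with the cutoff instead.

```r
## Truncated sample mean of the first return time to 0, for several
## cutoffs. NA marks walks that never return within the cutoff.
set.seed(42)
first.return <- function(len) {
  x <- cumsum(sample(c(-1L, 1L), len, replace = TRUE))
  hit <- which(x == 0)
  if (length(hit) > 0) hit[1] else NA
}
m <- sapply(c(1e2, 1e3, 1e4), function(len) {
  tau <- replicate(500, first.return(len))
  mean(tau, na.rm = TRUE)   # mean of the returns we actually observed
})
m   # tends to grow with the cutoff rather than settle down
```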

Hope these help and I'll write a more detailed reply to your notes below later,

Michael Weylandt

PS - what's an iterated RW? This is all outside my field (hence my spitball on #2 above)

PS2 - sorry about the row/column mix-up: I usually think of sample paths as rows...

On Aug 1, 2011, at 8:49 AM, Paul Menzel <paulepanter at users.sourceforge.net> wrote:

> On Sunday, 31.07.2011, at 23:32 -0500, R. Michael Weylandt wrote:
>> Glad to help -- I haven't taken a look at Dennis' solution (which may be far
>> better than mine), but if you do want to keep going down the path outlined
>> below you might consider the following:
> 
> I will try Dennis’ solution right away but looked at your suggestions
> first. Thank you very much.
> 
>> Instead of throwing away a simulation if something starts negative, why not
>> just multiply the entire sample by -1: that lets you still use the sample
>> and saves you some computations: of course you'll have to remember to adjust
>> your final results accordingly.
> 
> That is a nice suggestion. For a symmetric random walk this is indeed
> possible and equivalent to looking when the walk first hits zero.
> 
>> This might avoid the loop:
>> 
>> x = ## Whatever x is.
>> xLag = c(0,x[-length(x)]) # 'lag' x by 1 step.
>> which.max((x>=0) & (xLag <0)) + 1 # Depending on how you've decided to count
>> things, this +1 may be extraneous.
>> 
>> The inner expression produces a 0 everywhere except at a switch from
>> negative to non-negative, where it produces a 1: which.max() returns the
>> location of the first maximum, i.e. the first 1, in the vector. If the
>> run is guaranteed to start negative, then the location of the first
>> non-negative value gives you the length of the negative run.
> 
> That is the same idea as Bill's [1]. The problem is that when the walk
> never returns to zero in a sample, `which.max()` of the resulting
> all-FALSE vector returns 1 [2]. That is no problem, though, as long as
> we do not have to worry about a walk starting with a positive value,
> and the +1 can be omitted when we count the epochs of first hitting 0
> instead of how long the walk stayed negative, which is always one less.
> 
> Additionally my check `(x>=0) & (xLag <0)` is redundant when we know we
> start with a negative value. `(x>=0)` should be good enough in this
> case.
> 
>> This all gives you,
>> 
>> f4 <- function(n = 100000, # number of simulations
>>               length = 100000) # length of iterated sum
>> {
>>       R = matrix(sample(c(-1L,1L), length*n,replace=T),nrow=n)
>> 
>>       R = apply(R,1,cumsum)
>> 
>>       R[R[,1]==(1),] = -1 * R[R[,1]==(-1),] # If the first element in the row is positive, flip the entire row
> 
> The line above seems to look at the columns instead of the rows. I
> think the following is correct, since after the `apply()` above the
> random walks are in the columns:
> 
>    R[,R[1,]==(1)] = -1 * R[,R[1,]==(1)]
> 
>>       fTemp <- function(x) {
>>           xLag = c(0,x[-length(x)])
>>           return(which.max((x>=0) & (xLag <0))+1)
>>       }
>> 
>>       countNegative = apply(R,2,fTemp)
>>       tabulate(as.vector(countNegative), length)
>> }
>> 
>> That just crashed my computer though, so I wouldn't recommend it for large
>> n,length.
> 
> Welcome to my world. I would never have thought that simulating random
> walks with a length of, say, a million would create that much data and
> push common desktop systems with, say, 4 GB of RAM to their limits.
> 
>> Instead, you can help a little by combining the lagging and the &
>> all in one.
>> 
>> f4 <- function(n = 100000, length = 100000)
>> {
>>    R = matrix(sample(c(-1L,1L), length*n,replace=T),nrow=n)
>>    R = apply(R,1,cumsum)
>>    R[R[,1]==(1),] = -1 * R[R[,1]==(-1),] # If the first element in the row is positive, flip the entire row
>>    R = (cbind(rep(0,NROW(R)),R)<0)&(cbind(R,rep(0,NROW(R)))>=0)
>>    countNegative = apply(R,1,which.max) + 1
>>    return (tabulate(as.vector(countNegative), length) )
>> 
>> }
> 
> I left that one out, because as written above the check can be
> shortened.
> 
>> Of course, this is all starting to approach a very specific question that
>> could actually be approached much more efficiently if it's your end goal
>> (though I think I remember from your first email a different end goal):
> 
> That is true. But learning some optimization techniques on a simple
> example is much appreciated and will hopefully help me later with the
> iterated random walk cases.
> 
>> We can use the symmetry and "restart"ability of RW to do the following:
>> 
>> x = cumsum(sample(c(-1L,1L),BIGNUMBER,replace=T))
>> D  = diff(which(x == 0))
> 
> Nice!
> 
>> This will give you a vector of how long x stays positive or negative at a
>> time. Thinking through some simple translations lets you see that this set
>> has the same distribution as how long a RW that starts negative stays
>> negative.
> 
> I have to write those translations down. At first sight, though, we
> again need to handle the case where the walk stays negative the whole
> time: `D` then has length 0 and we have to count that as a walk longer
> than `BIGNUMBER`.
> 
>> Again, this is only good for answering a very specific question
>> about random walks and may not be useful if you have other more complicated
>> questions in sight.
> 
> Just testing for 0 will not be enough for the iterated cases, since an
> iterated random walk can go from negative to non-negative without
> being zero at that time/epoch.
> 
> I implemented all your suggestions and got the following.
> 
> -------- 8< -------- code -------- >8 --------
> f4 <- function(n = 100000, # number of simulations
>               length = 100000) # length of iterated sum
> {
>    R = matrix(sample(c(-1L,1L),length*n,replace=T),nrow=n)
>    R = apply(R,1,cumsum) ## applies cumsum to each row; note the result is transposed, so the walks end up in the columns
>    fTemp <- function(x) {
>        if (x[1] >= 0 ) {
>            return(1)
>        }
> 
>        for (i in 1:(length-1)) {
>            if (x[i] < 0 && x[i + 1] >= 0) {
>                return(as.integer(i/2 + 2))
>            }
>        }
>        return(NA) # walk never became non-negative
>    }
>    countNegative = apply(R,2,fTemp)
>    tabulate(as.vector(countNegative), length)
> }
> 
> f5 <- function(n = 100000, # number of simulations
>               length = 100000) # length of iterated sum
> {
>    R = matrix(sample(c(-1L,1L), length*n,replace=T),nrow=n)
> 
>    R = apply(R,1,cumsum)
>    R[,R[1,]==(1)] = -1 * R[,R[1,]==(1)] # If the first element in the column is positive, flip the entire column
> 
>    R <- R>=0
>    countNegative = apply(R,2,which.max)
>    tabulate(as.vector(countNegative), length)
> }
> 
> f6 <- function(n = 100000, # number of simulations
>               length = 100000) # length of iterated sum
> {
>    x = cumsum(sample(c(-1L,1L), length*n,replace=T))
>    D = diff(which(c(0, x) == 0))
>    tabulate(D, max(D))
> }
> -------- 8< -------- code -------- >8 --------
> 
> The timings differ quite a lot, which is to be expected, though.
> 
>> # f1 uses only for loops but only does half the calculations
>> # and does not yet flip random walks starting with a positive value.
>> set.seed(1) ; system.time( z1 <- f1(300, 1e5) )
>       user      system     elapsed
>      2.700       0.008       2.729
>> # f1 adapted with flips
>> set.seed(1) ; system.time( z1f <- f1withflip(300, 1e5) )
>       user      system     elapsed
>      4.457       0.004       4.475
>> set.seed(1) ; system.time( z4 <- f4(300, 1e5) )
>       user      system     elapsed
>      8.033       0.380       8.739
>> set.seed(1) ; system.time( z5 <- f5(300, 1e5) )
>       user      system     elapsed
>      9.640       0.812      10.588
>> set.seed(1) ; system.time( z6 <- f6(300, 1e5) )
>       user      system     elapsed
>      4.208       0.328       4.606
> 
> So `f6` seems to be the most efficient right now and is even slightly
> faster than `f1` with the for loops. But we have to keep in mind that
> both operate on different data sets, even though `set.seed(1)` is
> used, and that `f6` treats the problem totally differently.
> 
> One other thought: when reusing the walks starting with a positive
> term and flipping those, we can probably also take the
> backward/reverse walk (the dual problem). I will try that too.
> 
> 
> Thank you very much,
> 
> Paul
> 
> 
> [1] https://stat.ethz.ch/pipermail/r-help/2011-July/285015.html
> [2] https://stat.ethz.ch/pipermail/r-help/2011-July/285396.html
