[R] How to speed up interpolation

Mon Jul 18 22:35:35 CEST 2011

There is one problem. No matter what I do, I can't recover the correct
runway in the final list.
You had "rw = as.numeric(df$lrw) # index into 'levels' "

I have tried
   df$lrw = factor(df$lrw, ordered=TRUE)
    rwys = factor(unique(df$lrw), ordered=TRUE)    # Get the names of
the runways

  > rwys
  [1] 04R 27  04L 33L 15R 22L NON
  Levels: 04L < 04R < 15R < 22L < 27  < 33L < NON
  > head(df$lrw)
  [1] 04L 04L 04L 04L 04L 04L
  Levels: 04L < 04R < 15R < 22L < 27  < 33L < NON
Which seem to order things the same way.
  > rn = as.numeric(head(df$lrw))
  > rn
  [1] 1 1 1 1 1 1

So I should be able to get back my original runways with
> rwys[rn]
[1] 04R 04R 04R 04R 04R 04R
Levels: 04L < 04R < 15R < 22L < 27  < 33L < NON

So I get 04R instead of 04L
> rwys[1]
[1] 04R
Levels: 04L < 04R < 15R < 22L < 27  < 33L < NON
> rwys[2]
[1] 27
Levels: 04L < 04R < 15R < 22L < 27  < 33L < NON

I note that
> rwys = as.vector(rwys)
> rwys
[1] "04R" "27 " "04L" "33L" "15R" "22L" "NON"

So what dumb thing am I doing here? How do I reorder the original df$lrw
to match the order in rwys?

Thanks,
Jim

On 7/17/2011 10:11 PM, jim holtman wrote:
> Here is what I did; convert the data to a numeric matrix for faster
> processing.  You can convert back to a dataframe since you have the
> indices into the levels for the flights and runways.
>
>> # read in data
>> source('/temp/df/df')
>> # convert to matrix
>> df.mat <- cbind(pt = as.numeric(df$PredTime)
> +               , dt = as.numeric(df$dt)
> +               , rw = as.numeric(df$lrw)  # index into 'levels'
> +               , flight = as.numeric(df$flightfact)
> +               )
>> # create a list of row numbers for each flight for processing
>> flgt.list <- split(seq(nrow(df.mat)), df.mat[, 'flight'])
>> # remove lists with only 1 entry
>> flgt.list <- flgt.list[sapply(flgt.list, length) > 1]
>>
>> # create the interval we want data for
>> interval <- as.numeric(0:60)
>>
>> # now process the flights
>> times <- lapply(flgt.list, function(.flt){
> +     interp <- approx(df.mat[.flt, 'pt']
> +                     , df.mat[.flt, 'dt']
> +                     , xout = interval
> +                     , rule = 1
> +                     )
> +     # return vector
> +     cbind(time = interp$x
> +       , error = interp$y
> +       , runway = df.mat[.flt[1L], 'rw']
> +       , flight = df.mat[.flt[1L], 'flight']
> +       )
> + })
>> # sample output -- is this correct?
>> times[[1]]
>       time       error runway flight
>  [1,]    0          NA      2      1
>  [2,]    1          NA      2      1
>  [3,]    2 -0.13795380      2      1
>  [4,]    3 -0.20726073      2      1
>  [5,]    4 -0.27309237      2      1
>  [6,]    5 -0.33333333      2      1
>  [7,]    6 -0.09322419      2      1
>  [8,]    7  0.14688495      2      1
>  [9,]    8  0.38699409      2      1
> [10,]    9  0.62710323      2      1
> [11,]   10  0.86721237      2      1
> [12,]   11  1.10732151      2      1
> [13,]   12  1.34743065      2      1
> [14,]   13  1.58753979      2      1
> [15,]   14  1.82764893      2      1
> [16,]   15  2.06775807      2      1
> [17,]   16  2.30786721      2      1
> [18,]   17  2.54797635      2      1
> [19,]   18  6.66600000      2      1
> [20,]   19  4.82600000      2      1
> [21,]   20  3.00436508      2      1
> [22,]   21  2.22316562      2      1
> [23,]   22  1.34895178      2      1
> [24,]   23  0.47473795      2      1
> [25,]   24 -0.39947589      2      1
> [26,]   25 -1.27368973      2      1
> [27,]   26 -2.12478632      2      1
> [28,]   27 -1.61196581      2      1
> [29,]   28 -1.09914530      2      1
> [30,]   29 -0.58632479      2      1
> [31,]   30 -0.07350427      2      1
> [32,]   31  0.43931624      2      1
> [33,]   32  0.95213675      2      1
> [34,]   33  1.46495726      2      1
> [35,]   34  1.97777778      2      1
> [36,]   35  2.49059829      2      1
> [37,]   36  3.00341880      2      1
> [38,]   37  3.51623932      2      1
> [39,]   38  4.02905983      2      1
> [40,]   39  4.54188034      2      1
> [41,]   40  5.05470085      2      1
> [42,]   41  5.53360434      2      1
> [43,]   42  5.53766938      2      1
> [44,]   43  5.54173442      2      1
> [45,]   44  5.54579946      2      1
> [46,]   45  5.54986450      2      1
> [47,]   46  5.55392954      2      1
> [48,]   47  5.55799458      2      1
> [49,]   48  5.56205962      2      1
> [50,]   49  5.56612466      2      1
> [51,]   50  5.57018970      2      1
> [52,]   51  5.57425474      2      1
> [53,]   52  5.57831978      2      1
> [54,]   53  5.58238482      2      1
> [55,]   54  5.58644986      2      1
> [56,]   55  5.59051491      2      1
> [57,]   56  5.59457995      2      1
> [58,]   57  5.59864499      2      1
> [59,]   58  5.60271003      2      1
> [60,]   59  5.60677507      2      1
> [61,]   60  5.61084011      2      1
>
>
> On Sun, Jul 17, 2011 at 6:58 PM, James Rome <jamesrome at gmail.com> wrote:
>> I thought I had included the data... Here it is again.
>>
>> What I want to do is to make box and whisker plots with each flight
>> counted the same number of times in each time bin. Hence the
>> interpolation to minute time hacks.
>>
>>
>> On 7/17/2011 4:16 PM, jim holtman wrote:
>>> It would be nice if you had some sample data included so that we could
>>> see how the code worked.  Have you use Rprof on the code to see where
>>> you are spending your time?  You might want to use 'matrix' instead of
>>> 'data.frames' since there is a big performance impact with dataframes
>>> when indexing.  A little more description of the problem you are
>>> trying to solve would also be useful.  I tend to ask people "tell me
>>> what you want to do, not how you want to do it".
>>>
>>> On Sun, Jul 17, 2011 at 1:30 PM, James Rome <jamesrome at gmail.com> wrote:
>>>> df is a very large data frame with arrival estimates for many flights
>>>> (DF$flightfact) at random times (df$PredTime). The error of the estimate
>>>> is df$dt.
>>>> My problem is that I want to know the prediction error at each minute
>>>> before landing. This code works, but is very slow, and dominates
>>>> everything. I tried using split(), but that rapidly ate up my 12 GB of
>>>> memory. So, is there a better R way of doing this?
>>>>
>>>> Thanks,
>>>> Jim Rome
>>>>
>>>>    flights = table(df$flightfact[1:dim(df)[1], drop=TRUE])
>>>>    nflights = length(flights)
>>>>    flights = as.data.frame(flights)
>>>>    times = data.frame()
>>>>    # Split by flight
>>>>    for(i in 1:nflights) {
>>>>        tf = df[as.numeric(df$flightfact)==flights[i,1],]    # This flight
>>>>        #check for at least 2 entries
>>>>        if(dim(tf)[1] < 2) {
>>>>            next
>>>>        }
>>>>        idf = interpolateTimes(tf)
>>>>        times = rbind(times, idf)
>>>>    }
>>>>
>>>> # Interpolate the times to every minute for 60 minutes
>>>> # Return a new data frame
>>>> interpolateTimes = function(df) {
>>>>   x = as.numeric(seq(from=0,to=60)) # The times to interpolate to
>>>>   dti = approx(as.numeric(df$PredTime), as.numeric(df$dt), x,
>>>> method="linear",rule=1:1)
>>>>   # Make a new data frame of interpolated values
>>>>   idf = data.frame(time=dti$x, error=dti$y,
>>>>       runway=rep(df$lrw[1],length(dti$x)),
>>>> flight=rep(df$flightfact[1], length(dti$x)))
>>>>   return(idf)
>>>> }
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>>
>>>
>>
>
>