[R] Reading File Sizes: very slow!
Leonard Mada
Mon Sep 27 00:31:12 CEST 2021
On 9/27/2021 1:06 AM, Leonard Mada wrote:
>
> Dear Bill,
>
>
> Does list.files() always sort the results?
>
> It seems so. The option full.names = FALSE makes no difference here:
> the results always appear to be sorted.
>
>
> Maybe it would be better to process the files in unsorted order,
> as stored on the disk?
>
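A quick check on the sorting question (a sketch; timings are illustrative): list.files() documents that its results are returned sorted, and sorting the roughly 13,000 path strings of this library takes only milliseconds, so the sort itself is unlikely to explain the slowdown.

files = list.files(R.home("library"), recursive = TRUE)
length(files)              # about 13,000 paths here (see the count further down)
system.time(sort(files))   # re-sorting the vector is essentially instantaneous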
After some more investigations:

This took only a few seconds (maybe helped by caching, but the
difference is still enormous):

path = R.home("library");   # the installed-packages library
# count the files in each installed package
sapply(list.dirs(path = path, full.names = FALSE, recursive = FALSE),
    function(f) length(list.files(path = paste0(path, "/", f),
        full.names = FALSE, recursive = TRUE)))

It seems BH contains by far the most files: 11,701 files.
But excluding it from the processing had only a linear effect: still 377 s.
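A possible follow-up (a sketch, not measured; it assumes BH is installed in the default library): time the size computation for BH alone, to see how much of the total a single file-heavy package accounts for:

p = file.path(R.home("library"), "BH")
system.time(
    sum(file.info(list.files(p, full.names = TRUE,
        all.files = TRUE, recursive = TRUE))$size)
)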
I had a look at src/main/platform.c, but do not fully understand it.
Sincerely,
Leonard
>
> Sincerely,
>
>
> Leonard
>
>
> On 9/25/2021 8:13 PM, Bill Dunlap wrote:
>> On my Windows 10 laptop I see evidence of the operating system
>> caching information about recently accessed files. This makes it
>> hard to say how the speed might be improved. Is there a way to clear
>> this cache?
>>
>> > system.time(L1 <- size.f.pkg(R.home("library")))
>>    user  system elapsed
>>    0.48    2.81   30.42
>> > system.time(L2 <- size.f.pkg(R.home("library")))
>>    user  system elapsed
>>    0.35    1.10    1.43
>> > identical(L1, L2)
>> [1] TRUE
>> > length(L1)
>> [1] 30
>> > length(dir(R.home("library"), recursive=TRUE))
>> [1] 12949
>>
>> On Sat, Sep 25, 2021 at 8:12 AM Leonard Mada via R-help
>> <r-help at r-project.org> wrote:
>>
>> Dear List Members,
>>
>>
>> I tried to compute the file sizes of each installed package and the
>> process is terribly slow.
>>
>> It took ~10 minutes for 512 packages (1.6 GB of files in total).
>>
>>
>> 1.) Package Sizes
>>
>>
>> system.time({
>>     x = size.pkg(file=NULL);
>> })
>> # elapsed time: 509 s !!!
>> # 512 Packages; 1.64 GB;
>> # R 4.1.1 on MS Windows 10
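One way to see where those 509 s go is to profile the call with Rprof() (a sketch only; the profile file name "size.pkg.prof" is arbitrary, and a second run will be affected by the caching discussed above):

Rprof("size.pkg.prof")
x = size.pkg(file = NULL)
Rprof(NULL)
head(summaryRprof("size.pkg.prof")$by.total, 5)   # which calls dominate?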
>>
>>
>> The code for the size.pkg() function is below; the latest version is
>> on GitHub:
>>
>> https://github.com/discoleo/R/blob/master/Stat/Tools.CRAN.R
>>
>>
>> Questions:
>> Is there a way to get the file size faster?
>> It takes long on Windows as well, but on the order of 10-20 s, not
>> 10 minutes.
>> Am I missing something?
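One relevant detail here (stated as an assumption to verify, not a measurement): file.size() is essentially file.info(..., extra_cols = FALSE)$size, and on Windows the extra columns of file.info() include the "exe" field, which requires reading the start of each file. A minimal sketch comparing the two on a single recursive listing of the library:

path  = R.home("library")
files = list.files(path, full.names = TRUE, all.files = TRUE, recursive = TRUE)
system.time(s1 <- sum(file.info(files)$size, na.rm = TRUE))
system.time(s2 <- sum(file.size(files),      na.rm = TRUE))
s1 == s2    # the two totals should agree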
>>
>>
>> 1.b.) Alternative
>>
>> It occurred to me to read all the file sizes first and then use
>> tapply or aggregate - but I do not see why that should be faster.
>>
>> Would it be meaningful to benchmark each individual package?
>> Although I am not very inclined to wait 10 minutes for each new
>> attempt.
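A rough sketch of the idea in 1.b above (the name size.pkg.onepass is made up here, and whether it is actually faster is untested): list every file once, take the first path component as the package name, and aggregate the sizes with tapply():

# one recursive listing of the library, then aggregate sizes per package
size.pkg.onepass = function(path = R.home("library")) {
    files = list.files(path, full.names = FALSE,
        all.files = TRUE, recursive = TRUE);
    pkg   = sub("/.*", "", files);              # top-level directory = package
    size  = file.size(file.path(path, files));  # one size query per file
    sort(tapply(size, pkg, sum, na.rm = TRUE), decreasing = TRUE);
}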
>>
>>
>> 2.) Big Packages
>>
>> Just as a note: there are a few very large packages (in my list of
>> 512 packages), sizes in bytes:
>>
>> 1  123,566,287  BH
>> 2  113,578,391  sf
>> 3  112,252,652  rgdal
>> 4   81,144,868  magick
>> 5   77,791,374  openNLPmodels.en
>>
>> I suspect that sf & rgdal contain a lot of duplicated data structures,
>> duplicated code and/or duplicated libraries - although I am not an
>> expert in the field and did not check the sources.
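To see where the bulk of such a package actually lives, one could list its largest files (a sketch, using sf as an example and assuming it is installed in the default library):

p  = file.path(R.home("library"), "sf")
fi = file.info(list.files(p, full.names = TRUE, recursive = TRUE))
head(fi[order(fi$size, decreasing = TRUE), "size", drop = FALSE], 10)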
>>
>>
>> Sincerely,
>>
>>
>> Leonard
>>
>> =======
>>
>>
>> # Package Size:
>> # Total size (in bytes) of the files of each installed package.
>> size.f.pkg = function(path=NULL) {
>>     if(is.null(path)) path = R.home("library");
>>     xd = list.dirs(path=path, full.names = FALSE, recursive = FALSE);
>>     # sum of the file sizes inside one package directory
>>     size.f = function(p) {
>>         p = paste0(path, "/", p);
>>         sum(file.info(list.files(path=p, pattern=".",
>>             full.names = TRUE, all.files = TRUE, recursive = TRUE))$size);
>>     }
>>     sapply(xd, size.f);
>> }
>>
>> size.pkg = function(path=NULL, sort=TRUE, file="Packages.Size.csv") {
>>     x = size.f.pkg(path=path);
>>     x = as.data.frame(x);
>>     names(x) = "Size";
>>     x$Name = rownames(x);
>>     # Order by size (largest first)
>>     if(sort) {
>>         id = order(x$Size, decreasing=TRUE);
>>         x = x[id, ];
>>     }
>>     # optionally write the result to a csv file
>>     if( ! is.null(file)) {
>>         if( ! is.character(file)) {
>>             print("Error: Size NOT written to file!");
>>         } else write.csv(x, file=file, row.names=FALSE);
>>     }
>>     return(x);
>> }
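For reference, a minimal call mirroring the timing further up in this message (no CSV is written when file = NULL):

x = size.pkg(file = NULL)   # compute only, do not write a csv
head(x)                     # largest packages first (sort = TRUE by default)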
>>