[R] Reading File Sizes: very slow!
Leonard Mada
|eo@m@d@ @end|ng |rom @yon|c@eu
Sun Sep 26 15:03:06 CEST 2021
Dear Bill,
- using the MS Windows Properties dialog: ~ 15 s;
[after a fresh Windows start, 1st operation, bulk size]
- using R / file.info() (2nd operation): still 523.6 s
[and R seems mostly unresponsive during this time]
Unfortunately, I do not know how to clear any cache.
[The cache may play a role only for smaller sizes? But I am not
inclined to run the ~ 10-minute procedure multiple times.]
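
Since a full run takes about 10 minutes, one way to narrow down the slow step without repeating it is to time the stages separately on a handful of packages. The sketch below is only an illustration: time.stages() and its argument n are hypothetical names, not part of the code in this thread. It assumes that file.size(), which R documents as a wrapper for file.info(..., extra_cols = FALSE), may be cheaper than the full file.info() call on Windows; that is a guess to be checked, not a measured result.

```r
# Hypothetical helper (not from the thread): time the listing step and
# the two ways of querying sizes separately, on the first n packages only.
time.stages = function(path = R.home("library"), n = 5) {
	dirs = list.dirs(path, full.names = TRUE, recursive = FALSE);
	dirs = head(dirs, n);
	# Stage 1: enumerate the files only
	t.list = system.time({
		files = list.files(dirs, full.names = TRUE,
			all.files = TRUE, recursive = TRUE);
	})[["elapsed"]];
	# Stage 2: query only the basic columns (size etc.)
	t.size = system.time({
		sizes = file.size(files);
	})[["elapsed"]];
	# Stage 3: full file.info(), including the extra columns;
	# runs after stage 2, so it may benefit from the OS cache
	t.info = system.time({
		info = file.info(files);
	})[["elapsed"]];
	list(n.files = length(files),
		total = sum(sizes, na.rm = TRUE),
		time = c(list = t.list, size = t.size, info = t.info));
}
```

Something like time.stages(n = 5) should finish quickly and show which of the three stages dominates. Because the third stage runs on files that were already queried once, its timing is probably flattering; swapping stages 2 and 3 in a fresh session would give a fairer comparison.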
Sincerely,
Leonard
On 9/26/2021 5:49 AM, Richard O'Keefe wrote:
> On a $150 second-hand laptop with 0.9GB of library,
> and a single-user installation of R so only one place to look
> LIBRARY=$HOME/R/x86_64-pc-linux-gnu-library/4.0
> cd $LIBRARY
> echo "kbytes package"
> du -sk * | sort -k1n
>
> took 150 msec to report the disc space needed for every package.
>
> That'
>
> On Sun, 26 Sept 2021 at 06:14, Bill Dunlap <williamwdunlap using gmail.com> wrote:
>> On my Windows 10 laptop I see evidence of the operating system caching
>> information about recently accessed files. This makes it hard to say how
>> the speed might be improved. Is there a way to clear this cache?
>>
>>> system.time(L1 <- size.f.pkg(R.home("library")))
>>    user  system elapsed
>>    0.48    2.81   30.42
>>> system.time(L2 <- size.f.pkg(R.home("library")))
>>    user  system elapsed
>>    0.35    1.10    1.43
>>> identical(L1,L2)
>> [1] TRUE
>>> length(L1)
>> [1] 30
>>> length(dir(R.home("library"),recursive=TRUE))
>> [1] 12949
>>
>> On Sat, Sep 25, 2021 at 8:12 AM Leonard Mada via R-help <
>> r-help using r-project.org> wrote:
>>
>>> Dear List Members,
>>>
>>>
>>> I tried to compute the file sizes of each installed package and the
>>> process is terribly slow.
>>>
>>> It took ~ 10 minutes for 512 packages / 1.6 GB total size of files.
>>>
>>>
>>> 1.) Package Sizes
>>>
>>>
>>> system.time({
>>> 	x = size.pkg(file=NULL);
>>> })
>>> # elapsed time: 509 s !!!
>>> # 512 Packages; 1.64 GB;
>>> # R 4.1.1 on MS Windows 10
>>>
>>>
>>> The code for the size.pkg() function is below and the latest version is
>>> on Github:
>>>
>>> https://github.com/discoleo/R/blob/master/Stat/Tools.CRAN.R
>>>
>>>
>>> Questions:
>>> Is there a way to get the file size faster?
>>> With the Windows Properties dialog it also takes a while, but on the
>>> order of 10-20 s, not 10 minutes.
>>> Am I missing something?
>>>
>>>
>>> 1.b.) Alternative
>>>
>>> It came to my mind to read first all file sizes and then use tapply or
>>> aggregate - but I do not see why it should be faster.
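
A minimal sketch of this alternative, for illustration only: size.all.pkg() is a hypothetical name, not a function from the GitHub file. It lists the whole library once, asks for the sizes in a single vectorised file.size() call, and then aggregates per top-level directory with tapply. Whether this is actually faster on a given machine is an assumption that would have to be timed.

```r
# Hypothetical variant of size.f.pkg(): one listing, one size query,
# then aggregation by package (= top-level directory).
size.all.pkg = function(path = R.home("library")) {
	files = list.files(path, full.names = FALSE,
		all.files = TRUE, recursive = TRUE);
	# Files located directly in path belong to no package: skip them
	isPkg = grepl("/", files, fixed = TRUE);
	files = files[isPkg];
	# The package name is the first component of the relative path
	pkg = sub("/.*$", "", files);
	sizes = file.size(file.path(path, files));
	tapply(sizes, pkg, sum, na.rm = TRUE);
}
```

Unlike size.f.pkg(), this variant does not report directories that contain no files at all, because they never appear in the listing. The result is a named vector of bytes per package, so x = size.all.pkg(); head(sort(x, decreasing = TRUE)) would show the largest packages.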
>>>
>>> Would it be meaningful to benchmark each individual package?
>>>
>>> Although I am not very inclined to wait 10 minutes for each new trial.
>>>
>>>
>>> 2.) Big Packages
>>>
>>> Just as a note: there are a few very large packages (in my list of 512
>>> packages):
>>>
>>>    Size [bytes]  Package
>>> 1   123,566,287  BH
>>> 2   113,578,391  sf
>>> 3   112,252,652  rgdal
>>> 4    81,144,868  magick
>>> 5    77,791,374  openNLPmodels.en
>>>
>>> I suspect that sf & rgdal have a lot of duplicated data structures
>>> and/or duplicate code and/or duplicated libraries - although I am not an
>>> expert in the field and did not check the sources.
>>>
>>>
>>> Sincerely,
>>>
>>>
>>> Leonard
>>>
>>> =======
>>>
>>>
>>> # Package Size:
>>> size.f.pkg = function(path=NULL) {
>>> 	if(is.null(path)) path = R.home("library");
>>> 	xd = list.dirs(path = path, full.names = FALSE, recursive = FALSE);
>>> 	size.f = function(p) {
>>> 		p = paste0(path, "/", p);
>>> 		sum(file.info(list.files(path=p, pattern=".",
>>> 			full.names = TRUE, all.files = TRUE, recursive = TRUE))$size);
>>> 	}
>>> 	sapply(xd, size.f);
>>> }
>>>
>>> size.pkg = function(path=NULL, sort=TRUE, file="Packages.Size.csv") {
>>> 	x = size.f.pkg(path=path);
>>> 	x = as.data.frame(x);
>>> 	names(x) = "Size"
>>> 	x$Name = rownames(x);
>>> 	# Order
>>> 	if(sort) {
>>> 		id = order(x$Size, decreasing=TRUE)
>>> 		x = x[id,];
>>> 	}
>>> 	if( ! is.null(file)) {
>>> 		if( ! is.character(file)) {
>>> 			print("Error: Size NOT written to file!");
>>> 		} else write.csv(x, file=file, row.names=FALSE);
>>> 	}
>>> 	return(x);
>>> }
>>>
>>> ______________________________________________
>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>