[R] About size of data frames

Bert Gunter bgunter.4567 at gmail.com
Thu Aug 14 20:55:34 CEST 2025


Rui et al.:
"real results probably depend on the functions
you want to apply to the data."

Indeed!
I would presume that one would want to analyze such data as time series of
some sort, for which I think the long form is inherently "more sensible". If
so, I would also think that you would want columns for data, sensor*,
season, and day, as Rui suggested. However, note that this presumes no
missing data, which is usually wrong. To handle this, within each day the
rows would need to be in order of the hour at which the data were recorded
(I assume twice per hour), with a missing code wherever data were missing.
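A minimal sketch of that padding step, using made-up column names (day, hour,
value) and a toy two-day series for one sensor: build the full grid of
half-hour slots, then left-join the observations onto it so absent slots
appear as NA.

```r
# Toy observations for one sensor; only 3 of the 96 half-hour slots
# over 2 days were actually recorded (column names are illustrative).
obs <- data.frame(
  day   = c(1, 1, 2),
  hour  = c(0.0, 1.0, 0.5),   # half-hour resolution: 0, 0.5, ..., 23.5
  value = c(3.2, 2.9, 4.1)
)

# Full grid of every half-hour slot for every day.
grid <- expand.grid(day = 1:2, hour = seq(0, 23.5, by = 0.5))

# Left join: slots with no observation get value = NA (the missing code).
full <- merge(grid, obs, all.x = TRUE)

# Put the rows in recording order within each day.
full <- full[order(full$day, full$hour), ]
```

After this, `full` has one row per half-hour slot per day, so positional
time-series operations line up correctly.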

*As an aside, whether the <data, season, day> data are stored as a single
data frame with an additional sensor ID column or as a list of 70 frames, one
for each sensor, is not really much of an issue these days, when gigabytes
and gigaflops are cheap and plentiful, as it is trivial to convert from one
form to the other as needed.
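To illustrate how trivial that conversion is, here is a sketch with a toy
long-format frame (column names sensor/day/value are assumptions): `split()`
goes from one long frame to a per-sensor list, and `do.call(rbind, ...)`
goes back.

```r
set.seed(1)  # reproducible toy data

# A small long-format frame: 2 sensors, 3 days each.
df_long <- data.frame(
  sensor = rep(c("s01", "s02"), each = 3),
  day    = rep(1:3, times = 2),
  value  = rnorm(6)
)

# Long frame -> list of one data frame per sensor.
df_list <- split(df_long, df_long$sensor)

# List of frames -> single long frame again.
df_back <- do.call(rbind, df_list)
rownames(df_back) <- NULL  # drop the "s01.1"-style row names rbind adds
```

Both directions are one-liners, so the choice of storage format need not be
permanent.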

Feel free to disagree -- I am just amplifying Rui's comment above; what I
"would presume" and what I "think" don't matter. What matters is Stefano's
response to his comment.

Cheers,
Bert

On Thu, Aug 14, 2025 at 10:54 AM Rui Barradas <ruipbarradas using sapo.pt> wrote:

> On 8/14/2025 12:27 PM, Stefano Sofia via R-help wrote:
> > Dear R-list users,
> >
> > let me ask you a very general question about performance of big data
> frames.
> >
> > I deal with semi-hourly meteorological data of about 70 sensors during
> 28 winter seasons.
> >
> >
> > It means that for each sensor I have 48 data points for each day, and 181
> days for each winter season (182 in case of a leap year): 48 * 181 * 28 =
> 234,576
> >
> > 234,576 * 70 = 16,420,320
> >
> >
> >  From the computational point of view, is it better to deal with a single
> data frame of approximately 16.5 M rows and 3 columns (one for date, one
> for sensor code and one for value), with a single data frame of
> approximately 235,000 rows and 141 columns, or with 70 different data
> frames of approximately 235,000 rows and 3 columns? Or doesn't it make any
> difference?
> >
> > I personally would prefer the first choice, because it would be easier
> for me to deal with a single data frame and few columns.
> >
> >
> > Thank you for your usual help
> >
> > Stefano
> >
> >
> >           (oo)
> > --oOO--( )--OOo--------------------------------------
> > Stefano Sofia MSc, PhD
> > Civil Protection Department - Marche Region - Italy
> > Meteo Section
> > Snow Section
> > Via Colle Ameno 5
> > 60126 Torrette di Ancona, Ancona (AN)
> > Uff: +39 071 806 7743
> > E-mail: stefano.sofia using regione.marche.it
> > ---Oo---------oO----------------------------------------
> >
> > ________________________________
> >
> > IMPORTANT NOTICE: This e-mail message is intended to be received only by
> persons entitled to receive the confidential information it may contain.
> E-mail messages to clients of Regione Marche may contain information that
> is confidential and legally privileged. Please do not read, copy, forward,
> or store this message unless you are an intended recipient of it. If you
> have received this message in error, please forward it to the sender and
> delete it completely from your computer system.
> >
> >
> >
> > ______________________________________________
> > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> https://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
> Hello,
>
> First of all, 48 * 181 * 28 = 243,264, not 234,576.
> And 243,264 * 70 = 17,028,480.
>
> As for the question, why don't you try it with smaller data sets?
> In the test below I tested with the sizes you posted, and the
> many-columns (wide) format is fastest, then the list of df's, then the
> 4-column (long) format.
> 4 columns because they are sensor, day, season and data.
> And the wide format df is only 72 columns wide: one for day, one for
> season and one for each sensor.
>
> The test computes mean values aggregated by day and season. When the
> data is in the long format it must also include the sensor, so there is
> an extra aggregation column.
>
> The test is very simple, real results probably depend on the functions
> you want to apply to the data.
>
>
>
> # create the test data
> makeDataLong <- function(sensor, x) {
>    # use nrow(x), not the global df1, so the function is self-contained
>    x[["data"]] <- rnorm(nrow(x))
>    cbind.data.frame(sensor, x)
> }
>
> makeDataWide <- function(sensor, x) {
>    x[[sensor]] <- rnorm(nrow(x))
>    x
> }
>
> set.seed(2025)
>
> n_per_day <- 48
> n_days <- 181
> n_seasons <- 28
> n_sensors <- 70
>
> day <- rep(1:n_days, each = n_per_day)
> season <- 1:n_seasons
> sensor_names <- sprintf("sensor_%02d", 1:n_sensors)
> df1 <- expand.grid(day = day, season = season, KEEP.OUT.ATTRS = FALSE)
>
> df_list <- lapply(1:n_sensors, makeDataLong, x = df1)
> names(df_list) <- sensor_names
> df_long <- lapply(1:n_sensors, makeDataLong, x = df1) |> do.call(rbind,
> args = _)
> df_wide <- df1
> for(s in sensor_names) {
>    df_wide <- makeDataWide(s, df_wide)
> }
>
>
> # test functions
> f <- function(x) aggregate(data ~ season + day, data = x, mean)
> g <- function(x) aggregate(data ~ sensor + season + day, data = x, mean)
> h <- function(x) aggregate(. ~ season + day, x, mean)
>
> # timings
> bench::mark(
>    list_base = lapply(df_list, f),
>    long_base = g(df_long),
>    wide_base = h(df_wide),
>    check = FALSE
> )
>
>
>
> Hope this helps,
>
> Rui Barradas
>
>

