[R] Help merging large datasets in R
Ebert,Timothy Aaron
tebert @end|ng |rom u||@edu
Thu May 8 17:24:56 CEST 2025
Fair enough.
Apologies for the oversight.
It is possible to get R to read the variable names from the file and then run an analysis of whatever those names are without specifying variable names in the code. The names can be anything the client wants because I never type in the name. I am just starting to try this approach, looping through all the variable names, and generating a formula from each index value. I want a univariate model for each variable in the dataset (or something like that).
I started in SAS. My default approach is to munge the names in Excel and then at the end un-munge them (demunge them?). Sometimes the client does not care because the changes are simple. 'pound solids per box' becomes pound_solids_per_box and '% infected' becomes Pct_infected where the client has already used "pct" as a short form for percent in some other variable.
I get confused easily. I could have a variable 'Percent E2 > 100' and that gets put into a formula that becomes 'Percent E2 > 100' = 'Number of C < 50' + Time to 1st E2 > 100' and I find the small tick marks look like a bit of dirt on my computer screen and everything goes down hill from there. If I do not start down this road, then I have fewer problems later on. I might make this PrcntE2Gt100 = NumCLt50 + TmFrstE2Gt100, but that is just my preference.
Most programming languages restrict variable names to alphanumeric characters, and I seem to recall that the name should not have a number as the first character. Putting variable names in quotes to allow for other characters is a workaround. Without that the computer will start by looking for a variable Percent if it encounters "Percent E2 > 100 = Number of C < 50 + Time to 1st E2 > 100" and failing to find that variable it returns an error.
ChatGPT gives these rules for variable names in R:
1) Start with a letter (A-Z, a-z) or a dot (.) not followed by a number.
2) Contain only letters, numbers, dots (.), and underscores (_).
3) Not contain spaces.
4) Not be a reserved word (like if, for, TRUE, NULL, etc.).
5) Avoid starting with a digit or special characters like @, $, %, etc.
6) You can avoid these restrictions by putting variables in backticks. In a limited number of cases putting variables into quotes will also work: df[["bad name"]] or assign("bad name", 5). Treating a variable name as a string is very useful in some cases. Say I have a data frame with 1000 variables, and I want to make a plot of all variables where the variable name includes "E2". This is one line of code in R: E2_vars <- df[ , grep("E2", names(df))]
When I was learning programming in the 80's the rules were more restrictive and there was no workaround. I am thankful that variable names can now be longer than 8 characters. I would suggest trying very hard to work within the outlined rules 1 through 5. Using backticks to enable invalid names can make code harder to read and debug. While it makes my brain hurt, if the aesthetic is to make things difficult then:
library(ggplot2)
df <- data.frame(
check.names = FALSE,
`mean(x)` = rnorm(100, mean = 5),
`log(y + 1)` = rnorm(100, mean = 2),
group = sample(c("A", "B"), 100, replace = TRUE)
)
print(names(df)) # "mean(x)", "log(y + 1)", "group"
ggplot(df, aes(x = `mean(x)`, y = `log(y + 1)`, color = group)) +
geom_point() +
labs(title = "Obfuscate Fake Data")
To have more fun, log transform these variables before plotting so that parts of the variable names are also part of the valid code. For me, coding is hard enough without additional complexity.
Tim
-----Original Message-----
From: Jeff Newmiller <jdnewmil using dcn.davis.ca.us>
Sent: Wednesday, May 7, 2025 8:29 PM
To: r-help using r-project.org; Ebert,Timothy Aaron <tebert using ufl.edu>; Deelen, M. (Mirjam) <m.deelen2 using student.vu.nl>; r-help using r-project.org
Subject: Re: [R] Help merging large datasets in R
[External Email]
> Variable names cannot have spaces
Please soften your words... variables can have all sorts of characters including spaces in them, but it can be inconvenient to quote them all with back-tick quotes like `merged 1` so where possible most people avoid variable names with weird characters.
People also often have to work with column names provided by an external source. It can be harder to explain to people familiar with the original data that you renamed all the columns they see because it was inconvenient not to. If you have that luxury, fine... but R can absolutely avoid munging column names if you use the relevant import parameters for your preferred import function.
On May 7, 2025 1:46:47 PM PDT, "Ebert,Timothy Aaron" <tebert using ufl.edu> wrote:
>Some issues:
>1) Variable names cannot have spaces. "merged 1" is not valid but "merged_1" is a valid alternative.
>2) You need to tell R what to merge by. It looks like you may be using data tables rather than a data frame.
>merged <- dataset2[dataset1, on = "id", nomatch = NA]
>
>3) Alternatively: join functions from the dplyr package, cbind(), and rbind() can be used in different ways to combine data.
>4) Make sure the process is successful. It looked like 200K + 600K + 2000K + 2000K = 4000K which is obviously wrong. If I merge 200 k and 600k and get back 600k, I am either not telling the whole story or I made an error. Maybe I have 200k observations of 10 variables and 600k observations of 4 variables and I get an object of 600k with 13 variables (one is lost because it was present in both and used to merge the two data sets).
>
>Please try again. Possibly start by extracting the first ten rows from each data set. Try merging that where it is easy to check. Then expand to using all the data. Your approach should work with a little modification. Alternatively make a couple of small fake data sets and play with those. Once you have the code correct there should be no problem expanding to the real data.
>
>Tim
>
>-----Original Message-----
>From: R-help <r-help-bounces using r-project.org> On Behalf Of Deelen, M.
>(Mirjam) via R-help
>Sent: Wednesday, May 7, 2025 2:54 AM
>To: r-help using r-project.org
>Subject: [R] Help merging large datasets in R
>
>[External Email]
>
>Hi guys,
>For my MSc. thesis i am using R studio. The goal is for me to merge a couple (6) of relatively large datasets (min of 200.000 and max of 2mil rows). I have now been able to do so, however I think something might be going wrong in my codes.
>For reference, i have a dataset 1 (200.000), dataset 2 (600.000), dataset 3 (2mil) and dataset 4 (2mil) merged into one dataset of 4mil, and dataset 5 (4mil) and dataset 6 (4mil) merged into one dataset of 8mil.
>What i have done so far is the following:
>
> * Merged dataset 1 and dataset 2 using the following code = merged 1 <- dataset 2[dataset 1, nomatch = NA]. This results in a dataset of 600.000 (looks to be alright).
> * Merged the dataset merged 1 and datasets 3/4 using the following code = merged 2 <- dataset 3/4[merged 1, nomatch = NA, allow.cartesian = TRUE]. This results in a dataset of 21mil (as expected). To this i have applied an additional criteria (dates in dataset 3/4 should be within 365 days of the dates in merged 1), which reduces merged 2 to around 170.000.
> * Merged the dataset merged 2 and datasets 5/6 using the following code = merged 3 <- dataset 5/6[merged 2, nomatch = NA, allow.cartesian = TRUE]. Again, this results in a dataset of 8mil (as expected). And again, to this i have applied an additional criteria (dates in dataset 5/6 should be within 365 days of the dates in merged 2), which reduces merged 3 to around 50.000.
>
>What I'm now thinking, is how can the merging + additional criteria lead to such a loss of cases ?? The first merge, of dataset 1 and dataset 2, results in an amount that I think should be the final amount of cases. I understand that by adding an additional criteria the number of possible matches when merging datasets 3/4 and 5/6 is reduced, but I'm not sure this should lead to SUCH a loss. Besides this, the additional criteria was added to reduce the duplication of information that is now happening when merging datasets 3/4 and 5/6.
>All cases appear once in dataset 1, but could appear a couple more times in the following datasets (say twice in dataset 2, four times in datasets 3/4 and 8 times in datasets 5/6). Which results in a 1 x 2 x 4 x 8 duplication of information when merging the datasets without additional criteria.
>So sum this up, my questions are=
>
> * Are there any tips as to not have this duplication ? (so I can drop the additonal criteria and the final amount of cases, probably, increases).
> * Or are there any tips as to figure out where in these steps cases are lost ?
>
>Thanks!
>Mirjam
>
>
> [[alternative HTML version deleted]]
>
>______________________________________________
>R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>https://stat/.
>ethz.ch%2Fmailman%2Flistinfo%2Fr-help&data=05%7C02%7Ctebert%40ufl.edu%7
>Cc7324be5d30d4e737f6b08dd8dc75d08%7C0d4da0f84a314d76ace60a62331e1b84%7C
>0%7C0%7C638822609607262385%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRy
>dWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3
>D%7C0%7C%7C%7C&sdata=aFe%2FQZ0sbdUUMVWtmO5IEny%2FGDfumAO99uuQFyKIjZg%3D
>&reserved=0 PLEASE do read the posting guide
>https://www.r/
>-project.org%2Fposting-guide.html&data=05%7C02%7Ctebert%40ufl.edu%7Cc73
>24be5d30d4e737f6b08dd8dc75d08%7C0d4da0f84a314d76ace60a62331e1b84%7C0%7C
>0%7C638822609607277428%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUs
>IlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C
>0%7C%7C%7C&sdata=iCNTOoN%2FDet7sxrSHYd3FoEOEx5mHGPHO8PR9NOGxjg%3D&reser
>ved=0 and provide commented, minimal, self-contained, reproducible
>code.
>
>______________________________________________
>R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>https://stat/.
>ethz.ch%2Fmailman%2Flistinfo%2Fr-help&data=05%7C02%7Ctebert%40ufl.edu%7
>Cc7324be5d30d4e737f6b08dd8dc75d08%7C0d4da0f84a314d76ace60a62331e1b84%7C
>0%7C0%7C638822609607285873%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRy
>dWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3
>D%7C0%7C%7C%7C&sdata=5gEvnpDP9A%2BXqltIbk6eKfM7HNzWNC%2BswxeyjkfVVNc%3D
>&reserved=0 PLEASE do read the posting guide
>https://www.r/
>-project.org%2Fposting-guide.html&data=05%7C02%7Ctebert%40ufl.edu%7Cc73
>24be5d30d4e737f6b08dd8dc75d08%7C0d4da0f84a314d76ace60a62331e1b84%7C0%7C
>0%7C638822609607294283%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUs
>IlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C
>0%7C%7C%7C&sdata=pKvzz%2FHe6C%2BuP2Jv9S77AogSSq6xTNhTmf3rt5K7D9c%3D&res
>erved=0 and provide commented, minimal, self-contained, reproducible
>code.
--
Sent from my phone. Please excuse my brevity.
More information about the R-help
mailing list