[R] Transform a data.frame with "; " sep column and another one in a new one with the same two columns but with repetitions

João Azevedo Patrício joao.patricio at gmx.pt
Wed Jul 9 13:49:44 CEST 2014


On 05-07-2014 00:43, John McKown wrote:
> I messed up my original response by not including r-help in the
> distribution. And now I won't look as bad because, after a short nap,
> I have a new, much shorter (but more difficult, for me, to understand)
> answer.
>
> #
> # The original data is in the variable "x".
> z=data.frame(TC=x$TC,
> WC=I(mapply(strsplit,x$WC,MoreArgs=list(';'),USE.NAMES=FALSE)));
> result=data.frame(TC=rep(x$TC,sapply(z$WC,length)),WC=unlist(z$WC));
> #
>
> There may be a way to eliminate the temporary variable "z". Maybe I
> need another nap!
>
> The heart of this is the mapply, which results in a list with one
> entry per row of the original data. Each entry holds the pieces that
> strsplit() produced from that row's WC information.
>
> If this needs to be a function, then
>
> splitUp <- function(x) {
>      z=data.frame(TC=x$TC,
> WC=I(mapply(strsplit,x$WC,MoreArgs=list(';'),USE.NAMES=FALSE)));
>      result=data.frame(TC=rep(x$TC,sapply(z$WC,length)),WC=unlist(z$WC));
>      return(result);
> }
>
> Then invoke it with:
>
> flattened.result <- splitUp(original.data.frame);
>
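A small sketch of what the split-and-repeat step in the quoted answer does, using two rows copied from the sample data quoted below. The data frame name `x` follows the quoted answer; `pieces` and `long` are names made up for illustration. It assumes the WC column is character (not factor), calls strsplit() directly on the column since strsplit() is vectorised, and uses trimws() to drop the space left after each ';'. This is only one possible way to write it.

```r
# Two rows copied from the sample data in this thread.
x <- data.frame(
  TC = c(2, 0),
  WC = c("Physics, Nuclear; Physics, Particles & Fields",
         "Chemistry, Inorganic & Nuclear"),
  stringsAsFactors = FALSE
)

# strsplit() returns a list with one character vector per row.
pieces <- strsplit(x$WC, ";", fixed = TRUE)

# lengths() counts the pieces per row; rep() repeats each TC that many times,
# so every category keeps the TC of the row it came from.
long <- data.frame(
  TC = rep(x$TC, lengths(pieces)),
  WC = trimws(unlist(pieces)),
  stringsAsFactors = FALSE
)
print(long)
```

Written this way, the temporary data frame "z" is not needed; only the list of pieces is kept.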
> On Fri, Jul 4, 2014 at 7:50 AM, João Azevedo Patrício
> <joao.patricio at gmx.pt> wrote:
>> Hi,
>>
>> I've been trying to solve this issue but with no success.
>>
>> I have some data like this:
>>
>> 1 > TC  WC
>> 2 > 0   Instruments & Instrumentation; Nuclear Science & Technology;
>> Physics, Particles & Fields; Spectroscopy
>> 3 > 0   Nanoscience & Nanotechnology; Materials Science, Multidisciplinary;
>> Physics, Applied
>> 4 > 2   Physics, Nuclear; Physics, Particles & Fields
>> 5 > 0   Chemistry, Inorganic & Nuclear
>> 6 > 2   Chemistry, Physical; Materials Science, Multidisciplinary;
>> Metallurgy & Metallurgical Engineering
>>
>> And I need to have this:
>>
>> 1 > TC  WC
>> 2 > 0   Instruments & Instrumentation
>> 2 > 0   Nuclear Science & Technology
>> 2 > 0   Physics, Particles & Fields
>> 2 > 0   Spectroscopy
>> 3 > 0   Nanoscience & Nanotechnology
>> 3 > 0   Materials Science, Multidisciplinary
>> 3 > 0   Physics, Applied
>> 4 > 2   Physics, Nuclear
>> 4 > 2   Physics, Particles & Fields
>> 5 > 0   Chemistry, Inorganic & Nuclear
>> 6 > 2   Chemistry, Physical
>> 6 > 2   Materials Science, Multidisciplinary
>> 6 > 2   Metallurgy & Metallurgical Engineering
>>
>> This means repeating the row for each element in WC while keeping the
>> same value in TC. The goal is to check how many TC (sum) there are by WC,
>> when a row lists multiple WC values.
>>
>> I've tried to separate the column using strsplit but then I cannot keep
>> track of TC.
>>
>> thanks in advance.
>> --
>> João Azevedo Patrício

I've been testing it and the results are coming out nicely.

It takes a CSV exported from ISI Web of Science, processes it and produces
a table organized by WC (Web of Science category) with the number of papers
per area, citations and impact factor.

My code looks like this right now:

library(stringr)     ## for str_trim()
library(data.table)  ## for as.data.table()

## get citations and web of science categories file
isi <- read.table("file.csv", header = TRUE, sep = ";", stringsAsFactors = FALSE)
isisplit <- data.frame(TC = isi$TC,
    WC = I(mapply(strsplit, isi$WC, MoreArgs = list(';'), USE.NAMES = FALSE)))
## one row per (paper, category) pair
result <- data.frame(TC = rep(isi$TC, sapply(isisplit$WC, length)),
    WC = unlist(isisplit$WC))
## drop the space left after each ';'
result$WC <- str_trim(result$WC)
## creates a table with the list of WCategories and the specific citations
wccitations <- aggregate(result$TC, by = list(Category = result$WC), FUN = sum)
colnames(wccitations) <- c("WC", "TC")
## creates a table with the number of pubs by WCategories
wcproduction <- as.data.table(table(result$WC))
colnames(wcproduction) <- c("WC", "PUB")
wc <- data.frame(WC = wccitations$WC, PUB = wcproduction$PUB,
    TC = wccitations$TC,
    IMP = round(wcproduction$PUB / wccitations$TC, digits = 2))
## IMP is Inf when TC is 0; replace those by impact 0
wc$IMP[is.infinite(wc$IMP)] <- 0
write.table(wc, file = "file.csv", sep = ";", dec = ",")
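
The code above pairs wccitations and wcproduction by row position, which relies on both tables being sorted the same way. As a minimal sketch, assuming it is acceptable to join them by category name instead, the same summary can be built with merge() in base R. The five rows are the sample data from the start of this thread; `pieces`, `long`, `sums` and `counts` are illustrative names, not part of the original code.

```r
# Sample data from the thread: citations (TC) and "; "-separated categories (WC).
isi <- data.frame(
  TC = c(0, 0, 2, 0, 2),
  WC = c(
    "Instruments & Instrumentation; Nuclear Science & Technology; Physics, Particles & Fields; Spectroscopy",
    "Nanoscience & Nanotechnology; Materials Science, Multidisciplinary; Physics, Applied",
    "Physics, Nuclear; Physics, Particles & Fields",
    "Chemistry, Inorganic & Nuclear",
    "Chemistry, Physical; Materials Science, Multidisciplinary; Metallurgy & Metallurgical Engineering"
  ),
  stringsAsFactors = FALSE
)

# One row per (paper, category) pair.
pieces <- strsplit(isi$WC, ";", fixed = TRUE)
long <- data.frame(
  TC = rep(isi$TC, lengths(pieces)),
  WC = trimws(unlist(pieces)),
  stringsAsFactors = FALSE
)

# Citations summed per category.
sums <- aggregate(TC ~ WC, data = long, FUN = sum)

# Number of papers per category, as a data frame with columns WC and PUB.
counts <- as.data.frame(table(WC = long$WC),
                        responseName = "PUB", stringsAsFactors = FALSE)

# Join by category name rather than by row position.
wc <- merge(sums, counts, by = "WC")
print(wc)
```

Joining by name keeps each PUB next to the TC of the same category even if one of the two tables is reordered later.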


-- 
João Azevedo Patrício
Tel.: +31 91 400 53 63
Portugal
@ http://tripaforra.bl.ee

"Take 2 seconds to think before you act"


