[R] Find "undirected" duplicates in a tibble
Kimmo Elo
kimmo.elo using utu.fi
Fri Aug 20 10:59:34 CEST 2021
Hi!
I am working with a large network dataset consisting of source-target
pairs stored in a tibble. I now need to transform this directed dataset
into an undirected one. This means I need to keep only one instance of
each pair with the same "nodes". In other words, if my data has one row
with A (source) and B (target) and another with B (source) and
A (target), only the pair A-B should be kept.
Here is an example of how I have solved this problem so far:
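To make the requirement concrete, here is a minimal sketch on a hypothetical three-row edge list (the names `edges`, `node1` and `node2` and the values are made up for illustration). It uses pmin() and pmax() from base R to put the smaller node first in each row, so that A-B and B-A become identical, and then drops the repeats with dplyr's distinct(). This is only one way to express the idea, and I have not checked how it behaves on large data.

```r
library(dplyr)

# Hypothetical toy edge list: rows 1 and 2 are the same undirected pair
edges <- tibble(Source = c("A", "B", "A"),
                Target = c("B", "A", "C"))

# Put the alphabetically smaller node first, then keep unique rows
undirected <- edges %>%
  mutate(node1 = pmin(Source, Target),
         node2 = pmax(Source, Target)) %>%
  distinct(node1, node2)

undirected # two rows remain: A-B and A-C
```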
--- snip ---
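# dplyr provides tibble(), %>% and the verbs used below
library(dplyr)

# Create some data
x <- tibble(Source = rep(1:3, 4),
            Target = c(rep(1, 3), rep(2, 3), rep(3, 3), rep(4, 3)))
x # print original data

# Remove "undirected" duplicates: build an order-independent key per row,
# keep one row per key, then split the key back into Source and Target
x <- x %>%
  mutate(pair = mapply(function(x, y) paste0(sort(c(x, y)), collapse = "-"),
                       Source, Target)) %>%
  distinct(pair, .keep_all = TRUE) %>%
  mutate(Source = sapply(pair, function(x) unlist(strsplit(x, split = "-"))[1]),
         Target = sapply(pair, function(x) unlist(strsplit(x, split = "-"))[2])) %>%
  select(-pair)
x # print cleaned data (Source and Target are now character columns)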
--- snip ---
A nice property of my solution is that it also allows the creation of
weighted pairs. One just needs to replace 'distinct(pair,
.keep_all=T)' with 'count(pair)'.
I have searched extensively but have not found a function providing
this functionality. Does anyone know of an alternative, perhaps a more
efficient function or solution?
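As a sketch of that weighted variant, the following rebuilds the same example data and counts the rows per undirected pair instead of keeping only the first one. The column name `weight` and the result name `w` are my own choices for illustration; count() would otherwise call the column n.

```r
library(dplyr)

# Same example data as above
x <- tibble(Source = rep(1:3, 4),
            Target = c(rep(1, 3), rep(2, 3), rep(3, 3), rep(4, 3)))

# Build the order-independent key, then count the rows per key;
# the count serves as the edge weight
w <- x %>%
  mutate(pair = mapply(function(s, t) paste0(sort(c(s, t)), collapse = "-"),
                       Source, Target)) %>%
  count(pair, name = "weight")

w # one row per undirected pair, with its weight
```

The pair column can then be split back into Source and Target in the same way as in the unweighted version.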
Best,
Kimmo Elo
--
Dr. Kimmo Elo
Senior researcher in European Studies
=====================================================
University of Turku
Centre for Parliamentary Studies
Finland
E-mail: kimmo.elo using utu.fi
=====================================================