[R] Remove highly correlated variables from a data frame or matrix

Sat Nov 16 03:01:41 CET 2019

Try hclust(as.dist(1-calc.rho), method = "average").

Peter

On Fri, Nov 15, 2019 at 10:02 AM Ana Marija <sokovic.anamarija using gmail.com> wrote:
>
> HI Peter,
>
> Thank you for getting back to me and shedding light on this. I see
> your point, doing Jim's method:
>
> > keeprows<-apply(calc.rho,1,function(x) return(sum(x>0.8)<3))
> > ro246.lt.8<-calc.rho[keeprows,keeprows]
> > ro246.lt.8[ro246.lt.8 == 1] <- NA
> > (mmax <- max(abs(ro246.lt.8), na.rm=TRUE))
> [1] 0.566
>
> Which is good in general, correlations in my matrix  should not be
> exceeding 0.8. I need to run Mendelian Rendomization on it later on so
> I can not be having there highly correlated SNPs. But with Jim's
> method I am only left with 17 SNPs (out of 246) and that means that
> both pairs of highly correlated SNPs are removed and it would be good
> to keep one of those highly correlated ones.
>
> I tried to do your code:
> > tree = hclust(1-calc.rho, method = "average")
> Error in if (is.na(n) || n > 65536L) stop("size cannot be NA nor
> exceed 65536") :
>   missing value where TRUE/FALSE needed
>
> Please advise.
>
> Thanks
> Ana
>
> On Thu, Nov 14, 2019 at 7:37 PM Peter Langfelder
> <peter.langfelder using gmail.com> wrote:
> >
> > I suspect that you want to identify which variables are highly
> > correlated, and then keep only "representative" variables, i.e.,
> > remove redundant ones. This is a bit of a risky procedure but I have
> > done such things before as well sometimes to simplify large sets of
> > highly related variables. If your threshold of 0.8 is approximate, you
> > could simply use average linkage hierarchical clustering with
> > dissimilarity = 1-correlation, cut the tree at the appropriate height
> > (1-0.8=0.2), and from each cluster keep a single representative (e.g.,
> > the one with the highest mean correlation with other members of the
> > cluster). Something along these lines (untested)
> >
> > tree = hclust(1-calc.rho, method = "average")
> > clusts = cutree(tree, h = 0.2)
> > clustLevels = sort(unique(clusts))
> > representatives = unlist(lapply(clustLevels, function(cl)
> > {
> >   inClust = which(clusts==cl);
> >   rho1 = calc.rho[inClust, inClust, drop = FALSE];
> >   repr = inClust[ which.max(colSums(rho1)) ]
> >   repr
> > }))
> >
> > the variable representatives now contains indices of the variables you
> > want to retain, so you could subset the calc.rho matrix as
> > rho.retained = calc.rho[representatives, representatives]
> >
> > I haven't tested the code and it may contain bugs, but something along
> > these lines should get you where you want to be.
> >
> > Oh, and depending on how strict you want to be with the remaining
> > correlations, you could use complete linkage clustering (will retain
> > more variables, some correlations will be above 0.8).
> >
> > Peter
> >
> > On Thu, Nov 14, 2019 at 10:50 AM Ana Marija <sokovic.anamarija using gmail.com> wrote:
> > >
> > > Hello,
> > >
> > > I have a data frame like this (a matrix):
> > > head(calc.rho)
> > >             rs9900318 rs8069906 rs9908521 rs9908336 rs9908870 rs9895995
> > > rs56192520      0.903     0.268     0.327     0.327     0.327     0.582
> > > rs3764410       0.928     0.276     0.336     0.336     0.336     0.598
> > > rs145984817     0.975     0.309     0.371     0.371     0.371     0.638
> > > rs1807401       0.975     0.309     0.371     0.371     0.371     0.638
> > > rs1807402       0.975     0.309     0.371     0.371     0.371     0.638
> > > rs35350506      0.975     0.309     0.371     0.371     0.371     0.638
> > >
> > > > dim(calc.rho)
> > > [1] 246 246
> > >
> > > I would like to remove from this data all highly correlated variables,
> > > with correlation more than 0.8
> > >
> > > I tried this:
> > >
> > > > data<- calc.rho[,!apply(calc.rho,2,function(x) any(abs(x) > 0.80))]
> > > > dim(data)
> > > [1] 246   0
> > >
> > > Can you please advise,
> > >
> > > Thanks
> > > Ana
> > >
> > > But this removes everything.
> > >
> > > ______________________________________________
> > > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.