[R] scale or not to scale that is the question - prcomp

Wed Aug 19 17:09:21 CEST 2009

Ok

Thank you for your time.

Best regards
Petr Pikal

Duncan Murdoch <murdoch at stats.uwo.ca> wrote on 19.08.2009 16:29:07:

> On 8/19/2009 10:14 AM, Petr PIKAL wrote:
> > Duncan Murdoch <murdoch at stats.uwo.ca> wrote on 19.08.2009 15:25:00:
> > 
> >> On 19/08/2009 9:02 AM, Petr PIKAL wrote:
> >> > Thank you
> >> > 
> >> > Duncan Murdoch <murdoch at stats.uwo.ca> wrote on 19.08.2009 14:49:52:
> >> > 
> >> >> On 19/08/2009 8:31 AM, Petr PIKAL wrote:
> >> >>> Dear all
> >> >>>
> >> > 
> >> > <snip>
> >> > 
> >> >> I would say the answer depends on the meaning of the variables. In the
> >> >> unusual case that they are measured in dimensionless units, it might
> >> >> make sense not to scale.  But if you are using arbitrary units of
> >> >> measurement, do you want your answer to depend on them?  For example, if
> >> >> you change from kg to mg, the numbers will become much larger, the
> >> >> variable will contribute much more variance, and it will become a more
> >> >> important part of the largest principal component.  Is that sensible?
> >> > 
> >> > Basically, the variables are percentages (all between 0 and 6%), except
> >> > dus, which is present or not present (for the purpose of prcomp it is
> >> > transformed to 0/1 by as.numeric :). The only other exception is iep,
> >> > which is basically in the range 5-8. So the ranges of all variables are
> >> > quite similar.
> >> > 
> >> > What surprises me is that I can interpret the biplot without scaling
> >> > in terms of the variables used, while the biplot with scaling is totally
> >> > different, and the two pictures do not match at all. I would have
> >> > expected just a small difference between the results from those two
> >> > settings, as all the numbers are quite comparable and do not differ much.
> >> 
> >> 
> >> If you look at the standard deviations in the two cases, I think you can
> >> see why this happens:
> >> 
> >> Scaled:
> >> 
> >> Standard deviations:
> >> [1] 1.3335175 1.2311551 1.0583667 0.7258295 0.2429397
> >> 
> >> Not Scaled:
> >> 
> >> Standard deviations:
> >> [1] 1.0030048 0.8400923 0.5679976 0.3845088 0.1531582
> >> 
> >> 
> >> The first two sds are close, so small changes to the data will affect
> > 
> > I see. But I would expect that the changes made to the data by scaling
> > would not change it in such a way that the unscaled and scaled results
> > are completely different.
> > 
> >> their direction a lot.  Your biplots look at the 2nd and 3rd components.
> > 
> > Yes, because the grouping in the 2nd and 3rd component biplot can be
> > easily explained by the values of some variables (without scaling). 
> > 
> > I must admit that I do not use prcomp very often, and scaling usually
> > gives me an "explainable" result, especially when I use it for "variable
> > reduction". Therefore I am reluctant to use it in this case.
> > 
> > When I try a "more standard" way:
> > 
> >> fit<-lm(iep~sio2+al2o3+p2o5+as.numeric(dus), data=rglp)
> >> summary(fit)
> > 
> > Call:
> > lm(formula = iep ~ sio2 + al2o3 + p2o5 + as.numeric(dus), data = rglp)
> > 
> > Residuals:
> >      Min       1Q   Median       3Q      Max 
> > -0.41751 -0.15568 -0.03613  0.20124  0.43046 
> > 
> > Coefficients:
> >                 Estimate Std. Error t value Pr(>|t|) 
> > (Intercept)      7.12085    0.62257  11.438 8.24e-08 ***
> > sio2            -0.67250    0.20953  -3.210 0.007498 ** 
> > al2o3            0.40534    0.08641   4.691 0.000522 ***
> > p2o5            -0.76909    0.11103  -6.927 1.59e-05 ***
> > as.numeric(dus) -0.64020    0.18101  -3.537 0.004094 ** 
> > 
> > I get a quite plausible result, which can be interpreted without problems.
> > 
> > My data is the result of a designed experiment (more or less :) and
> > therefore all variables are significant. Is that the reason why scaling
> > may be inappropriate in this case?
> 
> No, I think it's just that the cloud of points is approximately 
> spherical in the first 2 or 3 principal components, so the principal 
> component directions are somewhat arbitrary.  You just got lucky that 
> the 2nd and 3rd components are interpretable:  I wouldn't put too much 
> faith in being able to repeat that if you went out and collected a new 
> set of data using the same design.
> 
> Duncan Murdoch
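
The two points made in this thread can be checked with a small simulation.
The sketch below is only an illustration under stated assumptions: the data
are simulated (they are not the rglp data discussed above), and the names
d, d_mg, pc1_angle and compare_once are hypothetical helpers introduced
here. The first part rescales one variable from kg to mg to show how the
choice of units drives an unscaled prcomp fit. The second part compares the
direction of the first principal axis across independent samples when the
first two standard deviations are close and when they are well separated.

```r
set.seed(42)

## Part 1: effect of measurement units (simulated, hypothetical data)
n <- 50
d <- data.frame(mass = rnorm(n, mean = 2, sd = 0.5),  # think of this as kg
                x1   = rnorm(n, mean = 5, sd = 1),
                x2   = rnorm(n, mean = 3, sd = 1))
d_mg <- transform(d, mass = mass * 1e6)               # same data, mass in mg

p_mg <- prcomp(d_mg, scale. = FALSE)  # unscaled: depends on the units
s_kg <- prcomp(d,    scale. = TRUE)   # scaled: units should not matter
s_mg <- prcomp(d_mg, scale. = TRUE)

# Absolute loading of mass on PC1 in the unscaled fit after the unit change
mass_loading <- abs(p_mg$rotation["mass", "PC1"])
# Do the scaled fits agree regardless of units?
scaled_same <- isTRUE(all.equal(s_kg$sdev, s_mg$sdev))

## Part 2: stability of PC1 when the leading sds are close

# Angle in degrees (0 to 90) between the first principal axes of two fits;
# abs() removes the arbitrary sign of a principal component.
pc1_angle <- function(a, b) {
  cosine <- abs(sum(a$rotation[, 1] * b$rotation[, 1]))
  acos(min(1, cosine)) * 180 / pi
}

# Draw two independent samples with the given column sds and compare PC1.
compare_once <- function(sds, n = 200) {
  draw <- function() sapply(sds, function(s) rnorm(n, sd = s))
  pc1_angle(prcomp(draw()), prcomp(draw()))
}

mean_close <- mean(replicate(200, compare_once(c(1.05, 1.00))))
mean_apart <- mean(replicate(200, compare_once(c(5.00, 1.00))))

print(c(mass_loading = mass_loading,
        mean_close = mean_close, mean_apart = mean_apart))
print(scaled_same)
```

In a run of this sketch, mass_loading should be very close to 1 (the
rescaled variable takes over PC1 in the unscaled fit), the scaled fits
should agree whatever the units, and mean_close should be much larger than
mean_apart. That is the sense in which the principal component directions
are "somewhat arbitrary" when the point cloud is nearly spherical.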