[BioC] objective criterion for identification of outlying arrays by pca

Kevin R. Coombes kevin.r.coombes at gmail.com
Fri Nov 4 17:32:56 CET 2011


Hi,

I did say it was a long time ago, but it was *so* long ago that I had
forgotten that we did that analysis in MATLAB. We have used the idea
several times since then, so I tracked down the R code and put together
an example for you (see the attached PDF file). The only thing you need
is the "mahalanobis" function included in the attached Sweave script.
As long as your principal components routine returns an object with a
"scores" slot, this function should work.

To actually run the script, you need the "ClassDiscovery" package from
our OOMPA suite of packages, since it contains our implementation of
PCA (in the SamplePCA function).  You can install it from the R
repository at
     http://bioinformatics.mdanderson.org/OOMPA
by following the instructions at
     http://bioinformatics.mdanderson.org/Software/OOMPA
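
For readers who only want the gist of the test discussed in this thread,
here is a minimal sketch in base R. It is not the "mahalanobis" function
from the attached script: it assumes prcomp() in place of SamplePCA, runs
on simulated data, and uses a made-up choice of D = 2 components and a
hypothetical p-value cutoff of 0.01. Note that the base R function
stats::mahalanobis() already returns the *squared* distance.

```r
# Simulated data (an assumption for illustration only):
# 30 arrays (rows) by 200 features (columns); the last array is shifted
# so that it should stand out.
set.seed(2011)
n.arrays <- 30
n.features <- 200
expr <- matrix(rnorm(n.arrays * n.features), nrow = n.arrays)
expr[n.arrays, ] <- expr[n.arrays, ] + 2

# PCA with base R. Rows are samples, so pca$x holds the sample scores.
pca <- prcomp(expr, center = TRUE, scale. = FALSE)

# Keep the first D components (D = 2 is an arbitrary choice here).
D <- 2
scores <- pca$x[, 1:D, drop = FALSE]

# Squared Mahalanobis distance of each array from the center of the
# D-dimensional principal component space.
d2 <- mahalanobis(scores, center = colMeans(scores), cov = cov(scores))

# Upper-tail p-value from a chi-squared distribution with D degrees
# of freedom.
p.value <- pchisq(d2, df = D, lower.tail = FALSE)

# Flag arrays whose p-value falls below the (hypothetical) cutoff.
cutoff <- 0.01
outliers <- which(p.value < cutoff)
```

With real data you would replace expr with your own samples-by-features
matrix (for a Bioconductor ExpressionSet, something like t(exprs(eset))),
and choose D and the cutoff to suit the study. Because many arrays are
tested at once, some multiple-testing adjustment of the p-values may
also be worth considering.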

Best,
     Kevin

On 11/4/2011 8:48 AM, Richard Friedman wrote:
> Dear Kevin and List,
>
>     I read your paper with great interest, but from the paper the
> method seems to be implemented mainly in Matlab. I am not a Matlab
> user. Is there a user-friendly R version that can be used with no
> more R scripting on the part of the user than is typical of most
> Bioconductor packages?
>
> Thanks and best wishes,
> Rich
> ------------------------------------------------------------
> Richard A. Friedman, PhD
> Associate Research Scientist,
> Biomedical Informatics Shared Resource
> Herbert Irving Comprehensive Cancer Center (HICCC)
> Lecturer,
> Department of Biomedical Informatics (DBMI)
> Educational Coordinator,
> Center for Computational Biology and Bioinformatics (C2B2)/
> National Center for Multiscale Analysis of Genomic Networks (MAGNet)
> Room 824
> Irving Cancer Research Center
> Columbia University
> 1130 St. Nicholas Ave
> New York, NY 10032
> (212)851-4765 (voice)
> friedman at cancercenter.columbia.edu
> http://cancercenter.columbia.edu/~friedman/
>
> I am a Bayesian. When I see a multiple-choice question on a test and I
> don't know the answer I say "eeney-meaney-miney-moe".
>
> Rose Friedman, Age 14
>
> On Nov 2, 2011, at 11:12 AM, Kevin R. Coombes wrote:
>
>> The squared Mahalanobis distance (essentially Hotelling's T^2
>> statistic) from the center of a D-dimensional principal component
>> space should, under some sensible null hypothesis, approximately
>> follow a chi-squared distribution with D degrees of freedom.  You can
>> thus test for outliers using the p-value associated with the
>> chi-squared statistic.  (We used this idea for QC in a serum
>> proteomics study a long time ago: Coombes et al, Clin Chem 2003;
>> 49:1615-23.)
>>
>>    Kevin
>>
>> On 11/2/2011 9:11 AM, James W. MacDonald wrote:
>>> Hi Rich,
>>>
>>> On 11/2/2011 10:04 AM, Richard Friedman wrote:
>>>> Dear Bioconductor List,
>>>>
>>>>    Does anyone know of an objective criterion for the
>>>> identification of outlying arrays by PCA?
>>>
>>> I don't know an objective criterion for this. However, unless the 
>>> 'outlier' is ridiculously bad, you might be better off using array 
>>> weights to down-weight the offending array(s). In limma, the 
>>> arrayWeights() and arrayWeightsSimple() functions allow you to 
>>> generate weights that you can then feed into lmFit().
>>>
>>> Best,
>>>
>>> Jim
>>>
>>>
>>>>
>>>>    I usually do this subjectively. However, the experimental
>>>> investigator whom I am helping has a different subjective sense
>>>> than I do, so I wonder if there is a hard-and-fast criterion.
>>>>
>>>> Thanks and best wishes,
>>>> Rich
>>>>
>>>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maha-test.pdf
Type: application/pdf
Size: 190231 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/bioconductor/attachments/20111104/dfade0bb/attachment.pdf>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: maha-test.Rnw
URL: <https://stat.ethz.ch/pipermail/bioconductor/attachments/20111104/dfade0bb/attachment.pl>

