[BioC] Filtering out duplicate probes in Affy data

Mon Jan 14 16:33:15 CET 2013

Hi Himanshu,

On 1/12/2013 10:27 AM, Himanshu Sharma wrote:
> Thanks a lot James. I really appreciate your help. Also, when I annotate the ids, shouldn't there be the same number of probes as after filtering? How do I lose more when I annotate them?

Two possible reasons. First, having an Entrez Gene ID doesn't 
necessarily imply having a gene symbol. Second, and more likely, the 
moe430a.db package masks all probe -> symbol mappings where there are 
multiple symbols.
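
To see which of the two reasons applies to a given probeset, you can
compare a default lookup with one where the masked mappings are exposed.
What follows is only a minimal sketch on made-up data: the probeset IDs
and symbols are hypothetical, and the two objects stand in for whatever
lookups you actually run against the annotation package.

```r
# Toy stand-ins (hypothetical IDs and symbols, not real annotation).
# default_sym mimics a default-state lookup, e.g. something like
#   unlist(mget(ids, moe430aSYMBOL, ifnotfound = NA))
# all_sym mimics the same lookup with multiple mappings exposed.
ids <- c("probe_a", "probe_b", "probe_c")
default_sym <- c(probe_a = "GeneA", probe_b = NA, probe_c = NA)
all_sym <- list(
  probe_a = "GeneA",
  probe_b = c("GeneB1", "GeneB2"),  # masked by default: two symbols
  probe_c = NA_character_           # Entrez Gene ID but no symbol
)

# Number of non-missing symbols per probeset once everything is exposed.
n_sym <- vapply(all_sym[ids], function(s) sum(!is.na(s)), integer(1))

# Label each probeset with the reason it does or does not get a symbol.
status <- ifelse(!is.na(default_sym[ids]), "annotated",
          ifelse(n_sym > 1, "masked_multiple", "no_symbol"))
table(status)
```

Probesets in the masked group are the ones that come back if the
multiple mappings are exposed.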

But exposing multiple probe -> symbol mappings adds an additional level 
of complexity. As an example:

 > library(moe430a.db)
 > x <- toggleProbes(moe430aSYMBOL, "all")
 > x <- as.list(x)
 > x[sapply(x, length) > 1][1:10]
$`1415716_a_at`
[1] "Rps27"  "Gm9846"

$`1415763_a_at`
[1] "Tmem234"      "LOC100505293"

$`1415781_a_at`
[1] "Sumo2"   "Gm13430"

$`1415788_at`
[1] "Gm12663" "Ublcp1"

$`1415789_a_at`
[1] "Gm12663" "Ublcp1"

$`1415790_at`
[1] "Gm12663" "Ublcp1"

$`1415825_s_at`
[1] "Slc38a10" "Csnk1d"

$`1415875_at`
[1] "Fam60a"        "3010003L21Rik"

$`1415895_at`
[1] "Snrpn" "Snurf"

$`1415896_x_at`
[1] "Snrpn" "Snurf"

So now which symbol do you use? The first one? Is 1415896_x_at Snrpn or 
Snurf? Are these symbols synonyms? Without checking each one you are 
left with ad hoc decisions. You could just use the first one, but then 
you choose Gm12663 over Ublcp1 for probeset 1415790_at, which looks like 
the opposite of what you should be doing.
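
To make the tradeoff concrete, here is a small sketch that applies two
possible policies to two of the probesets printed above: take the first
symbol, or keep every symbol in one collapsed string. Neither is 'the'
right answer. The " /// " separator is just a choice for illustration,
and the list is typed in by hand rather than pulled from the package.

```r
# Two entries copied from the output above.
multi <- list(
  `1415790_at`   = c("Gm12663", "Ublcp1"),
  `1415896_x_at` = c("Snrpn", "Snurf")
)

# Policy 1: keep only the first symbol (simple, but arbitrary).
first_symbol <- vapply(multi, function(s) s[1], character(1))

# Policy 2: keep all symbols, collapsed into one string per probeset.
all_symbols <- vapply(multi, paste, character(1), collapse = " /// ")

data.frame(probeset  = names(multi),
           first     = first_symbol,
           collapsed = all_symbols,
           row.names = NULL)
```

Collapsing puts the decision off rather than making it, but at least
nothing is silently dropped.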

As they say, ignorance is bliss. The more you know about this stuff, the 
messier it gets, and the less clear it is what the 'right' thing to do 
might be. Or maybe that should be the 'right' thing to do in light of 
limited time to spend chasing your tail.

Best,

Jim

> Thanks,
> Himanshu
> From: James W. MacDonald [jmacdon at uw.edu]
> Sent: Saturday, January 12, 2013 9:48 AM
> To: Himanshu Sharma
> Cc: bioconductor at r-project.org mailman
> Subject: Re: [BioC] Filtering out duplicate probes in Affy data
>
> Hi Himanshu,
>
> On 1/11/2013 4:57 PM, Himanshu Sharma wrote:
>> Dear List,
>> I have a set of mouse Affy data. The platform is the Affy mouse 430a2 chip.
>> There are 8 samples, 4 for each condition.
>> I normalized the data using rma. The array has 22090 probes originally.
>> Then, in order to filter out the probes that have no Entrez ID or are duplicates for the same gene, I used the following command.
>>
>> filter <- nsFilter(eset1, require.entrez=TRUE, remove.dupEntrez=TRUE, var.func=IQR, var.filter=TRUE)
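> You are filtering on three things here. First you require that all
> probesets have an Entrez Gene ID, then you remove any duplicates, then
> you require that the interquartile range (IQR) of the remaining data be
> above a cutoff. Since you did not set var.cutoff, nsFilter uses its
> defaults (var.cutoff=0.5 with filterByQuantile=TRUE), so the cutoff is
> the 0.5 quantile of the IQRs, and roughly the half of the probesets
> with the lowest IQR is dropped at that step.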
>
> This is one way of doing things. Depending on your goals, there may be
> better or worse alternatives. If for instance you don't want to lose
> DCK, regardless of possible low variation, you could skip the variance
> filter.
>
> But 'better' is a subjective term, and you are the only one who can
> decide what is better or worse in your particular situation.
>
>> This leaves me with 6579 genes after filtering. I think I lose many of the genes here. Is there a better way to do the same?
>>
>> Also, the other problem that I am facing is that after this step, I create an expression matrix of these remaining 6579 probes.
>>
>> Now, in order to annotate them, I use the library mouse4302.db
>> I select the ids from my list and then use the following command
>> Symbol<- mouse4302SYMBOL[ids]
>>
>> This gives me fewer probes and genes. I lose more data here.
>> For example, I am interested in the gene DCK. I checked the original Affymetrix annotation file and there are 3 probes present for this gene, which means it should have annotation. But I do not find it in the final dataset.
>>
>> Can anyone suggest a better method or any corrections to the approach that I am using? I eventually need to merge this data with other Affy data and check the expression values, but I figured out that I am not getting the right amount of genes.
> There is no such thing as 'the right amount of genes'. There are only
> assumptions and tradeoffs. You can make the assumption that genes with
> an IQR below the cutoff are not really changing enough to consider, and
> then filter them out. Or you can assume that smaller variation is still
> biologically meaningful, and reduce the IQR cutoff, or eliminate it
> entirely. Or you can assume that duplicated genes on the Affy Mouse 430
> chip are really measuring different splice variants or some such, and
> you want to keep them all in the data set.
>
> All these assumptions have tradeoffs, including the possibility that you
> are wrong and you are polluting your dataset with noise, or
> unnecessarily increasing the multiplicity of your comparisons. But in
> the end it is up to the analyst to decide what assumptions are to be
> made, and to be prepared to defend those assumptions to those higher up
> (your PI, your funding source, journal reviewers, whomever).
>
> Best,
>
> Jim
>
>
>> Any help is much appreciated. I am a newbie to R and Bioconductor, so I am sorry if it is a basic question.
>> Thank you all in advance for your help.
>> Thanks,
>> Himanshu
> --
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099