[BioC] GEOquery and GEO issues

Mon Jan 23 14:25:33 CET 2006

Dear Sean

Thank you for this valuable suggestion; using match will be the way to
go.
Sorry, I had assumed that you were at least in close contact with the
GEO people.

Best regards
Christian

==============================================
Christian Stratowa, PhD
Boehringer Ingelheim Austria
Dept NCE Lead Discovery - Bioinformatics
Dr. Boehringergasse 5-11
A-1121 Vienna, Austria
Tel.: ++43-1-80105-2470
Fax: ++43-1-80105-2782
email: christian.stratowa at vie.boehringer-ingelheim.com

-----Original Message-----
From: Sean Davis [mailto:sdavis2 at mail.nih.gov] 
Sent: Monday, January 23, 2006 14:04
To: Stratowa,Dr.,Christian FEX BIG-AT-V; Bioconductor
Subject: Re: [BioC] GEOquery and GEO issues

On 1/23/06 5:18 AM, "Christian.Stratowa at vie.boehringer-ingelheim.com"
<Christian.Stratowa at vie.boehringer-ingelheim.com> wrote:

> Dear Sean
> 
> While trying to find a parser for the GEO SOFT files I encountered your
> GEOquery package, which works great. Nevertheless, I would like to
> mention two issues which might be of general interest:
> 
> 1, Memory problems:
> I first downloaded the file 'GSE2109_family.soft.gz' from GEO
> (due to our proxy settings I cannot use getGEO for this purpose) and
> then imported it into R with:
>   gse2109 <- getGEO(filename='GSE2109_family.soft.gz')
> Although I succeeded in importing the file into R, it took 39.3
> hours on a 64-bit Opteron machine with 16 GB RAM and used 9.7 GB RAM.
> The final .Rdata file has a size of 2.0 GB. Maybe a future version of
> GEOquery could reduce both time and memory consumption.

This is obviously a problem with large GSEs.

> 2, Non-unique GEO platforms:
> I have also downloaded our own CLL dataset 'GSE2466_family.soft.gz',
> where we had to use both the Affymetrix HGU95A and HGU95Av2 chips. In
> my personal opinion it is a serious flaw of the GEO database that it
> declares both chips as the single platform GPL91.
> In your description of the GEOquery package, chapter 4.3 "Converting
> GSE to an exprSet", you supply code to make sure that all of the GSMs
> are from the same platform (see my small function below).
> Unfortunately, this is not sufficient in this case (and probably for
> other Affymetrix chips where two versions exist).
> Even though the Sample_data_row_count is different (12625 vs 12626),
> cbind simply recycles the rows.
> In this case, I could test whether Sample_data_row_count is identical
> for all chips, but in theory different chip versions may still have
> the same number of probe sets.
> One possibility would be for GEO to require submitters to supply not
> only Sample_platform_id but also a "Sample_platform_title" containing
> the name of the chip as given by the manufacturer.

Just to clarify--I am in no way affiliated with GEO and have no control
over the way their database functions or what is stored in it.  I have
simply tried to provide a means to easily parse as much of GEO data as
possible.

As for your situation, this is easily remedied:

Instead of using 'cbind' blindly (which assumes that the GPL and the
data are in the same order, which they need not be), use match first.
In fact, that is probably the safest way to do things--I'll change the
vignette. Something like this:

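 # Probe-set IDs from the first platform define the row order
 probesets <- Table(GPLList(gse)[[1]])$ID

 # For each sample, align its values to that order with match();
 # probes absent from a sample become NA instead of being recycled
 dat <- do.call('cbind', lapply(GSMList(gse), function(x) {
     tab <- Table(x)
     mymatch <- match(probesets, tab$ID_REF)
     return(tab$VALUE[mymatch])
 }))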
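
To see the difference between the two approaches without downloading
anything, here is a minimal sketch in base R. The data frames only
imitate the ID_REF and VALUE columns of a sample table; the probe IDs,
the values and the names sample_a, sample_b and platform_ids are made up
for illustration and do not come from GEO.

```r
# Toy stand-ins for two sample tables from slightly different chip versions.
# All IDs and values are invented for this illustration.
platform_ids <- c("p1", "p2", "p3", "p4")

sample_a <- data.frame(ID_REF = c("p1", "p2", "p3", "p4"),
                       VALUE  = c(10, 20, 30, 40),
                       stringsAsFactors = FALSE)
# sample_b lacks probe "p2" and lists its rows in a different order
sample_b <- data.frame(ID_REF = c("p4", "p1", "p3"),
                       VALUE  = c(41, 11, 31),
                       stringsAsFactors = FALSE)

samples <- list(a = sample_a, b = sample_b)

# Naive approach: cbind() recycles the shorter column (R only warns),
# so a row no longer refers to the same probe in both samples.
naive <- suppressWarnings(cbind(a = sample_a$VALUE, b = sample_b$VALUE))

# Aligned approach: look up each platform ID in the sample's ID_REF column.
aligned <- do.call(cbind, lapply(samples, function(tab) {
  tab$VALUE[match(platform_ids, tab$ID_REF)]
}))
rownames(aligned) <- platform_ids

# Probes that are absent from at least one sample show up as NA
missing_probes <- rownames(aligned)[!complete.cases(aligned)]
```

In the naive matrix the fourth row of column b is a recycled copy of the
first value, whereas in the aligned matrix the missing probe is an
explicit NA, which also gives a simple way to spot mixed chip versions.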

> 
> 3, Sample descriptions:
> Since most data are useless w/o the sample description, which contains
> the clinical data, it would be helpful if GEO would supply a certain
> format for adding the clinical data, so that it would be possible to
> write a parser to extract these data automatically into a table.

Again, I do not have any control over what GEO does with regard to
clinical annotation.  Where the clinical data is present, it should be
possible to write a specific function or set of functions to extract it;
writing a general function to do this is currently not possible for GSEs
for the reason that you note--there isn't a specified format.
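
As a rough idea of what such a dataset-specific function could look
like, here is a sketch in base R. It assumes, purely for illustration,
that each sample description is a single string of "key: value" pairs
separated by semicolons; GEO does not prescribe this layout, and the
sample names, keys and the helper parse_description are hypothetical.

```r
# Hypothetical sample descriptions; real GSEs may use any free-text layout.
descriptions <- c(GSM1 = "age: 61; sex: M; stage: B",
                  GSM2 = "sex: F; age: 54")

# Split one description into a named character vector (names are the keys)
parse_description <- function(x) {
  fields <- trimws(strsplit(x, ";", fixed = TRUE)[[1]])
  keys   <- trimws(sub(":.*$", "", fields))
  values <- trimws(sub("^[^:]*:", "", fields))
  setNames(values, keys)
}

parsed   <- lapply(descriptions, parse_description)
all_keys <- unique(unlist(lapply(parsed, names)))

# One row per sample, one column per key; keys a sample lacks become NA
clinical <- as.data.frame(
  do.call(rbind, lapply(parsed, function(p) unname(p[all_keys]))),
  stringsAsFactors = FALSE)
names(clinical) <- all_keys
```

A function like this only works for the one dataset whose layout it was
written for, which is the limitation described above.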

I hope this clarifies things a bit.  Thanks for the constructive
feedback.

Sean