[BioC] How to get a unique line of annotation for each specific genomic position by using biomaRt package

Steve Lianoglou mailinglist.honeypot at gmail.com
Tue Feb 8 14:48:31 CET 2011


Hi,

On Tue, Feb 8, 2011 at 5:49 AM, Mao Jianfeng <jianfeng.mao at gmail.com> wrote:
> Dear listers,
>
> I am new to bioconductor.
>
> I have genomic variations (SNP, indel, CNV) coordinated by
> chromosome:start:end in GFF/BED/VCF format. One genomic variation is
> defined a specific genomic position (in base pair).
>
> for example:
> # SNPs,chr,start,end
> SNP_1,1,43,43
> SNP_2,2,56,56
>
> I would like to get such genomic variations annotated by various
> gene/protein/pathway centric annotations (as listed in BioMart
> databases). I tried R/bioconductor biomaRt package. But, I failed to
> get a unique line of annotation for a specific genomic position. Could
> you please give any directions on that?

Could you explain a bit more about what you mean when you say "get a
unique line of annotation"?

The only informative fields your `getBM` query returns are the gene id
for the location and the GO term evidence code
(go_biological_process_linkage_type). If you add, say,
"go_biological_process_id", you get the biological process GO terms
associated with the position, i.e.:

# chr, start, end and the `alyr` mart come from your own session
result <- getBM(
  attributes = c("chromosome_name", "start_position", "ensembl_gene_id",
                 "go_biological_process_linkage_type", "go_biological_process_id"),
  filters = c("chromosome_name", "start", "end"),
  values = list(chr, start, end),
  mart = alyr, uniqueRows = TRUE)

If your problem is that some positions have more than one row, like so:

chromosome_name start_position   ensembl_gene_id ... go_biological_process_id
              1          33055 scaffold_100013.1 ...               GO:0006355
              1          33055 scaffold_100013.1 ...               GO:0006886
              1          33055 scaffold_100013.1 ...               GO:0006913
              1          33055 scaffold_100013.1 ...               GO:0007165
              1          33055 scaffold_100013.1 ...               GO:0007264

This happens because the gene at that location is annotated with
multiple GO terms. If you want just one of them, you'll have to decide
how you want to pick it.

If you want to somehow summarize each chromosome/start_position into
one row, you can iterate over the data by this combination easily
with, say, the ddply function from the plyr package:

library(plyr)
summary <- ddply(result, .(chromosome_name, start_position), function(x) {
  # x will have all of the rows for a given chromosome_name / start_position
  # combo. We can arbitrarily just return the first row, but you'll likely
  # want to do something smarter:
  x[1,]
})

If you look at `summary`, you'll have one row per position.
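
As one possible "something smarter" (just a sketch, not the only way to
do it): keep every GO id for a position by pasting them into a single
string. The example uses base R's aggregate instead of ddply so it runs
without extra packages, and the `result` data frame here is a toy
stand-in built from the rows shown above, not real getBM output:

```r
# Toy stand-in for the getBM() result (values copied from the example rows;
# this is an assumption for illustration, not a real query result).
result <- data.frame(
  chromosome_name = rep(1, 5),
  start_position = rep(33055, 5),
  ensembl_gene_id = rep("scaffold_100013.1", 5),
  go_biological_process_id = c("GO:0006355", "GO:0006886", "GO:0006913",
                               "GO:0007165", "GO:0007264"),
  stringsAsFactors = FALSE
)

# Group by position and gene, then join the distinct GO ids of each group
# into one ";"-separated string, giving one row per group.
collapsed <- aggregate(
  go_biological_process_id ~ chromosome_name + start_position + ensembl_gene_id,
  data = result,
  FUN = function(ids) paste(unique(ids), collapse = ";")
)

print(collapsed)
```

`collapsed` then has a single row for position 1:33055, with all five GO
ids joined by ";" in the go_biological_process_id column.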

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact


