[BioC] Question about using Biostrings & BSgenome
J.delasHeras at ed.ac.uk
Sat Sep 20 15:15:47 CEST 2008
Hi Joern,
that was useful, thank you! I have some new homework to do now. :-)
As for what I'm after exactly... it'll be various things at various
times, but I can give you one very specific example right now.
I have a human promoter array in my hands (and soon a mouse one). Each
probeset covers a region of around 2.2kb upstream and 0.5kb downstream
of the TSS.
Now... in reality, some genes have multiple TSSs... sometimes they are
close, sometimes far apart. Also, each probeset may be longer than the
2.7kb expected, for instance if you have two genes transcribed in opposite
directions from start sites a short distance apart. I want to dissect all this out.
I want to find all the genes, all the TSSs, and create "my own"
probesets (from the probes available to me in the array) based on
these TSSs and covering a region that I also define (I may choose to
create probesets just +/-400bp around the TSS, and others perhaps
covering the 1kb region located -1000 to -2000bp from the TSS) etc.
And later on I may have another requirement, depending on my findings
and whatever I may be looking for.
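
As a rough sketch of the idea in base R, assuming the TSSs and the probes are held in plain data frames (the tables, the coordinates, and the helper names tssWindow and probesInWindow are all made up for illustration), windows could be defined relative to each TSS in a strand-aware way and the probes falling inside them picked out:

```r
# Hypothetical TSS table; genes and coordinates are invented for illustration.
tss <- data.frame(
  gene   = c("geneA", "geneB"),
  chr    = c("chr1", "chr1"),
  pos    = c(10000, 50000),
  strand = c("+", "-"),
  stringsAsFactors = FALSE
)

# Build one window per TSS from offsets given in transcript orientation:
# negative offsets are upstream of the TSS, positive ones downstream.
# On the minus strand, upstream means higher genomic coordinates.
tssWindow <- function(tss, from, to) {
  stopifnot(from <= to)
  plus <- tss$strand == "+"
  data.frame(
    gene   = tss$gene,
    chr    = tss$chr,
    start  = ifelse(plus, tss$pos + from, tss$pos - to),
    end    = ifelse(plus, tss$pos + to,   tss$pos - from),
    strand = tss$strand,
    stringsAsFactors = FALSE
  )
}

core     <- tssWindow(tss, -400, 400)     # +/-400bp around each TSS
upstream <- tssWindow(tss, -2000, -1000)  # the 1kb region from -2000 to -1000

# Hypothetical probe annotation (genomic start/end of each probe).
probes <- data.frame(
  id    = c("p1", "p2", "p3", "p4"),
  chr   = "chr1",
  start = c(9700, 10350, 8500, 51500),
  end   = c(9749, 10399, 8549, 51549),
  stringsAsFactors = FALSE
)

# Ids of the probes lying entirely inside one window (a one-row data frame).
probesInWindow <- function(probes, win) {
  keep <- probes$chr == win$chr &
          probes$start >= win$start &
          probes$end <= win$end
  probes$id[keep]
}

probesInWindow(probes, core[1, ])      # probes around the geneA TSS
probesInWindow(probes, upstream[2, ])  # probes 1-2kb upstream of the geneB TSS
```

Requiring the whole probe to lie inside the window is just one possible choice; testing the probe midpoint or any overlap would work the same way.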
So I need to locate the TSSs. Then I have to decide for each gene with
multiple TSSs, which ones are just too close to make any significant
difference to my results so that I can treat them as one, and which
ones are further apart so that I treat them as distinct (different
promoter regions for a single gene). I would do that based on the TSS
locations (and orientation), so it seems simple enough. Then with
those locations, I can search the array annotation and figure out
which probes are located within the subareas I want. I can do that based
on positions alone, but I'd like to have the actual sequences (not
just the probes, but the whole region) because in some cases I am
looking for particular motifs, and even something simple like
restriction sites...
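
A minimal sketch of two of these steps in base R, under stated assumptions: the TSS table, the 500bp threshold, and the helper names clusterTSS and findSites are invented for illustration, and the sequence is a toy string rather than a real promoter. The first part groups TSSs of the same gene that lie close together; the second finds a restriction site in a sequence held as a character string.

```r
# Hypothetical TSS positions; geneA has three annotated start sites.
tss <- data.frame(
  gene   = c("geneA", "geneA", "geneA", "geneB"),
  chr    = "chr1",
  pos    = c(10000, 10150, 14000, 50000),
  strand = c("+", "+", "+", "-"),
  stringsAsFactors = FALSE
)

# Number the TSS clusters within each gene/chromosome/strand: after sorting
# by position, a TSS no more than 'maxGap' bp from its predecessor joins the
# predecessor's cluster, otherwise it starts a new one.
clusterTSS <- function(tss, maxGap = 500) {
  tss <- tss[order(tss$gene, tss$chr, tss$strand, tss$pos), ]
  key <- paste(tss$gene, tss$chr, tss$strand)
  tss$cluster <- ave(tss$pos, key,
                     FUN = function(p) cumsum(c(1, diff(p) > maxGap)))
  tss
}

clustered <- clusterTSS(tss, maxGap = 500)
clustered  # geneA: first two TSSs share cluster 1, the third is cluster 2

# Start positions of a restriction site in a sequence string.
# Plain fixed-string matching: overlapping occurrences are not reported,
# and only the given strand is searched.
findSites <- function(seq, site) {
  hits <- gregexpr(site, seq, fixed = TRUE)[[1]]
  if (hits[1] == -1) integer(0) else as.integer(hits)
}

findSites("TTGAATTCAAGGAATTCC", "GAATTC")  # EcoRI recognition site
```

Note that this kind of chaining can merge a long run of evenly spaced TSSs into one wide cluster, so the threshold needs some care. For a palindromic site such as GAATTC, searching one strand is enough; for other motifs the reverse complement would have to be searched too.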
For promoter arrays this won't apply, but I also have tiling arrays
for a couple of human chromosomes, and in this case I'd find it
interesting to separate probesets by exons, introns... Sometimes I
want to consider a region of x bp around the 5' end of the
transcript and another around the 3'...
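
For the exon/intron split, a simple sketch in base R could look like this, assuming one transcript whose exons are given as start/end coordinates (the exon table, the probe midpoints, and the helper name classifyProbe are hypothetical):

```r
# Hypothetical exon coordinates for a single transcript.
exons <- data.frame(
  start = c(1000, 2000, 3500),
  end   = c(1200, 2300, 4000)
)

# Classify a probe by its midpoint: "exon" if it falls in any exon,
# "intron" if it lies within the transcript span but in no exon,
# and "outside" otherwise.
classifyProbe <- function(mid, exons) {
  if (any(mid >= exons$start & mid <= exons$end)) return("exon")
  if (mid >= min(exons$start) && mid <= max(exons$end)) return("intron")
  "outside"
}

probeMid <- c(1100, 1500, 3999, 4500)  # made-up probe midpoints
probeClass <- vapply(probeMid, classifyProbe, character(1), exons = exons)
probeClass
```

Using the midpoint avoids having to decide what to do with probes that straddle an exon boundary, though those could also be given a class of their own.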
I already have some annotation provided, but I think it's probably
easier to look it up myself (from the probe locations & their given
sequence) and that way create the annotation I find useful for my
purposes, than to adapt whatever was given to me. Especially as it
seems (on paper) a relatively simple procedure that can now be done
entirely from R.
I will come up with more detailed questions probably once I start
applying these tools to my problems.
Jose
Quoting Joern Toedling <toedling at ebi.ac.uk>:
> Hello,
>
> Biostrings and BSgenome can certainly be used to retrieve genomic
> sequences. For instance, here's a very basic function I have used many
> times to retrieve the sequence of short genome segments on either strand
> of budding yeast.
>
> getYeastSeq <- function(chr, start, end, strand="+"){
>   # retrieve one segment at a time
>   stopifnot(length(chr)==1, length(start)==1, length(end)==1)
>   require("BSgenome.Scerevisiae.UCSC.sacCer1")
>   strand <- match.arg(strand, c("+","-"))
>   # build the sequence name; chromosome number 17 is mapped to "chrM"
>   thisSeq <- gsub("[[:space:]]","", as.character(getSeq(Scerevisiae,
>     gsub("17","M",paste("chr",chr,sep="")), start=start, end=end)))
>   # for the minus strand, return the reverse complement
>   if (strand=="-")
>     thisSeq <- as.character(reverseComplement(DNAString(thisSeq)))
>   return(thisSeq)
> }#getYeastSeq
>
> getYeastSeq(chr=2, start=200000, end=200020) ## test
>
> Biostrings offers many utility functions to work with DNA sequences. And
> you can always convert the sequences into character vectors and use
> basic R operations on those. Not sure what other games you have in mind
> when you say "play", but I guess a more precise question whether you can
> do XYZ with Biostrings or any other Bioconductor package will result in
> a more informative answer.
>
> Regards,
> Joern
>
>
> J.delasHeras at ed.ac.uk wrote:
>>
>> I haven't yet used either of these packages, but it looks like
>> something I may want to look at.
>>
>> I was wondering if I can use these packages together with something
>> like 'BSgenome.Hsapiens.UCSC.hg18' to extract sequences around every
>> TSS, for instance.
>> I have a couple of different oligo array designs, both in human and
>> mouse, and I would like to subset probes according to a number of
>> criteria, such as "promoter", "intergenic", etc...
>> I'm not yet familiar with these packages but I suspect they will
>> provide all the tools I need to extract and "play" with genomic
>> sequences.
>>
>> Am I right?
>>
>> Anybody has some examples to help me get a better overview, beyond
>> those in the vignettes?
>>
>> Thanks.
>>
>> Jose
>>
>
> --
> Joern Toedling
> EMBL - European Bioinformatics Institute
> Wellcome Trust Genome Campus
> Hinxton, Cambridge CB10 1SD
> United Kingdom
> Phone +44(0)1223 492566
> Email toedling at ebi.ac.uk
>
>
>
--
Dr. Jose I. de las Heras Email: J.delasHeras at ed.ac.uk
The Wellcome Trust Centre for Cell Biology Phone: +44 (0)131 6513374
Institute for Cell & Molecular Biology Fax: +44 (0)131 6507360
Swann Building, Mayfield Road
University of Edinburgh
Edinburgh EH9 3JR
UK
--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.