[BioC] How can I identify the closest transcript from a chromosome coordinate?
Yoo, Seungyeul
seungyeul.yoo at mssm.edu
Fri Jul 20 04:36:56 CEST 2012
Dear all,
I'm working on a DNA methylation microarray dataset. The microarray design is "pd.feinberg.hg18.me.hx1".
I used the CHARM package to estimate percentage methylation and selected the 1000 probes with the largest variance in methylation level across samples.
The 1000 probes are identified by chromosome coordinates, as follows.
> rnames[1:10]
[1] "chr1:1707145" "chr1:2148663" "chr1:3133683" "chr1:3180808" "chr1:3294081"
[6] "chr1:3470900" "chr1:3470969" "chr1:3633816" "chr1:3676205" "chr1:3720637"
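To work with these identifiers numerically, they first need to be split into a chromosome name and a position. The following is a minimal base R sketch, assuming every identifier has the form "chrN:position"; the data frame name `probes` and its column names are my own choices for illustration, not something produced by CHARM.

```r
# First three probe identifiers copied from the output above
rnames <- c("chr1:1707145", "chr1:2148663", "chr1:3133683")

# Split each "chrN:position" string at the colon
parts <- strsplit(rnames, ":", fixed = TRUE)

# Collect chromosome and integer position in one data frame
# ('probes' and its column names are illustrative, not from CHARM)
probes <- data.frame(
  probe = rnames,
  chrom = vapply(parts, `[`, character(1), 1),
  pos   = as.integer(vapply(parts, `[`, character(1), 2)),
  stringsAsFactors = FALSE
)

print(probes)
```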
Now I want to look at the expression of the genes nearest to these 1000 probes and examine the correlation between gene expression and DNA methylation.
I loaded human genome transcript information from UCSC and extracted the features of all transcripts, as follows.
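Once each probe has been paired with a gene, the correlation step could look roughly like the sketch below. All values here are made-up toy numbers, and the matrix names `meth` and `expr` are hypothetical. It assumes row i of both matrices refers to the same probe/gene pair and that the columns are the same samples in the same order. Spearman correlation is just one reasonable choice.

```r
# Hypothetical toy data (NOT real measurements):
# rows = probe/gene pairs, columns = the same four samples in the same order
meth <- rbind(probe1 = c(0.10, 0.20, 0.30, 0.40),
              probe2 = c(0.80, 0.60, 0.70, 0.20))
expr <- rbind(gene1 = c(8, 6, 4, 2),
              gene2 = c(1, 3, 5, 2))

# Both matrices must describe the same pairs and the same samples
stopifnot(nrow(meth) == nrow(expr), ncol(meth) == ncol(expr))

# One correlation coefficient per probe/gene pair, computed across samples
r <- vapply(seq_len(nrow(meth)),
            function(i) cor(meth[i, ], expr[i, ], method = "spearman"),
            numeric(1))

print(r)
```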
hg18KG <- loadFeatures("hg18_UCSC.sqlite")
tbl_tx <- select(hg18KG, keys(hg18KG, "GENEID"),
                 cols = c("GENEID", "TXNAME", "TXCHROM", "TXSTRAND", "TXSTART", "TXEND"),
                 keytype = "GENEID")
> tbl_tx[1:10,]
   GENEID     TXNAME TXCHROM TXSTRAND   TXSTART     TXEND
1       1 uc002qsd.2   chr19        -  63549984  63556677
2       1 uc002qsf.1   chr19        -  63551644  63565932
3      10 uc003wyw.1    chr8        +  18293035  18303003
4      10 uc010lte.1    chr8        +  18301794  18302666
5     100 uc002xmj.1   chr20        -  42681577  42713790
6     100 uc010ggt.1   chr20        -  42681577  42713790
7    1000 uc002kwg.1   chr18        -  23784933  24011189
8   10000 uc001iaa.2    chr1        - 241731689 241733518
9   10000 uc001hzz.1    chr1        - 241718158 242073207
10  10000 uc001iab.1    chr1        - 241733107 242073207
For each of the 1000 probes, I want to find the closest transcript starting point (TXSTART).
But I don't know how to treat strand. The raw data provide no strand information for the probes, but the transcripts do have strand information (either "+" or "-").
How can I calculate the distance from a probe coordinate to the starting point of a transcript that is on the "+" or the "-" strand?
Can I just ignore "+" or "-", which would let me treat +111111 and -111111 in the same way? My guess is that they should be different, because the genome sequence is not symmetric.
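One way to frame the strand question: in UCSC-style annotation, TXSTART and TXEND are both coordinates on the forward strand, with TXSTART <= TXEND regardless of strand. For a "-" transcript, transcription therefore begins at TXEND, not TXSTART. The sketch below is only an illustration in base R, using three chr1 rows copied from the table above. The probe position is hypothetical, chosen to make the effect visible, and the helper name `nearest_tss` and the column `TSS` are mine. The GenomicRanges package also provides `nearest()` and `distanceToNearest()` for this kind of lookup, which may be preferable on the full dataset.

```r
# Three chr1 rows copied from the tbl_tx output above
tx <- data.frame(
  GENEID   = c("10000", "10000", "10000"),
  TXNAME   = c("uc001iaa.2", "uc001hzz.1", "uc001iab.1"),
  TXCHROM  = c("chr1", "chr1", "chr1"),
  TXSTRAND = c("-", "-", "-"),
  TXSTART  = c(241731689L, 241718158L, 241733107L),
  TXEND    = c(241733518L, 242073207L, 242073207L),
  stringsAsFactors = FALSE
)

# Transcription start site: TXSTART on "+", TXEND on "-"
tx$TSS <- ifelse(tx$TXSTRAND == "+", tx$TXSTART, tx$TXEND)

# Hypothetical helper: nearest TSS on the same chromosome as the probe
nearest_tss <- function(chrom, pos, tx) {
  cand <- tx[tx$TXCHROM == chrom, , drop = FALSE]
  if (nrow(cand) == 0L) return(NULL)
  # Signed offset in the direction of transcription:
  # negative = probe is upstream of the TSS, positive = downstream
  offset <- ifelse(cand$TXSTRAND == "+", pos - cand$TSS, cand$TSS - pos)
  i <- which.min(abs(offset))  # ties resolve to the first candidate
  data.frame(TXNAME = cand$TXNAME[i], GENEID = cand$GENEID[i],
             TSS = cand$TSS[i], offset = offset[i],
             stringsAsFactors = FALSE)
}

# Hypothetical probe position (not one of the real probes)
hit <- nearest_tss("chr1", 242073000L, tx)
print(hit)

# For comparison: ignoring strand and always using TXSTART
naive <- tx$TXNAME[which.min(abs(242073000L - tx$TXSTART))]
print(naive)
```

Under these assumptions, strand does not change the absolute distance between two coordinates. It decides which end of the transcript counts as the start, and whether the probe lies upstream or downstream of it.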
I have only recently moved into genomics from a different field and have little experience working with genome sequences. Sorry for my naive question.
But any comments about this, even conceptual ones, would be very helpful for me.
Thank you.
Seungyeul Yoo
Postdoctoral Fellow
Institute of Genomics and Multiscale Biology
Department of Genetics and Genomic Sciences
Mount Sinai School of Medicine
(office) 212-659-6877