[R] Is there an efficient way to find the overlapping, upstream and downstream ranges for a bunch of ranges?
Yao He
yao.h.1988 at gmail.com
Tue Apr 5 19:29:36 CEST 2016
I have a set of genes (about 50000) from the whole genome, which I have
read into a GenomicRanges (GRanges) object.
Each range (gene) can be seen as an observation with the columns chromosome,
start and end, like this:
      seqnames start end width strand
gene1     chr1     1   5     5      +
gene2     chr1    10  15     6      +
gene3     chr1    12  17     6      +
gene4     chr1    20  25     6      +
gene5     chr1    30  40    11      +
I am wondering whether there is an efficient way to find the *overlapping,
upstream and downstream genes for each gene in the GRanges*.
For example, assuming all_genes_gr is a GRanges object of ~50000 genes, the
result I want looks like this:
gene_name  upstream_gene  downstream_gene  overlapped_gene
gene1      NA             gene2            NA
gene2      gene1          gene4            gene3
gene3      gene1          gene4            gene2
gene4      gene3          gene5            NA
Currently, the strategy I use is as follows:
library(GenomicRanges)

find_overlapped_gene <- function(idx, all_genes_gr) {
  # cat(idx, "\n")
  curr_gene <- all_genes_gr[idx]
  other_genes <- all_genes_gr[-idx]
  # number of other genes that overlap the current gene
  n <- countOverlaps(curr_gene, other_genes)
  # the other genes that overlap the current gene
  gene <- subsetByOverlaps(other_genes, curr_gene)
  return(list(n, gene))
}

system.time(lapply(1:100, function(idx) find_overlapped_gene(idx, all_genes_gr)))
However, for 100 genes this takes about 8 s according to system.time(). That
means that with 50000 genes it would take about an hour just to find the
overlapping genes.
Is there any algorithm or strategy to do this efficiently, say 50000 genes
in ~10 min or even less?
Yao He
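
For reference, here is a minimal sketch of one vectorized way to build the table described above. It is not the poster's method. It assumes that a single findOverlaps() call on the whole object, plus follow() and precede() from GenomicRanges, give the intended meaning of "overlapped", "upstream" and "downstream" (nearest non-overlapping gene, strand-aware). The toy data are copied from the example table. The object names hits, partners, up_idx, down_idx and result are made up for illustration.

```r
library(GenomicRanges)

# Toy data copied from the example table in the post
all_genes_gr <- GRanges(
  seqnames = "chr1",
  ranges = IRanges(start = c(1, 10, 12, 20, 30),
                   end   = c(5, 15, 17, 25, 40)),
  strand = "+"
)
names(all_genes_gr) <- paste0("gene", 1:5)

# One call finds all pairwise overlaps; then drop each gene's hit on itself
hits <- findOverlaps(all_genes_gr, all_genes_gr)
hits <- hits[queryHits(hits) != subjectHits(hits)]

# Collapse the overlapping partners into one comma-separated string per gene
overlapped <- rep(NA_character_, length(all_genes_gr))
partners <- tapply(names(all_genes_gr)[subjectHits(hits)],
                   queryHits(hits), paste, collapse = ",")
overlapped[as.integer(names(partners))] <- partners

# follow()/precede() give the index of the nearest non-overlapping range
# that each gene follows/precedes (strand-aware), or NA if there is none
up_idx   <- follow(all_genes_gr, all_genes_gr)
down_idx <- precede(all_genes_gr, all_genes_gr)

result <- data.frame(
  gene_name       = names(all_genes_gr),
  upstream_gene   = names(all_genes_gr)[up_idx],
  downstream_gene = names(all_genes_gr)[down_idx],
  overlapped_gene = overlapped,
  stringsAsFactors = FALSE
)
print(result)
```

On the five example genes this should reproduce the rows of the desired table. Because each function is called once on the whole GRanges object rather than once per gene, it should scale much better than the lapply() loop, though the actual timing on ~50000 genes would need to be measured.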