[R] Is there an efficient way to find the overlapping, upstream and downstream ranges for a bunch of ranges
何尧
heyao at pku.edu.cn
Tue Apr 5 19:27:48 CEST 2016
I have a set of genes (about 50,000) from the whole genome, read in as genomic ranges (a GRanges object).
Each range (gene) can be seen as an observation with the columns chromosome, start and end, like this:
      seqnames start end width strand
gene1     chr1     1   5     5      +
gene2     chr1    10  15     6      +
gene3     chr1    12  17     6      +
gene4     chr1    20  25     6      +
gene5     chr1    30  40    11      +
I am wondering whether there is an efficient way to find the overlapping, upstream and downstream genes for each gene in the GRanges object.
For example, assuming all_genes_gr is a GRanges object of ~50,000 genes, the result I want looks like this:
gene_name  upstream_gene  downstream_gene  overlapped_gene
gene1      NA             gene2            NA
gene2      gene1          gene4            gene3
gene3      gene1          gene4            gene2
gene4      gene3          gene5            NA
Currently, the strategy I use is this:
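library(GenomicRanges)

# For the gene at position idx, count and collect the other genes that overlap it.
find_overlapped_gene <- function(idx, all_genes_gr) {
  # cat(idx, "\n")
  curr_gene <- all_genes_gr[idx]
  other_genes <- all_genes_gr[-idx]
  n <- countOverlaps(curr_gene, other_genes)
  # Subset other_genes (not curr_gene) so that the overlapping genes are returned.
  gene <- subsetByOverlaps(other_genes, curr_gene)
  list(n, gene)
}

# Time the first 100 genes only.
system.time(lapply(1:100, function(idx) find_overlapped_gene(idx, all_genes_gr)))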
However, for 100 genes this takes nearly 8 s according to system.time(). That means that for 50,000 genes it would take about an hour (50000 / 100 * 8 s = 4000 s) just to find the overlapping genes.
I am wondering whether there is an algorithm or strategy to do this efficiently, ideally 50,000 genes in ~10 min or less.
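
One possible direction, offered only as a sketch: make one vectorized call over the whole object instead of one call per gene. The example below assumes that findOverlaps(), follow() and precede() from GenomicRanges match the meaning of "overlapped", "upstream" and "downstream" intended above. These functions are strand-aware, and follow() and precede() return only the nearest non-overlapping gene. The toy object is built from the five example genes; the names hits, partners, overlapped, up_idx, down_idx and result are made up for illustration.

```r
library(GenomicRanges)

# Toy data copied from the five example genes above.
all_genes_gr <- GRanges(
  seqnames = "chr1",
  ranges = IRanges(start = c(1L, 10L, 12L, 20L, 30L),
                   end = c(5L, 15L, 17L, 25L, 40L)),
  strand = "+"
)
names(all_genes_gr) <- paste0("gene", 1:5)

# All pairwise overlaps in a single call, then drop the trivial self-hits.
hits <- findOverlaps(all_genes_gr, all_genes_gr)
hits <- hits[queryHits(hits) != subjectHits(hits)]

# Collapse the overlapping partners of each gene into one string (NA if none).
partners <- split(names(all_genes_gr)[subjectHits(hits)],
                  factor(queryHits(hits), levels = seq_along(all_genes_gr)))
overlapped <- vapply(
  partners,
  function(x) if (length(x) > 0) paste(x, collapse = ",") else NA_character_,
  character(1)
)

# Nearest non-overlapping neighbour on each side (NA when there is none).
up_idx <- follow(all_genes_gr, all_genes_gr)     # gene located upstream
down_idx <- precede(all_genes_gr, all_genes_gr)  # gene located downstream

result <- data.frame(
  gene_name = names(all_genes_gr),
  upstream_gene = names(all_genes_gr)[up_idx],
  downstream_gene = names(all_genes_gr)[down_idx],
  overlapped_gene = unname(overlapped),
  stringsAsFactors = FALSE
)
print(result)
```

Because every step operates on the full GRanges object at once, the per-gene lapply() loop is not needed. Whether this reaches the ~10 min target on the real ~50,000 genes has not been measured here.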