[R] Is there an efficient way to find the overlapping, upstream and downstream ranges for a bunch of ranges

David Winsemius dwinsemius at comcast.net
Wed Apr 6 03:21:03 CEST 2016


> On Apr 5, 2016, at 10:27 AM, 何尧 <heyao at pku.edu.cn> wrote:
> 
> I have a set of genes (~50,000) from the whole genome, which I read in as genomic ranges.
> 
> A range (gene) can be seen as an observation with three columns (chromosome, start and end), like this:
> 
>       seqnames start end width strand
> gene1     chr1     1   5     5      +
> gene2     chr1    10  15     6      +
> gene3     chr1    12  17     6      +
> gene4     chr1    20  25     6      +
> gene5     chr1    30  40    11      +
> 
> I am wondering whether there is an efficient way to find the overlapping, upstream and downstream genes for each gene in the GRanges object.

The data.table package (on CRAN) and the IRanges package (on Bioconductor) both provide efficient, formalized approaches to these problems.
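
As a rough sketch of what a vectorized approach might look like with GenomicRanges (which builds on IRanges): the toy ranges below are copied from the question, and the object name `overlapped` is just an illustrative choice. The idea is to call findOverlaps() once on the whole object rather than once per gene, then drop the self-hits. I have not timed this on 50,000 genes.

```r
library(GenomicRanges)

# Toy data copied from the question (5 genes on chr1, + strand)
all_genes_gr <- GRanges(
  seqnames = rep("chr1", 5),
  ranges   = IRanges(start = c(1, 10, 12, 20, 30),
                     end   = c(5, 15, 17, 25, 40)),
  strand   = rep("+", 5)
)
names(all_genes_gr) <- paste0("gene", 1:5)

# One call compares every gene against every other gene
hits <- findOverlaps(all_genes_gr, all_genes_gr)

# Each gene trivially overlaps itself, so drop those self-hits
hits <- hits[queryHits(hits) != subjectHits(hits)]

# Collect the names of the overlapping genes per query gene;
# genes with no overlap get an empty character vector
overlapped <- split(
  names(all_genes_gr)[subjectHits(hits)],
  factor(queryHits(hits), levels = seq_along(all_genes_gr))
)
names(overlapped) <- names(all_genes_gr)
```

With the toy data, only gene2 and gene3 overlap each other, which agrees with the table in the question.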


> 
> For example, assuming all_genes_gr is a genomic range of ~50,000 genes, the result I want looks like this:
> 
> gene_name  upstream_gene  downstream_gene  overlapped_gene
> gene1      NA             gene2            NA
> gene2      gene1          gene4            gene3
> gene3      gene1          gene4            gene2
> gene4      gene3          gene5            NA
> 
> Currently, the strategy I use is this:
> library(GenomicRanges)
> find_overlapped_gene <- function(idx, all_genes_gr) {
>  #cat(idx, "\n")
>  curr_gene <- all_genes_gr[idx]
>  other_genes <- all_genes_gr[-idx]
>  n <- countOverlaps(curr_gene, other_genes)
>  gene <- subsetByOverlaps(curr_gene, other_genes)
>  return(list(n, gene))
> }
> 
> system.time(lapply(1:100, function(idx)  find_overlapped_gene(idx, all_genes_gr)))
> 
> However, for 100 genes this takes nearly 8 s according to system.time(). That means that for 50,000 genes it would take about an hour just to find the overlapping genes.
> 
> I am wondering whether there is an algorithm or strategy that does this efficiently, perhaps 50,000 genes in ~10 min or even less.
> 
I suspect either approach would run far faster than that on such a small dataset.
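
For the upstream and downstream columns, a minimal sketch (again on the toy data from the question, and assuming that "upstream" and "downstream" are meant relative to strand) could use precede() and follow() from GenomicRanges. Both ignore overlapping ranges, so a gene is never reported as its own neighbour. The data frame `res` and its column names are only illustrative.

```r
library(GenomicRanges)

# Toy data copied from the question (5 genes on chr1, + strand)
all_genes_gr <- GRanges(
  seqnames = rep("chr1", 5),
  ranges   = IRanges(start = c(1, 10, 12, 20, 30),
                     end   = c(5, 15, 17, 25, 40)),
  strand   = rep("+", 5)
)
names(all_genes_gr) <- paste0("gene", 1:5)

# precede(): index of the nearest range that each gene comes before,
# i.e. its nearest non-overlapping downstream neighbour (NA if none)
down_idx <- precede(all_genes_gr, all_genes_gr)

# follow(): index of the nearest range that each gene comes after,
# i.e. its nearest non-overlapping upstream neighbour (NA if none)
up_idx <- follow(all_genes_gr, all_genes_gr)

res <- data.frame(
  gene_name       = names(all_genes_gr),
  upstream_gene   = names(all_genes_gr)[up_idx],
  downstream_gene = names(all_genes_gr)[down_idx],
  stringsAsFactors = FALSE
)
print(res)
```

On the toy data this should give the same upstream and downstream genes as the table in the question; both calls are vectorized over the whole object, so no per-gene loop is needed.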

-- 
David.




David Winsemius
Alameda, CA, USA
