[R] Is that an efficient way to find the overlapped , upstream and downstream ranges for a bunch of ranges
David Winsemius
dwinsemius at comcast.net
Wed Apr 6 03:21:03 CEST 2016
> On Apr 5, 2016, at 10:27 AM, 何尧 <heyao at pku.edu.cn> wrote:
>
> I have a set of genes (about 50,000) from the whole genome, read in as genomic ranges.
>
> Each range (gene) can be seen as an observation with three columns (chromosome, start and end), like this:
>
>       seqnames start end width strand
> gene1     chr1     1   5     5      +
> gene2     chr1    10  15     6      +
> gene3     chr1    12  17     6      +
> gene4     chr1    20  25     6      +
> gene5     chr1    30  40    11      +
>
> I am wondering whether there is an efficient way to find the overlapping, upstream and downstream genes for each gene in the GRanges object.
The data.table package (on CRAN) and the IRanges package (on Bioconductor) have formalized efficient approaches to those problems.
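A minimal sketch of the data.table route, assuming the five example genes from your message and plain integer coordinates. The table and column names here (genes, subject, query, query_gene, ov_pairs) are only illustrative, and strand is ignored. foverlaps() does an interval join of all ranges against all ranges in one call:

```r
library(data.table)

# Toy data copied from the example in the question
genes <- data.table(
  gene  = paste0("gene", 1:5),
  chr   = "chr1",
  start = c(1L, 10L, 12L, 20L, 30L),
  end   = c(5L, 15L, 17L, 25L, 40L)
)

# foverlaps() needs the subject table keyed, with the interval columns last
subject <- copy(genes)
setkey(subject, chr, start, end)

# Rename the query's gene column so the two sides stay distinguishable
query <- copy(genes)
setnames(query, "gene", "query_gene")

# Every gene overlaps itself, so no unmatched rows appear;
# the self-matches are dropped afterwards
ov_pairs <- foverlaps(query, subject,
                      by.x = c("chr", "start", "end"), type = "any")
ov_pairs <- ov_pairs[gene != query_gene,
                     .(query_gene, overlapped_gene = gene)]
print(ov_pairs)
```

Each remaining row is one (gene, overlapping gene) pair, so genes without any overlap simply do not appear.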
>
> For example, assuming all_genes_gr is a GRanges object of ~50,000 genes, the result I want looks like this:
>
> gene_name  upstream_gene  downstream_gene  overlapped_gene
> gene1      NA             gene2            NA
> gene2      gene1          gene4            gene3
> gene3      gene1          gene4            gene2
> gene4      gene3          gene5            NA
>
> Currently, the strategy I use is as follows:
> library(GenomicRanges)
> find_overlapped_gene <- function(idx, all_genes_gr) {
>   #cat(idx, "\n")
>   curr_gene <- all_genes_gr[idx]
>   other_genes <- all_genes_gr[-idx]
>   # number of other genes that overlap the current gene
>   n <- countOverlaps(curr_gene, other_genes)
>   # the overlapping genes themselves: subset the others by the current gene
>   gene <- subsetByOverlaps(other_genes, curr_gene)
>   return(list(n, gene))
> }
>
> system.time(lapply(1:100, function(idx) find_overlapped_gene(idx, all_genes_gr)))
> However, for 100 genes it takes nearly 8 s according to system.time(). That means that for 50,000 genes it would take roughly an hour just to find the overlapping genes.
>
> I am wondering whether there is an algorithm or strategy to do this efficiently, perhaps 50,000 genes in ~10 min or even less.
>
With either of those packages I suspect this would run much faster than that for such a small dataset.
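A minimal sketch of a vectorized alternative to the per-gene loop, assuming the five example genes from your message stand in for all_genes_gr. The names overlapped and result are only illustrative, and I have not timed this on 50,000 genes. findOverlaps(), precede() and follow() are each called once on the whole object, and precede()/follow() take strand into account, so "upstream" and "downstream" here mean relative to the direction of transcription:

```r
library(GenomicRanges)

# Toy data copied from the example in the question
all_genes_gr <- GRanges(
  seqnames = "chr1",
  ranges   = IRanges(start = c(1L, 10L, 12L, 20L, 30L),
                     end   = c(5L, 15L, 17L, 25L, 40L)),
  strand   = "+"
)
names(all_genes_gr) <- paste0("gene", 1:5)
gene_names <- names(all_genes_gr)

# All pairwise overlaps in one call; drop each gene's hit against itself
hits <- findOverlaps(all_genes_gr, all_genes_gr)
hits <- hits[queryHits(hits) != subjectHits(hits)]

# Collapse the overlapping gene names per query gene (NA if none)
overlapped <- rep(NA_character_, length(all_genes_gr))
ov <- tapply(gene_names[subjectHits(hits)], queryHits(hits),
             paste, collapse = ",")
overlapped[as.integer(names(ov))] <- as.vector(ov)

# follow(): nearest non-overlapping gene that comes before each gene (upstream)
# precede(): nearest non-overlapping gene that comes after each gene (downstream)
result <- data.frame(
  gene_name       = gene_names,
  upstream_gene   = gene_names[follow(all_genes_gr, all_genes_gr)],
  downstream_gene = gene_names[precede(all_genes_gr, all_genes_gr)],
  overlapped_gene = overlapped,
  stringsAsFactors = FALSE
)
print(result)
```

The point of the design is that the work is done in three calls over the whole object instead of one subset-and-search per gene.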
--
David.
David Winsemius
Alameda, CA, USA