[BioC] Repeat masker sequences as GRanges object
Hervé Pagès
hpages at fhcrc.org
Fri Sep 12 23:44:20 CEST 2014
On 09/12/2014 12:29 PM, James W. MacDonald wrote:
> Pfft. Downloading. Ain't nobody got time fo' dat! ;-D
>
> library(RMySQL)
> library(GenomicRanges)
> con <- dbConnect("MySQL", user="genome", host="genome-mysql.cse.ucsc.edu
> <http://genome-mysql.cse.ucsc.edu>", db="hg19")
> rmsk <- dbGetQuery(con, "select * from rmsk;")
> rmsk <- makeGRangesFromDataFrame(df, keep.extra.columns=TRUE,
> seqnames.field="genoName",
> start.field="genoStart",
> end.field="genoEnd",
> strand.field="strand",
> starts.in.df.are.0based=TRUE)
Yeah, much better :) Thanks Jim. Although I guess you still kind of
"download" the data in a way.
The SQL query approach has at least 2 benefits that are really worth
emphasizing:
1) You get a data.frame with the col names already set for you.
2) You don't need to download the full table:
> dbGetQuery(con, "select * from rmsk limit 4;")
bin swScore milliDiv milliDel milliIns genoName genoStart genoEnd
genoLeft
1 585 1504 13 4 13 chr1 10000 10468
-249240153
2 585 3612 114 270 13 chr1 10468 11447
-249239174
3 585 437 235 186 35 chr1 11503 11675
-249238946
4 585 239 294 19 10 chr1 11677 11780
-249238841
strand repName repClass repFamily repStart repEnd repLeft id
1 + (CCCTAA)n Simple_repeat Simple_repeat 1 463 0 1
2 - TAR1 Satellite telo -399 1712 483 2
3 - L1MC LINE L1 -2236 5646 5449 3
4 - MER5B DNA hAT-Charlie -74 104 1 4
H.
>
> Best,
>
> Jim
>
>
>
>
> On Fri, Sep 12, 2014 at 2:08 PM, Hervé Pagès <hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>> wrote:
>
> Hi,
>
> On 09/12/2014 06:30 AM, James W. MacDonald wrote:
>
> Hi Hermann,
>
> How about this:
>
> library(AnnotationHub)
> hub <- AnnotationHub()
> hub$goldenpath.hg19.database.__rmsk_0.0.1.RData
>
> GRanges with 5298130 ranges and 2 metadata columns:
> seqnames ranges strand |
> name
> <Rle> <IRanges> <Rle> |
> <character>
> [1] chr1 [16777161, 16777470] + |
> AluSp
> [2] chr1 [25165801, 25166089] - |
> AluY
> [3] chr1 [33553607, 33554646] + |
> L2b
> [4] chr1 [50330064, 50332153] + |
> L1PA10
> [5] chr1 [58720068, 58720973] - |
> L1PA2
> ... ... ... ... ...
> ...
> [5298126] chr21_gl000210_random [25379, 25875] + |
> MER74B
> [5298127] chr21_gl000210_random [26438, 26596] - |
> MIRc
> [5298128] chr21_gl000210_random [26882, 27022] - |
> MIRc
> [5298129] chr21_gl000210_random [27297, 27447] + |
> HAL1-2a_MD
> [5298130] chr21_gl000210_random [27469, 27682] + |
> HAL1-2a_MD
> score
> <numeric>
> [1] 2147
> [2] 2626
> [3] 626
> [4] 12545
> [5] 8050
> ... ...
> [5298126] 1674
> [5298127] 308
> [5298128] 475
> [5298129] 371
> [5298130] 370
> ---
> seqlengths:
> chr1 chr2 ...
> chr18_gl000207_random
> 249250621 243199373 ...
> 4262
>
> This is a GRanges of all features from UCSC's Repeat Masker table.
>
>
> Nice to see the "name" and "score" metadata cols but I wonder why the
> original UCSC names for these cols (which are "repName" and "swScore")
> were not preserved. Also, other UCSC cols in the rmsk table at UCSC
> might be of interest (e.g. "repClass" and "repFamily").
>
> FWIW, here is one way to get these cols:
>
> local_file <- tempfile()
>
> download.file("http://__hgdownload.soe.ucsc.edu/__goldenPath/hg19/database/rmsk.__txt.gz
> <http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/rmsk.txt.gz>",
> local_file)
>
> ## Get the col names from
> ##
> http://hgdownload.soe.ucsc.__edu/goldenPath/hg19/database/__rmsk.sql
> <http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/rmsk.sql>
> COLNAMES <- c("bin", "swScore", "milliDiv", "milliDel", "milliIns",
> "genoName", "genoStart", "genoEnd", "genoLeft",
> "strand", "repName", "repClass", "repFamily",
> "repStart", "repEnd", "repLeft", "id")
>
> library(GenomicRanges)
> df <- read.table(local_file, col.names=COLNAMES)
> rmsk <- makeGRangesFromDataFrame(df, keep.extra.columns=TRUE,
> seqnames.field="genoName",
> start.field="genoStart",
> end.field="genoEnd",
> strand.field="strand",
> starts.in.df.are.0based=TRUE)
>
> Then:
>
> > head(rmsk)
> GRanges with 6 ranges and 13 metadata columns:
> seqnames ranges strand | bin swScore
> milliDiv milliDel
> <Rle> <IRanges> <Rle> | <integer> <integer>
> <integer> <integer>
> [1] chr1 [10001, 10468] + | 585 1504
> 13 4
> [2] chr1 [10469, 11447] - | 585 3612
> 114 270
> [3] chr1 [11504, 11675] - | 585 437
> 235 186
> [4] chr1 [11678, 11780] - | 585 239
> 294 19
> [5] chr1 [15265, 15355] - | 585 318
> 230 38
> [6] chr1 [16713, 16749] + | 585 203
> 162 0
> milliIns genoLeft repName repClass repFamily
> repStart
> <integer> <integer> <factor> <factor> <factor>
> <integer>
> [1] 13 -249240153 (CCCTAA)n Simple_repeat Simple_repeat
> 1
> [2] 13 -249239174 TAR1 Satellite telo
> -399
> [3] 35 -249238946 L1MC LINE L1
> -2236
> [4] 10 -249238841 MER5B DNA hAT-Charlie
> -74
> [5] 0 -249235266 MIR3 SINE MIR
> -119
> [6] 0 -249233872 (TGG)n Simple_repeat Simple_repeat
> 1
> repEnd repLeft id
> <integer> <integer> <integer>
> [1] 463 0 1
> [2] 1712 483 2
> [3] 5646 5449 3
> [4] 104 1 4
> [5] 143 49 5
> [6] 37 0 6
>
> I also tried with rtracklayer but got the following error:
>
> > library(rtracklayer)
> > session <- browserSession()
> > genome(session) <- "hg19"
> > query <- ucscTableQuery(session, "RepeatMasker", table="rmsk")
> > system.time(rmsk <- getTable(query))
> Error in scan(file, what, nmax, sep, dec, quote, skip, nlines,
> na.strings, :
> line 3813855 did not have 17 elements
> Timing stopped at: 551.027 7.24 1304.386
>
> Could be due to my flaky internet connection though...
>
> Cheers,
> H.
>
> > sessionInfo()
> R version 3.1.1 (2014-07-10)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets methods
> [8] base
>
> other attached packages:
> [1] rtracklayer_1.25.16 GenomicRanges_1.17.40 GenomeInfoDb_1.1.19
> [4] IRanges_1.99.28 S4Vectors_0.2.3 BiocGenerics_0.11.5
>
> loaded via a namespace (and not attached):
> [1] BatchJobs_1.3 BBmisc_1.7 BiocParallel_0.99.19
> [4] Biostrings_2.33.14 bitops_1.0-6 brew_1.0-6
> [7] checkmate_1.4 codetools_0.2-9 DBI_0.3.0
> [10] digest_0.6.4 fail_1.2 foreach_1.4.2
> [13] GenomicAlignments_1.1.29 iterators_1.0.7 RCurl_1.95-4.3
> [16] Rsamtools_1.17.33 RSQLite_0.11.4 sendmailR_1.1-2
> [19] stats4_3.1.1 stringr_0.6.2 tools_3.1.1
> [22] XML_3.98-1.1 XVector_0.5.8 zlibbioc_1.11.1
>
>
> Best,
>
> Jim
>
>
>
>
> On Thu, Sep 11, 2014 at 3:16 AM, Hermann Norpois
> <hnorpois at gmail.com <mailto:hnorpois at gmail.com>> wrote:
>
> Hello,
>
> I would like to have repeat sequences as GRanges object
> I started with ...
>
> library (BSgenome.Hsapiens.UCSC.hg19)
> ch1 <- Hsapiens$chr1
> active (masks (ch1))
> AGAPS AMB RM TRF
> TRUE TRUE FALSE FALSE
> active (masks(ch1))["RM"] <- TRUE
> active (masks (ch1))
> AGAPS AMB RM TRF
> TRUE TRUE TRUE FALSE
>
> Can anyboldy give me a hint how to continue.
>
> Thanks
> hermann
>
> [[alternative HTML version deleted]]
>
> _________________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
> https://stat.ethz.ch/mailman/__listinfo/bioconductor
> <https://stat.ethz.ch/mailman/listinfo/bioconductor>
> Search the archives:
> http://news.gmane.org/gmane.__science.biology.informatics.__conductor
> <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
>
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list