[BioC] Repeat masker sequences as GRanges object

Fri Sep 12 23:44:20 CEST 2014

On 09/12/2014 12:29 PM, James W. MacDonald wrote:
> Pfft. Downloading. Ain't nobody got time fo' dat! ;-D
>
> library(RMySQL)
> library(GenomicRanges)
> con <- dbConnect("MySQL", user="genome", host="genome-mysql.cse.ucsc.edu
> <http://genome-mysql.cse.ucsc.edu>", db="hg19")
> rmsk <- dbGetQuery(con, "select * from rmsk;")
>   rmsk <- makeGRangesFromDataFrame(df, keep.extra.columns=TRUE,
>                seqnames.field="genoName",
>                start.field="genoStart",
>                end.field="genoEnd",
>                strand.field="strand",
>                starts.in.df.are.0based=TRUE)

Yeah, much better :) Thanks Jim. Although I guess you still kind of
"download" the data in a way.

The SQL query approach has at least 2 benefits that are really worth
emphasizing:

   1) You get a data.frame with the col names already set for you.

   2) You don't need to download the full table:

   > dbGetQuery(con, "select * from rmsk limit 4;")
     bin swScore milliDiv milliDel milliIns genoName genoStart genoEnd 
  genoLeft
   1 585    1504       13        4       13     chr1     10000   10468 
-249240153
   2 585    3612      114      270       13     chr1     10468   11447 
-249239174
   3 585     437      235      186       35     chr1     11503   11675 
-249238946
   4 585     239      294       19       10     chr1     11677   11780 
-249238841
     strand   repName      repClass     repFamily repStart repEnd repLeft id
   1      + (CCCTAA)n Simple_repeat Simple_repeat        1    463       0  1
   2      -      TAR1     Satellite          telo     -399   1712     483  2
   3      -      L1MC          LINE            L1    -2236   5646    5449  3
   4      -     MER5B           DNA   hAT-Charlie      -74    104       1  4

H.

>
> Best,
>
> Jim
>
>
>
>
> On Fri, Sep 12, 2014 at 2:08 PM, Hervé Pagès <hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>> wrote:
>
>     Hi,
>
>     On 09/12/2014 06:30 AM, James W. MacDonald wrote:
>
>         Hi Hermann,
>
>         How about this:
>
>             library(AnnotationHub)
>             hub <- AnnotationHub()
>             hub$goldenpath.hg19.database.__rmsk_0.0.1.RData
>
>         GRanges with 5298130 ranges and 2 metadata columns:
>                                    seqnames               ranges strand   |
>         name
>                                       <Rle>            <IRanges>  <Rle>   |
>         <character>
>                   [1]                  chr1 [16777161, 16777470]      +   |
>         AluSp
>                   [2]                  chr1 [25165801, 25166089]      -   |
>         AluY
>                   [3]                  chr1 [33553607, 33554646]      +   |
>         L2b
>                   [4]                  chr1 [50330064, 50332153]      +   |
>         L1PA10
>                   [5]                  chr1 [58720068, 58720973]      -   |
>         L1PA2
>                   ...                   ...                  ...    ... ...
>         ...
>             [5298126] chr21_gl000210_random       [25379, 25875]      +   |
>         MER74B
>             [5298127] chr21_gl000210_random       [26438, 26596]      -   |
>         MIRc
>             [5298128] chr21_gl000210_random       [26882, 27022]      -   |
>         MIRc
>             [5298129] chr21_gl000210_random       [27297, 27447]      +   |
>         HAL1-2a_MD
>             [5298130] chr21_gl000210_random       [27469, 27682]      +   |
>         HAL1-2a_MD
>                           score
>                       <numeric>
>                   [1]      2147
>                   [2]      2626
>                   [3]       626
>                   [4]     12545
>                   [5]      8050
>                   ...       ...
>             [5298126]      1674
>             [5298127]       308
>             [5298128]       475
>             [5298129]       371
>             [5298130]       370
>             ---
>             seqlengths:
>                               chr1                  chr2 ...
>         chr18_gl000207_random
>                          249250621             243199373 ...
>                4262
>
>         This is a GRanges of all features from UCSC's Repeat Masker table.
>
>
>     Nice to see the "name" and "score" metadata cols but I wonder why the
>     original UCSC names for these cols (which are "repName" and "swScore")
>     were not preserved. Also, other UCSC cols in the rmsk table at UCSC
>     might be of interest (e.g. "repClass" and "repFamily").
>
>     FWIW, here is one way to get these cols:
>
>        local_file <- tempfile()
>
>     download.file("http://__hgdownload.soe.ucsc.edu/__goldenPath/hg19/database/rmsk.__txt.gz
>     <http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/rmsk.txt.gz>",
>     local_file)
>
>        ## Get the col names from
>        ##
>     http://hgdownload.soe.ucsc.__edu/goldenPath/hg19/database/__rmsk.sql
>     <http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/rmsk.sql>
>        COLNAMES <- c("bin", "swScore", "milliDiv", "milliDel", "milliIns",
>                      "genoName", "genoStart", "genoEnd", "genoLeft",
>                      "strand", "repName", "repClass", "repFamily",
>                      "repStart", "repEnd", "repLeft", "id")
>
>        library(GenomicRanges)
>        df <- read.table(local_file, col.names=COLNAMES)
>        rmsk <- makeGRangesFromDataFrame(df, keep.extra.columns=TRUE,
>                    seqnames.field="genoName",
>                    start.field="genoStart",
>                    end.field="genoEnd",
>                    strand.field="strand",
>                    starts.in.df.are.0based=TRUE)
>
>     Then:
>
>        > head(rmsk)
>        GRanges with 6 ranges and 13 metadata columns:
>              seqnames         ranges strand |       bin   swScore
>     milliDiv  milliDel
>                 <Rle>      <IRanges>  <Rle> | <integer> <integer>
>     <integer> <integer>
>          [1]     chr1 [10001, 10468]      + |       585      1504
>     13         4
>          [2]     chr1 [10469, 11447]      - |       585      3612
>       114       270
>          [3]     chr1 [11504, 11675]      - |       585       437
>       235       186
>          [4]     chr1 [11678, 11780]      - |       585       239
>       294        19
>          [5]     chr1 [15265, 15355]      - |       585       318
>       230        38
>          [6]     chr1 [16713, 16749]      + |       585       203
>       162         0
>               milliIns   genoLeft   repName      repClass     repFamily
>     repStart
>              <integer>  <integer>  <factor>      <factor>      <factor>
>     <integer>
>          [1]        13 -249240153 (CCCTAA)n Simple_repeat Simple_repeat
>         1
>          [2]        13 -249239174      TAR1     Satellite          telo
>     -399
>          [3]        35 -249238946      L1MC          LINE            L1
>     -2236
>          [4]        10 -249238841     MER5B           DNA   hAT-Charlie
>       -74
>          [5]         0 -249235266      MIR3          SINE           MIR
>     -119
>          [6]         0 -249233872    (TGG)n Simple_repeat Simple_repeat
>         1
>                 repEnd   repLeft        id
>              <integer> <integer> <integer>
>          [1]       463         0         1
>          [2]      1712       483         2
>          [3]      5646      5449         3
>          [4]       104         1         4
>          [5]       143        49         5
>          [6]        37         0         6
>
>     I also tried with rtracklayer but got the following error:
>
>        > library(rtracklayer)
>        > session <- browserSession()
>        > genome(session) <- "hg19"
>        > query <- ucscTableQuery(session, "RepeatMasker", table="rmsk")
>        > system.time(rmsk <- getTable(query))
>        Error in scan(file, what, nmax, sep, dec, quote, skip, nlines,
>     na.strings,  :
>          line 3813855 did not have 17 elements
>        Timing stopped at: 551.027 7.24 1304.386
>
>     Could be due to my flaky internet connection though...
>
>     Cheers,
>     H.
>
>      > sessionInfo()
>     R version 3.1.1 (2014-07-10)
>     Platform: x86_64-unknown-linux-gnu (64-bit)
>
>     locale:
>       [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>       [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>       [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>       [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>       [9] LC_ADDRESS=C               LC_TELEPHONE=C
>     [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
>     attached base packages:
>     [1] parallel  stats     graphics  grDevices utils     datasets  methods
>     [8] base
>
>     other attached packages:
>     [1] rtracklayer_1.25.16   GenomicRanges_1.17.40 GenomeInfoDb_1.1.19
>     [4] IRanges_1.99.28       S4Vectors_0.2.3       BiocGenerics_0.11.5
>
>     loaded via a namespace (and not attached):
>       [1] BatchJobs_1.3            BBmisc_1.7 BiocParallel_0.99.19
>       [4] Biostrings_2.33.14       bitops_1.0-6             brew_1.0-6
>       [7] checkmate_1.4            codetools_0.2-9          DBI_0.3.0
>     [10] digest_0.6.4             fail_1.2                 foreach_1.4.2
>     [13] GenomicAlignments_1.1.29 iterators_1.0.7          RCurl_1.95-4.3
>     [16] Rsamtools_1.17.33        RSQLite_0.11.4           sendmailR_1.1-2
>     [19] stats4_3.1.1             stringr_0.6.2            tools_3.1.1
>     [22] XML_3.98-1.1             XVector_0.5.8            zlibbioc_1.11.1
>
>
>         Best,
>
>         Jim
>
>
>
>
>         On Thu, Sep 11, 2014 at 3:16 AM, Hermann Norpois
>         <hnorpois at gmail.com <mailto:hnorpois at gmail.com>> wrote:
>
>             Hello,
>
>             I would like to have repeat sequences as GRanges object
>             I started with ...
>
>             library (BSgenome.Hsapiens.UCSC.hg19)
>             ch1 <- Hsapiens$chr1
>             active (masks (ch1))
>             AGAPS   AMB    RM   TRF
>                TRUE  TRUE FALSE FALSE
>             active (masks(ch1))["RM"] <- TRUE
>             active (masks (ch1))
>             AGAPS   AMB    RM   TRF
>                TRUE  TRUE  TRUE FALSE
>
>             Can anyboldy give me a hint how to continue.
>
>             Thanks
>             hermann
>
>                       [[alternative HTML version deleted]]
>
>             _________________________________________________
>             Bioconductor mailing list
>             Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>             https://stat.ethz.ch/mailman/__listinfo/bioconductor
>             <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>             Search the archives:
>             http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>             <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
>
>
>
>     --
>     Hervé Pagès
>
>     Program in Computational Biology
>     Division of Public Health Sciences
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M1-B514
>     P.O. Box 19024
>     Seattle, WA 98109-1024
>
>     E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>     Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>     Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
>
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319