[BioC] Repeat masker sequences as GRanges object
Hervé Pagès
hpages at fhcrc.org
Fri Sep 12 20:08:10 CEST 2014
Hi,
On 09/12/2014 06:30 AM, James W. MacDonald wrote:
> Hi Hermann,
>
> How about this:
>
>> library(AnnotationHub)
>> hub <- AnnotationHub()
>> hub$goldenpath.hg19.database.rmsk_0.0.1.RData
> GRanges with 5298130 ranges and 2 metadata columns:
> seqnames ranges strand |
> name
> <Rle> <IRanges> <Rle> |
> <character>
> [1] chr1 [16777161, 16777470] + |
> AluSp
> [2] chr1 [25165801, 25166089] - |
> AluY
> [3] chr1 [33553607, 33554646] + |
> L2b
> [4] chr1 [50330064, 50332153] + |
> L1PA10
> [5] chr1 [58720068, 58720973] - |
> L1PA2
> ... ... ... ... ...
> ...
> [5298126] chr21_gl000210_random [25379, 25875] + |
> MER74B
> [5298127] chr21_gl000210_random [26438, 26596] - |
> MIRc
> [5298128] chr21_gl000210_random [26882, 27022] - |
> MIRc
> [5298129] chr21_gl000210_random [27297, 27447] + |
> HAL1-2a_MD
> [5298130] chr21_gl000210_random [27469, 27682] + |
> HAL1-2a_MD
> score
> <numeric>
> [1] 2147
> [2] 2626
> [3] 626
> [4] 12545
> [5] 8050
> ... ...
> [5298126] 1674
> [5298127] 308
> [5298128] 475
> [5298129] 371
> [5298130] 370
> ---
> seqlengths:
> chr1 chr2 ... chr18_gl000207_random
> 249250621 243199373 ... 4262
>
> This is a GRanges of all features from UCSC's Repeat Masker table.
Nice to see the "name" and "score" metadata cols but I wonder why the
original UCSC names for these cols (which are "repName" and "swScore")
were not preserved. Also, other UCSC cols in the rmsk table at UCSC
might be of interest (e.g. "repClass" and "repFamily").
FWIW, here is one way to get these cols:
local_file <- tempfile()
download.file("http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/rmsk.txt.gz",
local_file)
## Get the col names from
## http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/rmsk.sql
COLNAMES <- c("bin", "swScore", "milliDiv", "milliDel", "milliIns",
"genoName", "genoStart", "genoEnd", "genoLeft",
"strand", "repName", "repClass", "repFamily",
"repStart", "repEnd", "repLeft", "id")
library(GenomicRanges)
df <- read.table(local_file, col.names=COLNAMES)
rmsk <- makeGRangesFromDataFrame(df, keep.extra.columns=TRUE,
seqnames.field="genoName",
start.field="genoStart",
end.field="genoEnd",
strand.field="strand",
starts.in.df.are.0based=TRUE)
Then:
> head(rmsk)
GRanges with 6 ranges and 13 metadata columns:
seqnames ranges strand | bin swScore milliDiv
milliDel
<Rle> <IRanges> <Rle> | <integer> <integer> <integer>
<integer>
[1] chr1 [10001, 10468] + | 585 1504 13
4
[2] chr1 [10469, 11447] - | 585 3612 114
270
[3] chr1 [11504, 11675] - | 585 437 235
186
[4] chr1 [11678, 11780] - | 585 239 294
19
[5] chr1 [15265, 15355] - | 585 318 230
38
[6] chr1 [16713, 16749] + | 585 203 162
0
milliIns genoLeft repName repClass repFamily
repStart
<integer> <integer> <factor> <factor> <factor>
<integer>
[1] 13 -249240153 (CCCTAA)n Simple_repeat Simple_repeat
1
[2] 13 -249239174 TAR1 Satellite telo
-399
[3] 35 -249238946 L1MC LINE L1
-2236
[4] 10 -249238841 MER5B DNA hAT-Charlie
-74
[5] 0 -249235266 MIR3 SINE MIR
-119
[6] 0 -249233872 (TGG)n Simple_repeat Simple_repeat
1
repEnd repLeft id
<integer> <integer> <integer>
[1] 463 0 1
[2] 1712 483 2
[3] 5646 5449 3
[4] 104 1 4
[5] 143 49 5
[6] 37 0 6
I also tried with rtracklayer but got the following error:
> library(rtracklayer)
> session <- browserSession()
> genome(session) <- "hg19"
> query <- ucscTableQuery(session, "RepeatMasker", table="rmsk")
> system.time(rmsk <- getTable(query))
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines,
na.strings, :
line 3813855 did not have 17 elements
Timing stopped at: 551.027 7.24 1304.386
Could be due to my flaky internet connection though...
Cheers,
H.
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] rtracklayer_1.25.16 GenomicRanges_1.17.40 GenomeInfoDb_1.1.19
[4] IRanges_1.99.28 S4Vectors_0.2.3 BiocGenerics_0.11.5
loaded via a namespace (and not attached):
[1] BatchJobs_1.3 BBmisc_1.7
BiocParallel_0.99.19
[4] Biostrings_2.33.14 bitops_1.0-6 brew_1.0-6
[7] checkmate_1.4 codetools_0.2-9 DBI_0.3.0
[10] digest_0.6.4 fail_1.2 foreach_1.4.2
[13] GenomicAlignments_1.1.29 iterators_1.0.7 RCurl_1.95-4.3
[16] Rsamtools_1.17.33 RSQLite_0.11.4 sendmailR_1.1-2
[19] stats4_3.1.1 stringr_0.6.2 tools_3.1.1
[22] XML_3.98-1.1 XVector_0.5.8 zlibbioc_1.11.1
>
> Best,
>
> Jim
>
>
>
>
> On Thu, Sep 11, 2014 at 3:16 AM, Hermann Norpois <hnorpois at gmail.com> wrote:
>
>> Hello,
>>
>> I would like to have repeat sequences as GRanges object
>> I started with ...
>>
>> library (BSgenome.Hsapiens.UCSC.hg19)
>> ch1 <- Hsapiens$chr1
>> active (masks (ch1))
>> AGAPS AMB RM TRF
>> TRUE TRUE FALSE FALSE
>> active (masks(ch1))["RM"] <- TRUE
>> active (masks (ch1))
>> AGAPS AMB RM TRF
>> TRUE TRUE TRUE FALSE
>>
>> Can anyboldy give me a hint how to continue.
>>
>> Thanks
>> hermann
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list