[R] How to parse a string (by a "new" markup) with R ?
Gabor Grothendieck
ggrothendieck at gmail.com
Tue Mar 16 15:24:17 CET 2010
We show how to use the gsubfn package to parse this.
The rules are not entirely clear so we will assume the following:
- there is a fixed template for the output, which is the same as your
output but possibly with different character strings filled in. This
implies, for example, that there are exactly four stems (Stem0, Stem1,
Stem2 and Stem3), no fewer and no more.
- the sequence always starts with the open of Stem0, at least one dot
and the open of Stem1. There are no dots prior to the open of Stem0.
This seems to be implicit in your sample output since there is no zero
length string in your sample output corresponding to dots prior to
Stem0.
- Stem0 closes with the same number of < as there are > to open it.

You can modify this yourself to take into account the actual rules,
whatever they are.
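
Before parsing, it may be worth checking that an input is at least consistent with the assumptions listed above. The following is only a sketch in base R. The variable names are made up for illustration, and the balance check additionally assumes that each of Stems 1 to 3 closes with as many < as it opened with, which holds for the sample but is not stated in the original question.

```r
Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."

# one structure character per sequence character
same_length <- nchar(Seq) == nchar(Str)

# only the three markup characters occur
only_markup <- grepl("^[<>.]+$", Str)

# open of Stem0, at least one dot, then the open of Stem1
starts_ok <- grepl("^>+[.]+>", Str)

# as many closing < as opening > overall (extra assumption, see above)
n_open  <- nchar(gsub("[^>]", "", Str))
n_close <- nchar(gsub("[^<]", "", Str))
balanced <- n_open == n_close

c(same_length, only_markup, starts_ok, balanced)
```

These checks are necessary rather than sufficient: a string can pass all four and still not fit the four-stem template.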
We first use strapply to calculate k, the number of leading >'s.
Then we replace the leading k >'s with }'s and the trailing k <'s with
{'s, giving us Str3:
"}}}}}}}..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<{{{{{{{."
We again use strapply, this time to get the lengths of the runs. Note that
zero-length runs are possible, so we cannot simply use rle for this. For
example, there is a zero-length run of dots between the last < and the first {.
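
To make the relabelling step and the rle caveat concrete, here is a small base R sketch. It is not the gsubfn code used in the solution itself: regexpr and strrep stand in for strapply and the reps helper, and it assumes Str really does start with at least one >.

```r
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."

# k = length of the leading run of ">" (assumes such a run exists)
k <- attr(regexpr("^>+", Str), "match.length")

# relabel the leading k ">" as "}" ...
Str2 <- sub("^>+", strrep("}", k), Str)

# ... and the last k "<" (followed only by non-"<" characters) as "{"
close_pat <- paste0(strrep("<", k), "([^<]*)$")
Str3 <- sub(close_pat, paste0(strrep("{", k), "\\1"), Str2)

# rle only reports runs that actually occur, so the empty run of dots
# between the last "<" and the first "{" is silently dropped
runs <- rle(strsplit(Str3, "")[[1]])
length(runs$lengths)   # one fewer than the 16 fields of the pattern
runs$values[13:14]     # "<" is followed directly by "{"
```

Because the empty field is missing, the run lengths from rle would be shifted by one position relative to the template from that point on.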
read.fwf is used to actually parse out the strings using the lengths we just
calculated.
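
As an aside, the same fixed-width split can be sketched without read.fwf by using substring with cumulative offsets. This is an alternative technique, not the one used in the solution, and the lens vector is written out by hand from the sample input rather than computed.

```r
Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"

# field widths for the sample, hard-coded for illustration;
# the 14th field (dots after Stem 3) has width zero
lens <- c(7, 2, 4, 8, 4, 1, 5, 7, 5, 5, 5, 7, 5, 0, 7, 1)

ends   <- cumsum(lens)
starts <- ends - lens + 1

# a zero-width field has start = end + 1, for which substring returns ""
tokens <- substring(Seq, starts, ends)
tokens
```

With this approach empty fields come back as "" directly, so no NA replacement is needed afterwards.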
Finally we fill in the template using relist.
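
For readers who have not used relist before, here is a toy example, separate from the RNA data, of how it pours a flat vector into the shape of a nested list. The names skeleton and flesh are just illustrative.

```r
# relist lives in utils, which is attached by default in interactive sessions
library(utils)

# a nested "shape" with three leaves
skeleton <- list(x = "", y = list(p = "", q = ""))

# a flat vector with one value per leaf, in order
flesh <- c("a", "b", "c")

filled <- relist(flesh, skeleton)
str(filled)
```

The leaves are filled in the order in which unlist would visit them, which is why the order of the fields in the template has to match the order of the tokens.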
# inputs
Seq <-
"GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
Str <-
">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."
template <-
list(
  "Stem 0 opening" = "",
  "before Stem 1" = "",
  "Stem 1" = list(opening = "",
                  inside = "",
                  closing = ""
  ),
  "between Stem 1 and 2" = "",
  "Stem 2" = list(opening = "",
                  inside = "",
                  closing = ""
  ),
  "between Stem 2 and 3" = "",
  "Stem 3" = list(opening = "",
                  inside = "",
                  closing = ""
  ),
  "After Stem 3" = "",
  "Stem 0 closing" = ""
)
# processing
# create string made by repeating string s k times followed by more
reps <- function(s, k, more = "") {
  paste(paste(rep(s, k), collapse = ""), more, sep = "")
}
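library(gsubfn)
# k = number of leading >'s (the opening of Stem 0)
k <- nchar(strapply(Str, "^>+", c)[[1]])
# relabel Stem 0: leading k >'s become }'s, trailing k <'s become {'s
Str2 <- sub("^>+", reps("}", k), Str)
Str3 <- sub(reps("<", k, "([^<]*)$"), reps("{", k, "\\1"), Str2)
# one capture group per template field, plus one for any trailing dots
pat <-
"^(}*)([.]*)(>*)([.]*)(<*)([.]*)(>*)([.]*)(<*)([.]*)(>*)([.]*)(<*)([.]*)({*)([.]*)$"
# width of each field; zero widths are kept
lens <- sapply(strapply(Str3, pat, c)[[1]], nchar)
# cut Seq into fields of those widths
tokens <- unlist(read.fwf(textConnection(Seq), lens, as.is = TRUE))
closeAllConnections()
# replace NA (empty fields) with the empty string
tokens[is.na(tokens)] <- ""
out <- relist(tokens, template)
out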
Here is the str of the output for your sample input:
> str(out)
List of 9
 $ Stem 0 opening      : chr "GCCTCGA"
 $ before Stem 1       : chr "TA"
 $ Stem 1              :List of 3
  ..$ opening: chr "GCTC"
  ..$ inside : chr "AGTTGGGA"
  ..$ closing: chr "GAGC"
 $ between Stem 1 and 2: chr "G"
 $ Stem 2              :List of 3
  ..$ opening: chr "TACGA"
  ..$ inside : chr "CTGAAGA"
  ..$ closing: chr "TCGTA"
 $ between Stem 2 and 3: chr "AGGtC"
 $ Stem 3              :List of 3
  ..$ opening: chr "ACCAG"
  ..$ inside : chr "TTCGATC"
  ..$ closing: chr "CTGGT"
 $ After Stem 3        : chr ""
 $ Stem 0 closing      : chr "TCGGGGC"
On Tue, Mar 16, 2010 at 6:10 AM, Tal Galili <tal.galili at gmail.com> wrote:
> Hello all,
>
> For some work I am doing on RNA, I want to use R to do string parsing that
> (I think) is like a simplistic HTML parsing.
>
>
> For example, let's say we have the following two variables:
>
> Seq <-
> "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
> Str <-
> ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."
>
> Say that I want to parse "Seq" according to "Str", by using the legend here
>
> Seq: GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA
> Str: >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.
>      |     |  |              | |               |     |               ||     |
>      +-----+  +--------------+ +---------------+     +---------------++-----+
>         |          Stem 1           Stem 2                Stem 3         |
>         |                                                                |
>         +----------------------------------------------------------------+
>                                       Stem 0
>
> Assume that we always have 4 stems (0 to 3), but that the length of letters
> before and after each of them can vary.
>
> The output should be something like the following list structure:
>
>
> list(
>   "Stem 0 opening" = "GCCTCGA",
>   "before Stem 1" = "TA",
>   "Stem 1" = list(opening = "GCTC",
>                   inside = "AGTTGGGA",
>                   closing = "GAGC"
>   ),
>   "between Stem 1 and 2" = "G",
>   "Stem 2" = list(opening = "TACGA",
>                   inside = "CTGAAGA",
>                   closing = "TCGTA"
>   ),
>   "between Stem 2 and 3" = "AGGtC",
>   "Stem 3" = list(opening = "ACCAG",
>                   inside = "TTCGATC",
>                   closing = "CTGGT"
>   ),
>   "After Stem 3" = "",
>   "Stem 0 closing" = "TCGGGGC"
> )
>
>
> I don't have any experience with programming a parser, and would like
> advice as to what strategy to use when programming something like this (and
> any recommended R commands to use).
>
>
> What I was thinking of is to first get rid of the "Stem 0", then go through
> the inner string with a recursive function (let's call it "seperate.stem")
> that each time will split the string into:
> 1. before stem
> 2. opening stem
> 3. inside stem
> 4. closing stem
> 5. after stem
>
> Where the "after stem" will then be recursively entered into the same
> function ("seperate.stem")
>
> The thing is that I am not sure how to try and do this coding without using
> a loop.
>
> Any advice would be most welcome.
>
>
>