[R] FW: How to parse a string (by a "new" markup) with R ?

Tue Mar 16 23:51:22 CET 2010

A version using regular expressions with the gregexpr(), substring() and substr() functions is attached.
Everything is packed into the splitSeq() function (chunk 14 in the attached file).

Seq<- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
Str<- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."

x<-splitSeq(Seq,Str)
x

        stem0     stem1      stem2     stem3     stem4    
before  ""        ""         ""        ""        ""       
opening "GCCTCGA" "GCTC"     "TACGA"   "ACCAG"   ""       
inside  ""        "AGTTGGGA" "CTGAAGA" "TTCGATC" ""       
closing ""        "GAGC"     "TCGTA"   "CTGGT"   "TCGGGGC"
after   "TA"      "G"        "AGGtC"   ""        "A"   

# You can make a list with

lapply(apply(x,2,list),FUN=function(x) as.list(unlist(x)))
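
As a possible alternative, here is a minimal sketch that builds one named list per stem by looping over the column names. The small matrix is only a stand-in for the splitSeq() result (two stems), so that the snippet runs on its own; the name "stems" is just illustrative.

```r
# Stand-in for the matrix returned by splitSeq() (first two stems only)
x <- matrix(c("", "GCCTCGA", "",         "",     "TA",
              "", "GCTC",    "AGTTGGGA", "GAGC", "G"),
            nrow = 5,
            dimnames = list(c("before", "opening", "inside", "closing", "after"),
                            c("stem0", "stem1")))

# One list per column; as.list() keeps the row names as element names,
# and sapply(..., simplify = FALSE) names the result by the column names
stems <- sapply(colnames(x), function(j) as.list(x[, j]), simplify = FALSE)

stems$stem1$inside
```

Each part can then be reached by name, for example stems$stem0$opening.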

###################################################
### chunk number 14: splitSeq
###################################################
splitSeq <- function(Seq, Str) {
  #
  #  Functions
  #
  # getStem: pieces of Seq (column X) and of Str (column str) at the
  # positions where pattern matches Str; one row of "" if nothing matches
  getStem <- function(pattern, Seq, Str) {
    id <- gregexpr(pattern, Str)
    X <- sapply(id, FUN = function(x, Seq) substring(Seq, x, x + attr(x, "match.length") - 1), Seq = Seq)
    str <- sapply(id, FUN = function(x, Seq) substring(Seq, x, x + attr(x, "match.length") - 1), Seq = Str)
    str <- str[X != ""]
    X <- X[X != ""]
    if (length(X) == 0) X <- str <- ""
    return(cbind(X, str))
  }
  #
  # splitStem: cut one stem (x[1] = sequence, x[2] = structure) into
  # before / opening / inside / closing / after, peeling each part
  # off the front of X and str in turn
  splitStem <- function(x) {
    str <- x[2]
    X <- x[1]
    before <- getStem("^[.]+", X, str)[1]
    X <- substr(X, nchar(before) + 1, nchar(X))
    str <- substr(str, nchar(before) + 1, nchar(str))
    opening <- getStem("^>+", X, str)[1]
    X <- substr(X, nchar(opening) + 1, nchar(X))
    str <- substr(str, nchar(opening) + 1, nchar(str))
    inside <- getStem("^[.]*", X, str)[1]
    X <- substr(X, nchar(inside) + 1, nchar(X))
    str <- substr(str, nchar(inside) + 1, nchar(str))
    closing <- getStem("^<+", X, str)[1]
    X <- substr(X, nchar(closing) + 1, nchar(X))
    str <- substr(str, nchar(closing) + 1, nchar(str))
    after <- getStem("^[.]*$", X, str)[1]
    return(c(before = before,
             opening = opening,
             inside = inside,
             closing = closing,
             after = after))
  }
  #
  ##### main part
  #
  # split sequence into stems
  # (stem 0 is assumed to open and close with exactly 7 brackets, as in the example)
  #
  stem0 <- getStem("^[.]*>{7}[.]*", Seq, Str)
  stem4 <- getStem("[.]*<{7}[.]*$", Seq, Str)
  str <- substring(Str, nchar(stem0[1]) + 1, nchar(Str) - nchar(stem4[1]))
  seq <- substring(Seq, nchar(stem0[1]) + 1, nchar(Seq) - nchar(stem4[1]))
  stems <- getStem("[.]*>+[.]+<+[.]*", seq, str)
  stems <- rbind(stem0, stems, stem4)
  #
  #  make parts
  #
  parts <- apply(stems, 1, splitStem)
  #
  # correct position of after string:
  # the dots following the opening of stem 0 belong to "after", not "inside"
  dimnames(parts)[[2]] <- paste("stem", 0:4, sep = "")
  parts["after", 1] <- parts["inside", 1]
  parts["inside", 1] <- ""
  #
  return(parts)
}
############################################################
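
The patterns for stem 0 above are hard-coded to 7 brackets. A minimal sketch of how the count could be taken from Str instead, assuming that Str starts with optional dots followed by the opening of stem 0, and that stem 0 closes with the same number of "<". The helper countOpening() and the names pat0 and pat4 are hypothetical and not part of the attached file.

```r
# Hypothetical helper: number of ">" that open stem 0, assuming Str starts
# with optional dots followed by the opening run of stem 0
countOpening <- function(Str) {
  m <- regexpr("^[.]*>+", Str)      # leading dots plus the first run of ">"
  run <- regmatches(Str, m)         # character(0) if there is no match
  if (length(run) == 0) return(0L)
  nchar(gsub("[.]", "", run))       # drop the dots, count the ">"
}

Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."
k <- countOpening(Str)

# Patterns that could stand in for the hard-coded ones in splitSeq()
pat0 <- sprintf("^[.]*>{%d}[.]*", k)
pat4 <- sprintf("[.]*<{%d}[.]*$", k)
```

The two patterns could then be passed to getStem() in place of the fixed strings; the rest of splitSeq() would stay as it is.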

Andrej

--
Andrej Blejec
National Institute of Biology
Vecna pot 111 POB 141
SI-1000 Ljubljana
SLOVENIA
e-mail: andrej.blejec at nib.si
URL: http://ablejec.nib.si
tel: + 386 (0)59 232 789
fax: + 386 1 241 29 80
--------------------------
Local Organizer of ICOTS-8
International Conference on Teaching Statistics http://icots8.org

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
> On Behalf Of Gabor Grothendieck
> Sent: Tuesday, March 16, 2010 3:24 PM
> To: Tal Galili
> Cc: r-help at r-project.org; seqinr-forum at r-forge.wu-wien.ac.at
> Subject: Re: [R] How to parse a string (by a "new" markup) with R ?
> 
> We show how to use the gsubfn package to parse this.
> 
> The rules are not entirely clear so we will assume the following:
> 
> - there is a fixed template for the output which is the same as your 
> output but possibly with different character strings filled in.  This 
> implies, for example, that there are exactly Stem0, Stem1, Stem2 and
> Stem3 and no fewer or more stems.
> 
> - the sequence always starts with the open of Stem0, at least one dot 
> and the open of Stem1.  There are no dots prior to the open of Stem0.
> This seems to be implicit in your sample output since there is no zero 
> length string in your sample output corresponding to dots prior to 
> Stem0.
> 
> - Stem0 closes with the same number of < as there are > to open it
> 
> You can modify this yourself to take into account the actual rules 
> whatever they are.
> 
> We first calculate, k, the number of leading >'s using strapply.
> 
> Then we replace the leading k >'s with }'s and the trailing k <'s with 
> {'s giving us Str3:
> 
> "}}}}}}}..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<{{{{{{{."
> 
> We again use strapply, this time to get the lengths of the runs.  Note 
> that zero-length runs are possible, so we cannot, for example, use rle 
> for this: there is a zero-length run of dots between the last < and 
> the first {.
> 
> read.fwf is used to actually parse out the strings using the lengths 
> we just calculated.
> 
> Finally we fill in the template using relist.
> 
> # inputs
> 
> Seq <-
> "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
> Str <-
> ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."
> template <-
>   list(
>     "Stem 0 opening" = "",
>     "before Stem 1" = "",
>     "Stem 1" = list(opening = "",
>     inside = "",
>     closing = ""
>     ),
>     "between Stem 1 and 2" = "",
>     "Stem 2" = list(opening = "",
>     inside = "",
>     closing = ""
>     ),
>     "between Stem 2 and 3" = "",
>     "Stem 3" = list(opening = "",
>     inside = "",
>     closing = ""
>     ),
>     "After Stem 3" = "",
>     "Stem 0 closing" = ""
>    )
> 
> # processing
> 
> # create string made by repeating string s k times followed by more
> reps <- function(s, k, more = "") {
>   paste(paste(rep(s, k), collapse = ""), more, sep = "")
> }
> 
> library(gsubfn)
> k <- nchar(strapply(Str, "^>+", c)[[1]])
> Str2 <- sub("^>+", reps("}", k), Str)
> Str3 <- sub(reps("<", k, "([^<]*)$"), reps("{", k, "\\1"), Str2)
> 
> pat <-
> "^(}*)([.]*)(>*)([.]*)(<*)([.]*)(>*)([.]*)(<*)([.]*)(>*)([.]*)(<*)([.]*)({*)([.]*)$"
> lens <- sapply(strapply(Str3, pat, c)[[1]], nchar)
> tokens <- unlist(read.fwf(textConnection(Seq), lens, as.is = TRUE))
> closeAllConnections()
> tokens[is.na(tokens)] <- ""
> out <- relist(tokens, template)
> out
> 
> 
> Here is the str of the output for your sample input:
> 
> > str(out)
> List of 9
>  $ Stem 0 opening      : chr "GCCTCGA"
>  $ before Stem 1       : chr "TA"
>  $ Stem 1              :List of 3
>   ..$ opening: chr "GCTC"
>   ..$ inside : chr "AGTTGGGA"
>   ..$ closing: chr "GAGC"
>  $ between Stem 1 and 2: chr "G"
>  $ Stem 2              :List of 3
>   ..$ opening: chr "TACGA"
>   ..$ inside : chr "CTGAAGA"
>   ..$ closing: chr "TCGTA"
>  $ between Stem 2 and 3: chr "AGGtC"
>  $ Stem 3              :List of 3
>   ..$ opening: chr "ACCAG"
>   ..$ inside : chr "TTCGATC"
>   ..$ closing: chr "CTGGT"
>  $ After Stem 3        : chr ""
>  $ Stem 0 closing      : chr "TCGGGGC"
> 
> 
> 
> On Tue, Mar 16, 2010 at 6:10 AM, Tal Galili <tal.galili at gmail.com> wrote:
> > Hello all,
> >
> > For some work I am doing on RNA, I want to use R to do string parsing that
> > (I think) is like simplistic HTML parsing.
> >
> >
> > For example, let's say we have the following two variables:
> >
> >    Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
> >    Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."
> >
> > Say that I want to parse "Seq" according to "Str", by using the legend here:
> >
> > Seq: GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA
> > Str: >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.
> >      |     |  |              | |               |     |               ||     |
> >      +-----+  +--------------+ +---------------+     +---------------++-----+
> >         |        Stem 1            Stem 2                 Stem 3         |
> >         |                                                                |
> >         +----------------------------------------------------------------+
> >                                 Stem 0
> > Assume that we always have 4 stems (0 to 3), but that the number of letters
> > before and after each of them can vary.
> >
> > The output should be something like the following list structure:
> >
> >
> >    list(
> >     "Stem 0 opening" = "GCCTCGA",
> >     "before Stem 1" = "TA",
> >     "Stem 1" = list(opening = "GCTC",
> >     inside = "AGTTGGGA",
> >     closing = "GAGC"
> >     ),
> >     "between Stem 1 and 2" = "G",
> >     "Stem 2" = list(opening = "TACGA",
> >     inside = "CTGAAGA",
> >     closing = "TCGTA"
> >     ),
> >     "between Stem 2 and 3" = "AGGtC",
> >     "Stem 3" = list(opening = "ACCAG",
> >     inside = "TTCGATC",
> >     closing = "CTGGT"
> >     ),
> >     "After Stem 3" = "",
> >     "Stem 0 closing" = "TCGGGGC"
> >    )
> >
> >
> > I don't have any experience with programming a parser, and would like
> > advice as to what strategy to use when programming something like this
> > (and any recommended R commands to use).
> >
> >
> > What I was thinking of is to first get rid of the "Stem 0", then go through
> > the inner string with a recursive function (let's call it "seperate.stem")
> > that each time will split the string into:
> > 1. before stem
> > 2. opening stem
> > 3. inside stem
> > 4. closing stem
> > 5. after stem
> >
> > Where the "after stem" will then be recursively entered into the 
> > same function ("seperate.stem")
> >
> > The thing is that I am not sure how to do this coding without using
> > a loop.
> >
> > Any advice would be most welcome.
> >
> >
> > ----------------Contact Details:-------------------------------------------------------
> > Contact me: Tal.Galili at gmail.com |  972-52-7275845
> > Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
> > www.r-statistics.com (English)
> > ----------------------------------------------------------------------------------------------
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >