[R] Regex Split?
Leonard Mada
|eo@m@d@ @end|ng |rom @yon|c@eu
Sat May 6 01:05:56 CEST 2023
Dear Bert,
Thank you for the suggestion. Indeed, there are various solutions and
workarounds. However, there is still a bug in strsplit.
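To make the odd output easier to reason about, here is a minimal sketch that emulates the loop described in ?strsplit: find a match, emit the text to its left, drop everything up to the end of the match, and search the remainder afresh. The helper name split_like_strsplit is hypothetical, and the handling of a zero-length match at the very start of the remainder (emit one character) is my assumption about the implementation. If this reading is right, a lookbehind such as (?<=,) never sees the comma, because the comma has already been removed from the remainder.

```r
# Hypothetical helper: a rough emulation of the loop described in ?strsplit.
# It is an illustration only, not R's actual C implementation.
split_like_strsplit <- function(x, pattern) {
  out <- character(0)
  while (nchar(x) > 0) {
    m <- regexpr(pattern, x, perl = TRUE)
    start <- as.integer(m)
    if (start == -1L) {
      # No match left: the remainder is the last token.
      out <- c(out, x)
      break
    }
    end <- start + attr(m, "match.length") - 1L  # index of last matched char
    if (end > 0L) {
      # Emit the text left of the match, then drop it together with the match.
      out <- c(out, substr(x, 1L, start - 1L))
      x <- substr(x, end + 1L, nchar(x))
    } else {
      # Assumption: a zero-length match at position 1 emits one character.
      out <- c(out, substr(x, 1L, 1L))
      x <- substr(x, 2L, nchar(x))
    }
  }
  out
}

res <- split_like_strsplit("a bc,def, adef ,,gh", " |(?=,)|(?<=,)(?![ ])")
print(res)
# "a" "bc" "," "def" "," "" "adef" "," "," "gh"
```

The "" after the second "," appears because the space is matched at the very start of the remainder " adef ,,gh", so the text to its left is empty.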
2.) gsub
I would try to avoid gsub on a Wikipedia-sized corpus: using strsplit
directly should be far more efficient.
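A rough way to check this on one's own data is to time both approaches. The sketch below uses a hypothetical stand-in corpus (one short line repeated many times). The two calls do not return identical tokens, since the one-step split inherits the strsplit behaviour mentioned above, and the timings depend on the machine and on the real text, so this is only a template.

```r
# Hypothetical stand-in for a large corpus: one short line repeated many times.
corpus <- rep("a bc,def, adef,x; ,,gh", 10000)

# Two steps: pad every punctuation mark with spaces, then split on runs of spaces.
t_two_step <- system.time(
  tok_two_step <- strsplit(gsub("([[:punct:]])", " \\1 ", corpus), " +")
)

# One step: split on a space or (zero-width) just before a punctuation mark.
t_one_step <- system.time(
  tok_one_step <- strsplit(corpus, " |(?=[[:punct:]])", perl = TRUE)
)

# Elapsed seconds for each approach.
print(c(two_step = t_two_step[["elapsed"]], one_step = t_one_step[["elapsed"]]))
```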
3.) Punctuation marks
Abbreviations and "word1-word2" may be a problem:
gsub("(?<ThePunct>[[:punct:]])", "\\1 ", "A.B.C.", perl=TRUE)
# "A. B. C. "
I do not yet have an intuition for whether the spaces in "A. B. C. " would
adversely affect the language model. But this is going off-topic.
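To make the concern concrete, here is a small sketch. The first call pads every punctuation mark and so also breaks abbreviations and hyphenated words. The second restricts the padding to a hand-picked set of clause-level marks; that set is purely an assumption for illustration, not a recommendation for any particular language model.

```r
x <- c("A.B.C.", "word1-word2")

# Padding every punctuation mark also splits abbreviations and hyphenated words.
padded <- gsub("([[:punct:]])", "\\1 ", x)
print(padded)
# "A. B. C. " "word1- word2"

# Assumption for illustration: pad only , ; : ! ? and leave "." and "-" alone.
s <- "A.B.C. said word1-word2, ok"
tokens <- strsplit(gsub("([,;:!?])", " \\1 ", s), " +")[[1]]
print(tokens)
# "A.B.C." "said" "word1-word2" "," "ok"
```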
Sincerely,
Leonard
On 5/6/2023 1:35 AM, Bert Gunter wrote:
> Primarily for my own amusement, here is a way to do what I think you
> wanted without look-aheads/behinds
>
> strsplit(gsub("([[:punct:]])"," \\1 ","a bc,def, adef,x; ,,gh"), " +")
> [[1]]
> [1] "a" "bc" "," "def" "," "adef" "," "x" ";"
> [10] "," "," "gh"
>
> I certainly would *not* claim that it is in any way superior to
> anything that has already been suggested -- indeed, probably the
> contrary. But it's simple (as am I).
>
> Cheers,
> Bert
>
> On Fri, May 5, 2023 at 2:54 PM Leonard Mada via R-help
> <r-help using r-project.org> wrote:
>
> Dear Avi,
>
> Punctuation marks are used in various NLP language models. Preserving
> the "," is therefore useful in such scenarios and Regex are useful to
> accomplish this (especially if you have sufficient experience with
> such expressions).
>
> I observed only an odd behaviour using strsplit: the example string is
> constructed; but it is always wise to test a Regex expression against
> various scenarios. It is usually hard to predict what special cases
> will occur in a specific corpus.
>
> strsplit("a bc,def, adef ,,gh", " |(?=,)|(?<=,)(?![ ])", perl=T)
> # "a" "bc" "," "def" "," "" "adef" "," "," "gh"
>
> stringi::stri_split("a bc,def, adef ,,gh", regex=" |(?=,)|(?<=,)(?![ ])")
> # "a" "bc" "," "def" "," "adef" "" "," "," "gh"
>
> stringi::stri_split("a bc,def, adef ,,gh", regex=" |(?<! )(?=,)|(?<=,)(?![ ])")
> # "a" "bc" "," "def" "," "adef" "," "," "gh"
>
> # Expected:
> # "a" "bc" "," "def" "," "adef" "," "," "gh"
> # see 2nd instance of stringi::stri_split
>
>
> Sincerely,
>
>
> Leonard
>
>
> On 5/5/2023 11:20 PM, avi.e.gross using gmail.com wrote:
> > Leonard,
> >
> > It can be helpful to spell out your intent in English or some of us
> > have to go back to the documentation to remember what some of the
> > operators do.
> >
> > Your text being searched seems to be an example of items between
> > commas with an optional space after some commas and in one case,
> > nothing between commas.
> >
> > So what is your goal for the example, and in general? You mention a
> > bit unclearly at the end some of what you expect and I think it would
> > be clearer if you also showed exactly the output you would want.
> >
> > I saw some other replies that addressed what you wanted and am going
> > to reply in another direction.
> >
> > Why do things the hard way using things like lookahead or lookbehind?
> > Would several steps get you the result way more clearly?
> >
> > For the sake of argument, you either want what reading in a CSV file
> > would supply, or something else. Since you are not simply splitting on
> > commas, it sounds like something else. But what exactly else?
> > Something as simple as this on just a comma produces results including
> > empty strings and embedded leading or trailing spaces:
> >
> > strsplit("a bc,def, adef ,,gh", ",")
> > [[1]]
> > [1] "a bc" "def" " adef " "" "gh"
> >
> > That can of course be handled by, for example, trimming the result
> > after unlisting the odd way strsplit returns results:
> >
> > library("stringr")
> > str_squish(unlist(strsplit("a bc,def, adef ,,gh", ",")))
> >
> > [1] "a bc" "def" "adef" "" "gh"
> >
> > Now do you want the empty string to be something else, such as an NA?
> > That can be done too with another step.
> >
> > And a completely different variant can be used to read in your
> > one-line CSV as text using standard overkill tools:
> >
> >> read.table(text="a bc,def, adef ,,gh", sep=",")
> >     V1  V2     V3 V4 V5
> > 1 a bc def  adef  NA gh
> >
> > The above is a vector of texts. But if you simply want to reassemble
> > your initial string cleaned up a bit, you can use paste to put back
> > commas, as in a variation of the earlier example:
> >
> >> paste(str_squish(unlist(strsplit("a bc,def, adef ,,gh", ","))), collapse=",")
> > [1] "a bc,def,adef,,gh"
> >
> > So my question is whether using advanced methods is really necessary
> > for your case, or even particularly efficient. If efficiency matters,
> > often, it is better to use tools without regular expressions such as
> > paste0() when they meet your needs.
> >
> > Of course, unless I know what you are actually trying to do, my
> > remarks may be not useful.
> >
> >
> >
> > -----Original Message-----
> > From: R-help <r-help-bounces using r-project.org> On Behalf Of Leonard Mada via R-help
> > Sent: Thursday, May 4, 2023 5:00 PM
> > To: R-help Mailing List <r-help using r-project.org>
> > Subject: [R] Regex Split?
> >
> > Dear R-Users,
> >
> > I tried the following 3 Regex expressions in R 4.3:
> > strsplit("a bc,def, adef ,,gh", " |(?=,)|(?<=,)(?![ ])", perl=T)
> > # "a" "bc" "," "def" "," "" "adef" "," "," "gh"
> >
> > strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?![ ])", perl=T)
> > # "a" "bc" "," "def" "," "" "adef" "," "," "gh"
> >
> > strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?=[^ ])", perl=T)
> > # "a" "bc" "," "def" "," "" "adef" "," "," "gh"
> >
> >
> > Is this correct?
> >
> >
> > I feel that:
> > - none should return (after "def"): ",", "";
> > - the first one could also return "", "," (but probably not; not fully
> > sure about this);
> >
> >
> > Sincerely,
> >
> >
> > Leonard
> >
> > ______________________________________________
> > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>