[R] Regular Expressions + Matrices

Rui Barradas ruipbarradas at sapo.pt
Fri Aug 10 20:35:40 CEST 2012


Hello,

My code doesn't account for a point you've made clear in this post. Comments inline.
On 10-08-2012 19:05, Fred G wrote:
> Thanks Arun. The only issue is that I need the code to be very
> generalizable, such that the grep() really has to be if the first string up
> to the whitespace in a row (ie "New", "Boston", "Washington", "Detroit"
> below) is the same as the first string up to the whitespace in the row
> directly below it

Does this mean that "New York" ---> "New" in one row shouldn't match 
"Other New" in the next row because "New" is not the first string up to 
the whitespace? If this is the case, modify my earlier code to


fun <- function(i, x){
     if(x[i, "ID"] != x[i + 1, "ID"]){
         # keep only the first string, up to the first whitespace
         s1 <- unlist(strsplit(x[i, "NAME"], "[[:space:]]"))[1]
         s2 <- unlist(strsplit(x[i + 1, "NAME"], "[[:space:]]"))[1]
         # exact comparison: grepl(s1, s2) would also accept "New" vs "Newark"
         if(s1 == s2) return(TRUE)
     }
     FALSE
}

If it isn't the case, do nothing.
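
For completeness, here is a minimal sketch of how a row-pair test like the one above could be applied to the example data from this thread. It is only an illustration: the names dat, idx, rows and result are mine, strip.white = TRUE is added so the padded names are trimmed on reading, and the exact comparison s1 == s2 is an assumption about what "the same first string" should mean.

```r
# Example data from the thread; strip.white = TRUE trims the padding around fields
dat <- read.table(text = "
ID, NAME, YEAR, SOURCE
1, New York Mets, 1900, ESPN
2, New York Yankees, 1920, Cooperstown
3, Boston Redsox, 1918, ESPN
4, Washington Nationals, 2010, ESPN
5, Detroit Tigers, 1990, ESPN
", sep = ",", header = TRUE, stringsAsFactors = FALSE, strip.white = TRUE)

# TRUE when rows i and i + 1 have different IDs but the same first string
fun <- function(i, x){
    if(x[i, "ID"] != x[i + 1, "ID"]){
        s1 <- unlist(strsplit(x[i, "NAME"], "[[:space:]]"))[1]
        s2 <- unlist(strsplit(x[i + 1, "NAME"], "[[:space:]]"))[1]
        if(s1 == s2) return(TRUE)
    }
    FALSE
}

# Test every pair of consecutive rows (stop at the second-to-last row)
idx <- which(vapply(seq_len(nrow(dat) - 1), fun, logical(1), x = dat))

# Keep both rows of each matching pair, each row only once
rows <- sort(unique(c(idx, idx + 1)))
result <- dat[rows, ]
result

# To copy the rows to a new file (the file name is hypothetical):
# write.csv(result, "matched_pairs.csv", row.names = FALSE)
```

With the five example rows, only the pair formed by rows 1 and 2 passes the test, so result holds the two "New York" rows.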

Rui Barradas

> , AND the ID's are different, then copy.  The actual file
> has thousands of different IDs and names...
>
> On Fri, Aug 10, 2012 at 2:01 PM, arun <smartpink111 at yahoo.com> wrote:
>
>>
>> Hi,
>>
>> Try this:
>> dat1<-read.table(text="
>> ID,    NAME,    YEAR,    SOURCE
>> 1,    New York Mets,    1900,    ESPN
>> 2,    New York Yankees,    1920,    Cooperstown
>> 3,    Boston Redsox,    1918,    ESPN
>> 4,    Washington Nationals,    2010,    ESPN
>> 5,    Detroit Tigers,    1990,    ESPN
>> ",sep=",",header=TRUE,stringsAsFactors=FALSE)
>>
>>   index<-grep("New York.*",dat1$NAME)
>> dat1[index,]
>> #  ID             NAME YEAR      SOURCE
>> #1  1    New York Mets 1900        ESPN
>> #2  2 New York Yankees 1920 Cooperstown
>>
>> A.K.
>>
>>
>>
>> ----- Original Message -----
>> From: Fred G <bayespokerguy at gmail.com>
>> To: r-help at r-project.org
>> Cc:
>> Sent: Friday, August 10, 2012 1:41 PM
>> Subject: [R] Regular Expressions + Matrices
>>
>> Hi all,
>>
>> My code looks like the following:
>> inname = read.csv("ID_error_checker.csv", as.is=TRUE)
>> outname = read.csv("output.csv", as.is=TRUE)
>>
>> #My algorithm is the following:
>> #for line in inname
>> #if first string up to whitespace in row in inname$name = first string up to whitespace in row + 1 in inname$name
>> #AND ID in inname$ID for the top row NOT EQUAL ID in inname$ID for the row below it
>> #copy these two lines to a new file
>>
>> In other words, if the name (up to the first whitespace) in the first row
>> equals the name in the second row (etc for whole file) and the ID in the
>> first row does not equal the ID in the second row, copy both of these rows
>> in full to a new file.  Only caveat is that I want a regular expression not
>> to take the full names, but just the first string up to the first
>> whitespace in the inname$name column (ie if row1 has a name of: New York
>> Mets and row2 has a name of New York Yankees, I would want both of these
>> rows to be copied in full since "New" is the same in both...)
>>
>> Here is some example data:
>> ID  NAME                  YEAR  SOURCE       NOTES
>> 1   New York Mets         1900  ESPN
>> 2   New York Yankees      1920  Cooperstown
>> 3   Boston Redsox         1918  ESPN
>> 4   Washington Nationals  2010  ESPN
>> 5   Detroit Tigers        1990  ESPN
>>
>> The desired output would be:
>> ID  NAME                  YEAR  SOURCE
>> 1   New York Mets         1900  ESPN
>> 2   New York Yankees      1920  Cooperstown
>>
>> Thanks so much!
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>>


