[R] extracting character values
arun
smartpink111 at yahoo.com
Sun Jan 13 18:12:32 CET 2013
Hi,
Not sure if this helps:
netw<-read.table(text="
lastname_initial, year
Aaron H, 1900
Beecher HW, 1947
Cannon JP, 1985
Stone WC, 1982
van der hoops bf, 1948
NA, 1976
",sep=",",header=TRUE,stringsAsFactors=FALSE)
# drop the trailing word (the initials), then trim surrounding whitespace
res1 <- sub("^[[:space:]]*(.*?)[[:space:]]*$", "\\1", gsub("\\w+$", "", netw[,1]))
res1[!is.na(res1)]
#[1] "Aaron" "Beecher" "Cannon" "Stone"
#[5] "van der hoops"
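One caveat: gsub("\\w+$", ...) only removes the initials when they are the very last characters of the cell, so a cell with a trailing space (like ' van der hoops bf ' in the question) would keep them, and a cell with no initials at all would lose its last word. A possible variant is sketched below. It assumes the rule stated in the question (initials are at most 2 letters); the helper name extract_lastname and the test vector are made up for illustration, and trimws() needs R 3.2.0 or later.

```r
# Hypothetical helper, not part of the original reply.
extract_lastname <- function(x) {
  x <- trimws(x)  # remove leading and trailing whitespace first
  # remove a final token of 1 or 2 letters (the initials), if there is one
  sub("[[:space:]]+[[:alpha:]]{1,2}$", "", x)
}

names_in <- c("Aaron H", " van der hoops bf ", "Stone WC", NA, "Beecher")
res2 <- extract_lastname(names_in)
res2
#[1] "Aaron"         "van der hoops" "Stone"         NA              "Beecher"
```

NA cells stay NA, so they can be dropped afterwards with res2[!is.na(res2)] as above.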
A.K.
----- Original Message -----
From: Biau David <djmbiau at yahoo.fr>
To: r help list <r-help at r-project.org>
Cc:
Sent: Sunday, January 13, 2013 3:53 AM
Subject: [R] extracting character values
Dear all,
I have a data frame of names (netw), with each cell containing the last name and initials of an author; some cells are NA. I would like to extract only the last name from each cell; this new data frame is called 'res'.
Here is what I do:
res <- data.frame(matrix(NA, nrow=nrow(netw), ncol=ncol(netw)))
for (i in seq_len(ncol(netw)))
{
  # first run of 3 or more lowercase letters in each cell of column i
  wh <- regexpr('[a-z]{3,}', as.character(netw[,i]))
  res[i] <- substring(as.character(netw[,i]), wh, wh + attr(wh,'match.length')-1)
}
The problem is that I cannot extract 'complex' names properly, such as ' van der hoops bf ': here I only get 'van', whereas the real last name is 'van der hoops' and 'bf' are the initials. Basically, the last name always has at least 3 consecutive letters, but it may consist of several groups of letters separated by one or more spaces; the cell may start with a space too; initials never have more than 2 letters.
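To illustrate the behaviour described above, here is a minimal sketch (the one-element vector x is made up for the example). regexpr() reports only the first match in each string, which is why only 'van' comes back, whereas gregexpr() reports every match:

```r
x <- " van der hoops bf "

# regexpr() finds only the first run of 3+ lowercase letters
first_run <- regmatches(x, regexpr("[a-z]{3,}", x))
first_run
#[1] "van"

# gregexpr() finds every such run; 'bf' is skipped because it has only 2 letters
all_runs <- regmatches(x, gregexpr("[a-z]{3,}", x))[[1]]
all_runs
#[1] "van"   "der"   "hoops"

# gluing the runs back together gives the full last name
full_name <- paste(all_runs, collapse = " ")
full_name
#[1] "van der hoops"
```

Note that this approach would also drop any part of a last name that is shorter than 3 letters, and it matches lowercase letters only, so it is a demonstration of the matching behaviour rather than a general solution.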
Would someone have a nice idea for that? Thanks,
David