[R] splitting first 10 words in a string

David Winsemius dwinsemius at comcast.net
Mon Nov 1 23:32:17 CET 2010


On Nov 1, 2010, at 5:52 PM, Phil Spector wrote:

>  -
>   Does this example do what you want?
>
>> mysentences = c('Here is a sentence that has a bunch of words in it',
>>                 'Here is another sentence that also has a bunch of words',
>>                 'I have yet another sentence and it also has a whole bunch of words')
>> data.frame(mysentences, do.call(rbind, lapply(strsplit(mysentences, ' +'), '[', 1:10)))
>                                                          mysentences   X1   X2
> 1                 Here is a sentence that has a bunch of words in it Here   is
> 2            Here is another sentence that also has a bunch of words Here   is
> 3 I have yet another sentence and it also has a whole bunch of words    I have
>       X3       X4       X5   X6  X7    X8    X9   X10
> 1       a sentence     that  has   a bunch    of words
> 2 another sentence     that also has     a bunch    of
> 3     yet  another sentence  and  it  also   has     a

Matevž;

Be alert to what the data.frame function does with character
vectors. Unless you tell it not to, it converts every character
vector to a factor. (A major source of confusion for R newbies.)
With Phil's solution above you can prevent this with:

data.frame(mysentences, do.call(rbind, lapply(strsplit(mysentences, ' +'), '[', 1:10)),
           stringsAsFactors=FALSE)
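
To see the difference the argument makes, here is a minimal sketch that builds the same data frame both ways and compares the column classes. The two sentences are shortened illustrative data, and the object names first10, as_factor and as_char are made up for this example. Note that from R 4.0.0 onwards the default is stringsAsFactors=FALSE, so the sketch sets the argument explicitly in both calls to behave the same in any version.

```r
# Illustrative data; the object names below are assumptions for this sketch.
mysentences <- c("Here is a sentence that has a bunch of words in it",
                 "Here is another sentence that also has a bunch of words")

# First ten words of each sentence, one row per sentence
# (a sentence with fewer than ten words would be padded with NA).
first10 <- do.call(rbind, lapply(strsplit(mysentences, " +"), "[", 1:10))

# Same call, once with conversion to factors and once without.
as_factor <- data.frame(mysentences, first10, stringsAsFactors = TRUE)
as_char   <- data.frame(mysentences, first10, stringsAsFactors = FALSE)

sapply(as_factor, class)  # all columns are factors
sapply(as_char, class)    # all columns are character
```

Checking str() or sapply(df, class) right after building a data frame is a cheap way to catch an unwanted conversion early.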

Or if cbind were applied to my solution at the end of this email:

cbind(worddf, t(sapply(strsplit(worddf$words, " "), "[", 1:10)),
      stringsAsFactors=FALSE)
 > str( cbind(worddf, t(sapply(strsplit(worddf$words, " "), "[", 1:10)),
              stringsAsFactors=FALSE) )
'data.frame':	3 obs. of  11 variables:
  $ words: chr  "I have a columnn with text that has quite a few words in it." "I would like to split these words in separate columns" "but just first ten words in the string. Is that possible in R?"
  $ 1    : chr  "I" "I" "but"
  $ 2    : chr  "have" "would" "just"
  $ 3    : chr  "a" "like" "first"
  $ 4    : chr  "columnn" "to" "ten"
  $ 5    : chr  "with" "split" "words"
  $ 6    : chr  "text" "these" "in"
  $ 7    : chr  "that" "words" "the"
  $ 8    : chr  "has" "in" "string."
  $ 9    : chr  "quite" "separate" "Is"
  $ 10   : chr  "a" "columns" "that"

cbind.data.frame is the method that gets invoked for that operation.
The result has the disadvantage that the column names must be
enclosed in quotes (or backticks) to access them with the "$"
operator, since they start with numerals.
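
One way around the quoting, sketched here under the assumption that you are free to choose the column names: give the word matrix column names before binding it. The "word" prefix and the object names parts, named and unnamed are arbitrary choices for this illustration, not anything R requires.

```r
words <- c("I have a columnn with text that has quite a few words in it.",
           "I would like to split these words in separate columns")
worddf <- data.frame(words = words, stringsAsFactors = FALSE)

# First ten words per row, as a character matrix.
parts <- t(sapply(strsplit(worddf$words, " "), "[", 1:10))

# Without column names the new columns end up being called "1" ... "10",
# so they have to be quoted or backticked after "$".
unnamed <- cbind(worddf, parts, stringsAsFactors = FALSE)
unnamed$`4`

# With syntactic names assigned first, plain "$" access works.
colnames(parts) <- paste0("word", 1:10)
named <- cbind(worddf, parts, stringsAsFactors = FALSE)
named$word4
```

Naming the matrix columns up front also keeps the names stable if more columns are bound on later.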

(Or you could just deal with the factor type.)
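
If the text column is already a factor, as Opis is in the str() output quoted below, a rough sketch of dealing with it looks like this. Only the column name Opis comes from the thread; the two values and the "Column" names are invented for the illustration. The one step that matters is as.character(), because strsplit() only accepts character vectors.

```r
# Hypothetical stand-in for the poster's data; values are made up.
df <- data.frame(Opis = factor(c("I have a sentence with ten words",
                                 "short one")))

# Convert the factor to character before splitting on runs of spaces.
pieces <- strsplit(as.character(df$Opis), " +")

# Keep the first ten words; rows with fewer words are padded with NA.
first10 <- t(sapply(pieces, "[", 1:10))
colnames(first10) <- paste0("Column", 1:10)

# Bind the words back on, this time keeping them as factors.
out <- cbind(df, first10, stringsAsFactors = TRUE)
str(out)
```

Columns past the last word of every row contain only NA, which may or may not be what you want downstream.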

--
David.

>
> 					- Phil Spector
> 					 Statistical Computing Facility
> 					 Department of Statistics
> 					 UC Berkeley
> 					 spector at stat.berkeley.edu
>
>
> On Mon, 1 Nov 2010, Matevž Pavlič wrote:
>
>> ...I would like, e.g., to split this sentence from the field Opis in a
>> data.frame:
>>
>> Opis : "I have a sentence with ten words", so that it would convert
>> to something like this:
>>
>> Opis : "I have a sentence with ten words"; Column1 : "I";
>> Column2 : "have"; Column3 : "a"; Column4 : "sentence"; Column5 :
>> "with"; Column6 : "ten"; Column7 : "words"
>>
>> ...or in a data.frame something like this (as I understand it):
>>
>> 'data.frame':   xx obs. of  12 variables:
>> $ Opis    : factor "I have a sentence with ten words";
>> $ Column1 : factor "I";
>> $ Column2 : factor "have";
>> $ Column3 : factor "a";
>> $ Column4 : factor "sentence";
>> $ Column5 : factor "with";
>> $ Column6 : factor "ten";
>> $ Column7 : factor "words"
>>
>> Hope that explains it better. I am still having some trouble
>> understanding R and all.
>> m
>>
>>
>> -----Original Message-----
>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Matevž Pavlič
>> Sent: Monday, November 01, 2010 10:34 PM
>> To: David Winsemius
>> Cc: r-help at r-project.org
>> Subject: Re: [R] spliting first 10 words in a string
>>
>> Hi,
>>
>> I am sorry, I will try to be more exact from now on...
>>
>> I have a data.frame with a field called Opis. It contains
>> sentences that I would like to split into words, or fields in the
>> data.frame... when I say columns I mean as in an Excel table. I would
>> like to split "Opis" into ten fields from the first ten words in the
>> Opis field.
>> Here is an example of my data.frame.
>>
>> 'data.frame':   22928 obs. of  12 variables:
>> $ VrtinaID        : int  1 1 1 1 2 2 2 2 2 2 ...
>> $ ZapStev         : int  1 2 3 4 1 2 3 4 5 6 ...
>> $ GlobinaOd       : num  0 0.8 9.2 10.1 0 0.9 2.6 4.9 6.8 7.3 ...
>> $ GlobinaDo       : num  0.8 9.2 10.1 11 0.9 2.6 4.9 6.8 7.3 8.2 ...
>> $ Opis            : Factor w/ 12754 levels "","(MIVKA) DROBEN  
>> MELJAST PESEK, GOST, SIVORJAV",..: 2060 11588 2477 11660 7539 3182  
>> 7884 9123 2500 4756 ...
>> $ ACklasifikacija : Factor w/ 290 levels "","(CL)","(CL)/(SC)",..:  
>> 154 125 101 101 NA 106 125 80 106 101 ...
>> $ GeolNastOd      : num  0 0.8 9.2 10.1 0 0.9 2.6 4.9 6.8 7.3 ...
>> $ GeolNastDo      : num  0.8 9.2 10.1 11 0.9 2.6 4.9 6.8 7.3 8.2 ...
>> $ GeolNastOpis    : Factor w/ 113 levels "","B. M. S.",..: 56 53 53  
>> 53 56 53 53 53 53 53 ...
>> $ NacinVrtanjaOd  : num  0e+00 1e+09 1e+09 1e+09 0e+00 ...
>> $ NacinVrtanjaDo  : num  1.1e+01 1.0e+09 1.0e+09 1.0e+09 1.0e+01 ...
>> $ NacinVrtanjaOpis: Factor w/ 43 levels "","H. N.","IZKOP",..: 26 1  
>> 1 1 26 1 1 1 1 1 ...
>>
>> Hope that explains better...
>> Thank you, m
>>
>> -----Original Message-----
>> From: David Winsemius [mailto:dwinsemius at comcast.net]
>> Sent: Monday, November 01, 2010 10:13 PM
>> To: Matevž Pavlič
>> Cc: r-help at r-project.org
>> Subject: Re: [R] spliting first 10 words in a string
>>
>>
>> On Nov 1, 2010, at 4:39 PM, Matevž Pavlič wrote:
>>
>>> Hi all,
>>>
>>>
>>>
>>> I have a columnn with text that has quite a few words in it. I would
>>> like to split these words in separate columns, but just first ten
>>> words in the string. Is that possible in R?
>>>
>>>
>>
>> Not sure what a column means to you. It's not a precisely defined R
>> type or class. (And you are requested to offer a concrete example
>> rather than making us guess.)
>>
>> > words <- "I have a columnn with text that has quite a few words in it. I would like to split these words in separate columns, but just first ten words in the string. Is that possible in R?"
>>
>> > strsplit(words, " ")[[1]][1:10]
>> [1] "I"       "have"    "a"       "columnn" "with"    "text"    "that"    "has"     "quite"   "a"
>>
>>
>> Or if in a dataframe:
>>
>> > words <- c("I have a columnn with text that has quite a few words in it.",
>>              "I would like to split these words in separate columns",
>>              "but just first ten words in the string. Is that possible in R?")
>> > worddf <- data.frame(words=words, stringsAsFactors=FALSE)
>>
>> > t(sapply(strsplit(worddf$words, " "), "[", 1:10) )
>>      [,1]  [,2]    [,3]    [,4]      [,5]    [,6]    [,7]    [,8]      [,9]       [,10]
>> [1,] "I"   "have"  "a"     "columnn" "with"  "text"  "that"  "has"     "quite"    "a"
>> [2,] "I"   "would" "like"  "to"      "split" "these" "words" "in"      "separate" "columns"
>> [3,] "but" "just"  "first" "ten"     "words" "in"    "the"   "string." "Is"       "that"
>>
>>
>> -- 
>> David Winsemius, MD
>> West Hartford, CT
>>

David Winsemius, MD
West Hartford, CT


