[R] Search and extract string function

Marc Schwartz marc_schwartz at me.com
Thu Jul 15 23:17:04 CEST 2010


On Jul 15, 2010, at 11:27 AM, AndrewPage wrote:

> 
> Actually I have one more question that's somewhat related-- I'm starting out
> by importing a .txt file that isn't divided into vectors and is at times
> inconsistent with regards to spacing, indents, etc., so I can't rely on
> those.  It looks something like this:
> 
> 
> "Drink=Coffee:Location=Office:Time=Morning:Market=Flat 
> 
> Drink=Water:Location=Office:Time=Afternoon:Market=Up 
> 
> Drink=Water:Location=Gym:Time=Evening:Market=Closed 
> Drink=Wine:Location=Restaurant:Time=LateEvening:Market=Closed 
>           Drink=Coffee:Location=Office:Time=Morning:Market=Flat 
> Drink=Water:Location=Office:Time=Afternoon:Market=Up 
> 
>    Drink=Water:Location=Gym:Time=Evening:Market=Closed 
> Drink=Wine:Location=Restaurant:Time=LateEvening:Market=Closed
> Drink=Coffee:Location=Office:Time=Morning:Market=Flat 
> 
> Drink=Water:Location=Office:Time=Afternoon:Market=Up 
> 
> Drink=Water:Location=Gym:Time=Evening:Market=Closed 
> 
> Drink=Wine:Location=Restaurant:Time=LateEvening:Market=Closed"
> 
> 
> 
> How can I take a single string like this and divide it into twelve vectors,
> like this:
> 
> FixedData
> [1] "Drink=Coffee:Location=Office:Time=Morning:Market=Flat"        
> [2] "Drink=Water:Location=Office:Time=Afternoon:Market=Up"         
> [3] "Drink=Water:Location=Gym:Time=Evening:Market=Closed"          
> [4] "Drink=Wine:Location=Restaurant:Time=LateEvening:Market=Closed"
> [5] "Drink=Coffee:Location=Office:Time=Morning:Market=Flat"        
> [6] "Drink=Water:Location=Office:Time=Afternoon:Market=Up"         
> [7] "Drink=Water:Location=Gym:Time=Evening:Market=Closed"          
> [8] "Drink=Wine:Location=Restaurant:Time=LateEvening:Market=Closed"
> [9] "Drink=Coffee:Location=Office:Time=Morning:Market=Flat"        
> [10] "Drink=Water:Location=Office:Time=Afternoon:Market=Up"         
> [11] "Drink=Water:Location=Gym:Time=Evening:Market=Closed"          
> [12] "Drink=Wine:Location=Restaurant:Time=LateEvening:Market=Closed"
> 
> Thanks again for all of the help!


If each of the text lines in the file is in fact on a separate line, then the lines are separated by line endings (LF, or CR/LF on Windows) and can be read into R line by line using readLines().
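
As a rough sketch of that reading step (the temporary file and its three sample lines are stand-ins of my own, not your actual file), something along these lines should work:

```r
# Sketch only: write a few sample lines to a temporary file, then read them
# back. The file name and contents are assumptions standing in for the real
# .txt file.
tmp <- tempfile(fileext = ".txt")
writeLines(c("Drink=Coffee:Location=Office:Time=Morning:Market=Flat ",
             "",
             "   Drink=Water:Location=Gym:Time=Evening:Market=Closed "),
           tmp)

# readLines() returns a character vector with one element per line of the
# file; blank lines come back as empty strings
Lines <- readLines(tmp)
length(Lines)

# Clean up the temporary file
unlink(tmp)
```

With a real file you would pass its path to readLines() directly instead of creating a temporary one.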

Having done so by copying the above from the clipboard, and presuming that the quotes are not part of the file input, I get the following:

> Lines
 [1] "Drink=Coffee:Location=Office:Time=Morning:Market=Flat "          
 [2] ""                                                                
 [3] "Drink=Water:Location=Office:Time=Afternoon:Market=Up "           
 [4] ""                                                                
 [5] "Drink=Water:Location=Gym:Time=Evening:Market=Closed "            
 [6] "Drink=Wine:Location=Restaurant:Time=LateEvening:Market=Closed "  
 [7] "          Drink=Coffee:Location=Office:Time=Morning:Market=Flat "
 [8] "Drink=Water:Location=Office:Time=Afternoon:Market=Up "           
 [9] ""                                                                
[10] "   Drink=Water:Location=Gym:Time=Evening:Market=Closed "         
[11] "Drink=Wine:Location=Restaurant:Time=LateEvening:Market=Closed"   
[12] "Drink=Coffee:Location=Office:Time=Morning:Market=Flat "          
[13] ""                                                                
[14] "Drink=Water:Location=Office:Time=Afternoon:Market=Up "           
[15] ""                                                                
[16] "Drink=Water:Location=Gym:Time=Evening:Market=Closed "            
[17] ""                                                                
[18] "Drink=Wine:Location=Restaurant:Time=LateEvening:Market=Closed"   



Even with this irregular structure, you can still use:

Res1 <- gsub(".*Location=(.+):Time=.*", "\\1", Lines)

> Res1
 [1] "Office"     ""           "Office"     ""           "Gym"       
 [6] "Restaurant" "Office"     "Office"     ""           "Gym"       
[11] "Restaurant" "Office"     ""           "Office"     ""          
[16] "Gym"        ""           "Restaurant"


I can get rid of the blanks by using:

> Res1[Res1 != ""]
 [1] "Office"     "Office"     "Gym"        "Restaurant" "Office"    
 [6] "Office"     "Gym"        "Restaurant" "Office"     "Office"    
[11] "Gym"        "Restaurant"
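
If you need fields other than Location, the same idea can be wrapped in a small function. This is only a sketch: extract_field() is a name I made up, not something in R, and it assumes that values never contain a colon or whitespace, as in your sample.

```r
# A shortened stand-in for the Lines vector shown above
Lines <- c("Drink=Coffee:Location=Office:Time=Morning:Market=Flat ",
           "",
           "   Drink=Water:Location=Gym:Time=Evening:Market=Closed ")

# extract_field() is a hypothetical helper, not part of R.
# It keeps only the lines that contain "key=" and returns the text that
# follows it, up to the next colon or whitespace character.
extract_field <- function(x, key) {
  x <- x[grepl(paste0(key, "="), x, fixed = TRUE)]
  sub(paste0(".*", key, "=([^:[:space:]]+).*"), "\\1", x)
}

extract_field(Lines, "Location")
extract_field(Lines, "Market")
```

Because lines without the key are dropped before the substitution, there are no empty strings to remove afterwards.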


If you do want to get just the fixed data as you have above:

# Get rid of all spaces (leading, trailing, and any within a line)
Res2 <- gsub(" +", "", Lines)

# Get rid of blank lines
> Res2[Res2 != ""]
 [1] "Drink=Coffee:Location=Office:Time=Morning:Market=Flat"        
 [2] "Drink=Water:Location=Office:Time=Afternoon:Market=Up"         
 [3] "Drink=Water:Location=Gym:Time=Evening:Market=Closed"          
 [4] "Drink=Wine:Location=Restaurant:Time=LateEvening:Market=Closed"
 [5] "Drink=Coffee:Location=Office:Time=Morning:Market=Flat"        
 [6] "Drink=Water:Location=Office:Time=Afternoon:Market=Up"         
 [7] "Drink=Water:Location=Gym:Time=Evening:Market=Closed"          
 [8] "Drink=Wine:Location=Restaurant:Time=LateEvening:Market=Closed"
 [9] "Drink=Coffee:Location=Office:Time=Morning:Market=Flat"        
[10] "Drink=Water:Location=Office:Time=Afternoon:Market=Up"         
[11] "Drink=Water:Location=Gym:Time=Evening:Market=Closed"          
[12] "Drink=Wine:Location=Restaurant:Time=LateEvening:Market=Closed"
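
If your real file might also contain tabs, or values with spaces inside them, a slightly more conservative variant strips only leading and trailing whitespace. This is a sketch under those assumptions; the tab, the whitespace-only line, and the name Res3 are mine and do not come from your sample.

```r
# A shortened stand-in for Lines, with a tab and a whitespace-only line
# added as hypothetical cases
Lines <- c("Drink=Coffee:Location=Office:Time=Morning:Market=Flat ",
           "",
           "\tDrink=Water:Location=Gym:Time=Evening:Market=Closed ",
           "   ")

# Remove whitespace (spaces, tabs, etc.) at the start and end of each line
# only, leaving anything in the middle of a line untouched
Res3 <- gsub("^[[:space:]]+|[[:space:]]+$", "", Lines)

# Keep only the non-empty elements
Res3 <- Res3[nzchar(Res3)]
Res3
```

For the sample data in your post this should give the same twelve elements as Res2[Res2 != ""].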

HTH,

Marc


