[R] grep lines before or after pattern matched?

Mon Jul 11 21:53:40 CEST 2011

On Jul 11, 2011, at 3:33 PM, Joshua Wiley wrote:

> On Jul 11, 2011, at 12:00, Bert Gunter <gunter.berton at gene.com> wrote:
>
>> Simon:
>>
>> Basic basic stuff (not grep -- the stuff thereafter). Please read the
>> docs, especially the tutorial, An Intro to R.
>>
>> ... and Josh's solution can be shortened to (as he knows):
>>
>> index <- grep("Document+.", yourfile, value = FALSE) + c(2,4)
>>
>
> Really?  Won't the 2 and 4 get recycled so that every other element  
> returned from grep will have 2 or 4 added instead of 2 *and* 4?
>
> My understanding is that Simon has a single file with for example  
> Document 1 on line 1 Document 2 on line 301 etc. And he wants both  
> the 2nd and 4th lines after each document, so lines 3, 5, 303, 305  
> but just doing + c(2,4) would only give 3, 305.

So:

rep(index, each=2) + c(2,4)
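
As a quick check on a made-up vector (the contents and the stricter pattern "^Document [0-9]+" are my assumptions, not Simon's actual file), something like this should show the difference between the recycled and the repeated version:

```r
# Toy stand-in for readLines() output; the contents are invented.
yourfile <- c("Document 1 of 2", "", "Day Date A", "", "Newspaper A",
              "Document 2 of 2", "", "Day Date B", "", "Newspaper B")

# Line numbers of the document headers: 1 and 6
index <- grep("^Document [0-9]+", yourfile)

# Plain recycling pairs the offsets with the matches one-to-one: 3 10
wrong <- index + c(2, 4)

# Repeating each match once per offset gives both lines per document: 3 5 8 10
wanted <- rep(index, each = 2) + c(2, 4)

# One row per document, one column per offset
fields <- matrix(yourfile[wanted], ncol = 2, byrow = TRUE)
```

Each row of fields then holds the two lines picked up for one document.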

-- 
David.

>
> Josh
>
>> -- Bert
>>
>> On Mon, Jul 11, 2011 at 11:19 AM, Joshua Wiley <jwiley.psych at gmail.com> wrote:
>>> Try this (untested as I'm on my iPhone now):
>>>
>>> index <- grep("Document+.", yourfile, value = FALSE)
>>> index <- c(index + 2, index + 4)
>>>
>>> You just need to make sure you avoid recycling, e.g.,
>>>
>>> 1:10 + c(2, 4) # not what you want
>>>
>>> If you want a sufficient number of lines that manually writing  
>>> index + becomes cumbersome, you could use something like:
>>>
>>> as.vector(sapply(c(2, 4), "+", e2 = index))
>>>
>>> HTH,
>>>
>>> Josh
>>>
>>> On Jul 11, 2011, at 11:09, Simon Kiss <sjkiss at gmail.com> wrote:
>>>
>>>> Josh, that's amazing. Is there any way to have it grab two  
>>>> different lines after the grep, say the second and the fourth  
>>>> line? There's some other information in the text file I'd like to  
>>>> grab.  I could do two separate commands, but I'd like to know if  
>>>> this could be done in one command...
>>>> Simon Kiss
>>>> On 2011-07-11, at 1:31 PM, Joshua Wiley wrote:
>>>>
>>>>> If you know you can find the start of the document (say that line
>>>>> always starts with Document...), then:
>>>>>
>>>>> grep("Document+.", yourfile, value = FALSE) + 4
>>>>>
>>>>> should give you the index of the 4th line after each line where
>>>>> Document occurred.  No loop needed :)
>>>>>
>>>>> On Mon, Jul 11, 2011 at 10:25 AM, Simon Kiss <sjkiss at gmail.com>  
>>>>> wrote:
>>>>>> Hi Josh,
>>>>>> Sorry for the insufficient introduction. This might work, but  
>>>>>> I'm not sure.
>>>>>> The file that I have includes up to 100 documents (Document 1,  
>>>>>> Document 2, Document 3....Document 100) with the newspaper name  
>>>>>> following 4 lines below each Document number.
>>>>>> I'm using readLines to get the text file into R and then trying
>>>>>> to use grep to get the newspaper name for each record. But your
>>>>>> idea of indexing the text object read into R with the line
>>>>>> number where the newspaper name is found is a good one.  I'll
>>>>>> just have to come up with a loop to tell R to get the 4th, 8th,
>>>>>> 12th, 16th line, etc.
>>>>>> I'll see if I can get that to work.
>>>>>> Simon
>>>>>> On 2011-07-11, at 12:45 PM, Joshua Wiley wrote:
>>>>>>
>>>>>>> Dear Simon,
>>>>>>>
>>>>>>> Maybe I don't understand properly....if you are doing this in  
>>>>>>> R, can't
>>>>>>> you just pick the line you want?
>>>>>>>
>>>>>>> Josh
>>>>>>>
>>>>>>> ## print your data to clipboard
>>>>>>> cat("Document 1 of 100 \n \n \n Newspaper Name \n \n Day Date",
>>>>>>>     file = "clipboard")
>>>>>>> ## read data in, and only select the 4th line to pass to grep()
>>>>>>> grep("pattern", x = readLines("clipboard")[4])
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jul 11, 2011 at 9:31 AM, Simon Kiss <sjkiss at gmail.com>  
>>>>>>> wrote:
>>>>>>>> Dear colleagues,
>>>>>>>> I have a series of newspaper articles in a text file,  
>>>>>>>> downloaded from a text file.  They look as follows:
>>>>>>>>
>>>>>>>> Document 1 of 100
>>>>>>>> \n
>>>>>>>> \n
>>>>>>>> \n
>>>>>>>> Newspaper Name
>>>>>>>> \n
>>>>>>>> \n
>>>>>>>> Day Date
>>>>>>>>
>>>>>>>> I have a series of grep scripts that can extract the date and  
>>>>>>>> convert it to a date object, but I can't figure out how to  
>>>>>>>> grep the newspaper name.  There is no field ID attached to  
>>>>>>>> those lines. The best I can come up with would be to have the  
>>>>>>>> program grep the four lines following matching the pattern  
>>>>>>>> "Document [0-9]".  There is an argument to grep in unix
>>>>>>>> that can do this ...grep -A4 'pattern' infile>outfile, but I  
>>>>>>>> don't know if there is an equivalent argument in R.
>>>>>>>>
>
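
On the original grep -A4 question: as far as I know grep() in R has no context argument, so the index arithmetic above is the usual route. If the whole block of following lines is wanted, a small wrapper along these lines might do; grep_after is a hypothetical name of my own, not a base R function, and the test vector is invented:

```r
# Hypothetical helper (not part of base R): return each matching line plus
# the n lines that follow it, roughly like `grep -A n` on the command line.
grep_after <- function(pattern, x, n) {
  hits <- grep(pattern, x)
  # For every hit build hit, hit + 1, ..., hit + n (one column per hit)
  idx <- as.vector(outer(0:n, hits, "+"))
  # Drop positions past the end of x and any overlap between blocks
  idx <- unique(idx[idx <= length(x)])
  x[idx]
}

# Invented example input
x <- c("Document 1", "a", "b", "Document 2", "c")
one_after <- grep_after("^Document", x, 1)
```

Here one_after should contain each Document line followed by the single line after it.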

David Winsemius, MD
West Hartford, CT