[R] grep lines before or after pattern matched?

Bert Gunter gunter.berton at gene.com
Mon Jul 11 21:00:50 CEST 2011


Simon:

Basic, basic stuff (not grep -- the stuff thereafter). Please read the
docs, especially the tutorial, An Introduction to R.

... and Josh's solution can be shortened to (as he knows):

index <- c(outer(grep("Document+.", yourfile, value = FALSE), c(2, 4), "+"))

-- Bert
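
A minimal sketch of how the offsets interact with the match positions, in case it helps. The vector `yourfile` here is made-up sample data rather than Simon's file, and the pattern "^Document [0-9]" is an assumption based on his original question. The point it illustrates: plain `+` works element by element and recycles the shorter vector, while outer() adds every offset to every match.

```r
# Hypothetical sample: two records, eight lines each (not Simon's actual data)
yourfile <- c("Document 1 of 100", "", "Section A", "", "Newspaper A", "", "", "Mon Jul 11",
              "Document 2 of 100", "", "Section B", "", "Newspaper B", "", "", "Tue Jul 12")

# Positions of the header lines
hits <- grep("^Document [0-9]", yourfile)

# Element-wise addition pairs hits[1] with 2 and hits[2] with 4,
# so only one offset is applied to each match
paired <- hits + c(2, 4)

# outer() adds every offset to every hit; c() flattens the matrix
# column by column: all (hit + 2) values first, then all (hit + 4) values
index <- c(outer(hits, c(2, 4), "+"))

# Same result as writing the offsets out by hand
same <- identical(index, c(hits + 2, hits + 4))

yourfile[index]
```

With this sample, `yourfile[index]` returns the two "Section" lines followed by the two "Newspaper" lines.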

On Mon, Jul 11, 2011 at 11:19 AM, Joshua Wiley <jwiley.psych at gmail.com> wrote:
> Try this (untested as I'm on my iPhone now):
>
> index <- grep("Document+.", yourfile, value = FALSE)
> index <- c(index + 2, index + 4)
>
> You just need to make sure you avoid recycling, e.g.,
>
> 1:10 + c(2, 4) # not what you want
>
> If you want a sufficient number of lines that manually writing index + becomes cumbersome, you could use something like:
>
> as.vector(sapply(c(2, 4), "+", e2 = index))
>
> HTH,
>
> Josh
>
> On Jul 11, 2011, at 11:09, Simon Kiss <sjkiss at gmail.com> wrote:
>
>> Josh, that's amazing. Is there any way to have it grab two different lines after the grep, say the second and the fourth line? There's some other information in the text file I'd like to grab.  I could do two separate commands, but I'd like to know if this could be done in one command...
>> Simon Kiss
>> On 2011-07-11, at 1:31 PM, Joshua Wiley wrote:
>>
>>> If you know you can find the start of the document (say that line
>>> always starts with Document...), then:
>>>
>>> grep("Document+.", yourfile, value = FALSE) + 4
>>>
>>> should give you the index of the 4th line after each line where
>>> Document occurred; use it to subset yourfile.  No loop needed :)
>>>
>>> On Mon, Jul 11, 2011 at 10:25 AM, Simon Kiss <sjkiss at gmail.com> wrote:
>>>> Hi Josh,
>>>> Sorry for the insufficient introduction. This might work, but I'm not sure.
>>>> The file that I have includes up to 100 documents (Document 1, Document 2, Document 3....Document 100) with the newspaper name following 4 lines below each Document number.
>>>> I'm using readLines to get the text file into R and then trying to use grep to get the newspaper name for each record. But your idea of indexing the text object read into R with the line number where the newspaper name is found is a good one.  I'll just have to come up with a loop to tell R to get the 4th, 8th, 12th, 16th line, etc.
>>>> I'll see if I can get that to work.
>>>> Simon
>>>> On 2011-07-11, at 12:45 PM, Joshua Wiley wrote:
>>>>
>>>>> Dear Simon,
>>>>>
>>>>> Maybe I don't understand properly....if you are doing this in R, can't
>>>>> you just pick the line you want?
>>>>>
>>>>> Josh
>>>>>
>>>>> ## print your data to clipboard
>>>>> cat("Document 1 of 100 \n \n \n Newspaper Name \n \n Day Date", file =
>>>>> "clipboard")
>>>>> ## read data in, and only select the 4th line to pass to grep()
>>>>> grep("pattern", x = readLines("clipboard")[4])
>>>>>
>>>>>
>>>>> On Mon, Jul 11, 2011 at 9:31 AM, Simon Kiss <sjkiss at gmail.com> wrote:
>>>>>> Dear colleagues,
>>>>>> I have a series of newspaper articles in a text file, downloaded from a text file.  They look as follows:
>>>>>>
>>>>>> Document 1 of 100
>>>>>> \n
>>>>>> \n
>>>>>> \n
>>>>>> Newspaper Name
>>>>>> \n
>>>>>> \n
>>>>>> Day Date
>>>>>>
>>>>>> I have a series of grep scripts that can extract the date and convert it to a date object, but I can't figure out how to grep the newspaper name.  There is no field ID attached to those lines. The best I can come up with would be to have the program grep the four lines following each line that matches the pattern "Document [0-9]".  There is an argument to grep in Unix that can do this (grep -A4 'pattern' infile > outfile), but I don't know if there is an equivalent argument in R.
>>>>>>
>>>>>> Any thoughts?
>>>>>> Yours, Simon Kiss
>>>>>> *********************************
>>>>>> Simon J. Kiss, PhD
>>>>>> Assistant Professor, Wilfrid Laurier University
>>>>>> 73 George Street
>>>>>> Brantford, Ontario, Canada
>>>>>> N3T 2C9
>>>>>> Cell: +1 905 746 7606
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Joshua Wiley
>>>>> Ph.D. Student, Health Psychology
>>>>> University of California, Los Angeles
>>>>> https://joshuawiley.com/



-- 
"Men by nature long to get on to the ultimate truths, and will often
be impatient with elementary studies or fight shy of them. If it were
possible to reach the ultimate truths without the elementary studies
usually prefixed to them, these would not be preparatory studies but
superfluous diversions."

-- Maimonides (1135-1204)

Bert Gunter
Genentech Nonclinical Biostatistics
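
For the original question (an R counterpart to grep -A4), the indexing idea from the thread can be wrapped in a small function. This is only a sketch under stated assumptions: `lines_after()` is a hypothetical helper, not something in base R, and the sample records and the pattern "^Document [0-9]" are made up for illustration. A temporary file stands in for the clipboard so the example does not depend on the platform.

```r
# Hypothetical helper (not part of base R): return the lines located
# `offsets` positions after each line of `x` that matches `pattern`.
# Positions past the end of `x` are dropped; results come back in file order.
lines_after <- function(x, pattern, offsets) {
  hits <- grep(pattern, x)
  idx <- c(outer(hits, offsets, "+"))
  x[sort(idx[idx <= length(x)])]
}

# Made-up sample records written to a temporary file, then read back
sample_text <- c("Document 1 of 100", "", "", "", "The Daily Example", "", "", "Monday July 11",
                 "Document 2 of 100", "", "", "", "The Evening Sample", "", "", "Tuesday July 12")
tmp <- tempfile(fileext = ".txt")
writeLines(sample_text, tmp)
yourfile <- readLines(tmp)
unlink(tmp)

# The newspaper name sits 4 lines below each "Document" header
papers <- lines_after(yourfile, "^Document [0-9]", 4)
print(papers)
```

Passing `1:4` as `offsets` returns all four lines following each header, which is roughly what grep -A4 prints apart from the matching line itself.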


