[R] grep lines before or after pattern matched?
David Winsemius
dwinsemius at comcast.net
Mon Jul 11 21:53:40 CEST 2011
On Jul 11, 2011, at 3:33 PM, Joshua Wiley wrote:
> On Jul 11, 2011, at 12:00, Bert Gunter <gunter.berton at gene.com> wrote:
>
>> Simon:
>>
>> Basic basic stuff (not grep -- the stuff thereafter). Please read the
>> docs, especially the tutorial, An Intro to R.
>>
>> ... and Josh's solution can be shortened to (as he knows):
>>
>> index <- grep("Document+.", yourfile, value = FALSE) + c(2,4)
>>
>
> Really? Won't the 2 and 4 get recycled so that every other element
> returned from grep will have 2 or 4 added instead of 2 *and* 4?
>
> My understanding is that Simon has a single file with, for example,
> Document 1 on line 1, Document 2 on line 301, etc. He wants both
> the 2nd and 4th lines after each document, so lines 3, 5, 303, 305,
> but just doing + c(2,4) would only give 3, 305.
So:
rep(index, each=2) + c(2,4)
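
A quick check on a toy index, assuming the Document lines sit at 1 and 301 as in Josh's example (the values are made up for illustration, not taken from Simon's file):

```r
# Toy positions of the "Document" lines (assumed for illustration only)
index <- c(1, 301)

# Plain recycling pairs 2 with the first match and 4 with the second
recycled <- index + c(2, 4)

# Repeating each match twice lets c(2, 4) recycle across every match
both <- rep(index, each = 2) + c(2, 4)

print(recycled)  # 3 305
print(both)      # 3 5 303 305
```

With more offsets the same idea should carry over as rep(index, each = length(offsets)) + offsets.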
--
David.
>
> Josh
>
>> -- Bert
>>
>> On Mon, Jul 11, 2011 at 11:19 AM, Joshua Wiley <jwiley.psych at gmail.com> wrote:
>>> Try this (untested as I'm on my iPhone now):
>>>
>>> index <- grep("Document+.", yourfile, value = FALSE)
>>> index <- c(index + 2, index + 4)
>>>
>>> You just need to make sure you avoid recycling, e.g.,
>>>
>>> 1:10 + c(2, 4) # not what you want
>>>
>>> If you want a sufficient number of lines that manually writing
>>> index + becomes cumbersome, you could use something like:
>>>
>>> as.vector(sapply(c(2, 4), "+", e2 = index))
>>>
>>> HTH,
>>>
>>> Josh
>>>
>>> On Jul 11, 2011, at 11:09, Simon Kiss <sjkiss at gmail.com> wrote:
>>>
>>>> Josh, that's amazing. Is there any way to have it grab two
>>>> different lines after the grep, say the second and the fourth
>>>> line? There's some other information in the text file I'd like to
>>>> grab. I could do two separate commands, but I'd like to know if
>>>> this could be done in one command...
>>>> Simon Kiss
>>>> On 2011-07-11, at 1:31 PM, Joshua Wiley wrote:
>>>>
>>>>> If you know you can find the start of the document (say that line
>>>>> always starts with Document...), then:
>>>>>
>>>>> grep("Document+.", yourfile, value = FALSE) + 4
>>>>>
>>>>> should give you the index of the line 4 lines after each line
>>>>> where Document occurred. No loop needed :)
>>>>>
>>>>> On Mon, Jul 11, 2011 at 10:25 AM, Simon Kiss <sjkiss at gmail.com> wrote:
>>>>>> Hi Josh,
>>>>>> Sorry for the insufficient introduction. This might work, but
>>>>>> I'm not sure.
>>>>>> The file that I have includes up to 100 documents (Document 1,
>>>>>> Document 2, Document 3....Document 100) with the newspaper name
>>>>>> following 4 lines below each Document number.
>>>>>> I'm using readLines to get the text file into R and then trying
>>>>>> to use grep to get the newspaper name for each record. But your
>>>>>> idea of indexing the text object read into R with the line
>>>>>> number where the newspaper name is found is a good one. I'll
>>>>>> just have to come up with a loop to tell R to get the 4th, 8th,
>>>>>> 12th, 16th line, etc.
>>>>>> I'll see if I can get that to work.
>>>>>> Simon
>>>>>> On 2011-07-11, at 12:45 PM, Joshua Wiley wrote:
>>>>>>
>>>>>>> Dear Simon,
>>>>>>>
>>>>>>> Maybe I don't understand properly... if you are doing this in R,
>>>>>>> can't you just pick the line you want?
>>>>>>>
>>>>>>> Josh
>>>>>>>
>>>>>>> ## print your data to clipboard
>>>>>>> cat("Document 1 of 100 \n \n \n Newspaper Name \n \n Day Date", file = "clipboard")
>>>>>>> ## read data in, and only select the 4th line to pass to grep()
>>>>>>> grep("pattern", x = readLines("clipboard")[4])
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jul 11, 2011 at 9:31 AM, Simon Kiss <sjkiss at gmail.com> wrote:
>>>>>>>> Dear colleagues,
>>>>>>>> I have a series of newspaper articles in a text file,
>>>>>>>> downloaded from a text file. They look as follows:
>>>>>>>>
>>>>>>>> Document 1 of 100
>>>>>>>> \n
>>>>>>>> \n
>>>>>>>> \n
>>>>>>>> Newspaper Name
>>>>>>>> \n
>>>>>>>> \n
>>>>>>>> Day Date
>>>>>>>>
>>>>>>>> I have a series of grep scripts that can extract the date and
>>>>>>>> convert it to a date object, but I can't figure out how to
>>>>>>>> grep the newspaper name. There is no field ID attached to
>>>>>>>> those lines. The best I can come up with would be to have the
>>>>>>>> program grep the four lines following each line matching the
>>>>>>>> pattern "Document [0-9]". There is an argument to grep in unix
>>>>>>>> that can do this: grep -A4 'pattern' infile > outfile, but I
>>>>>>>> don't know if there is an equivalent argument in R.
>>>>>>>>
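
As far as I know, grep() in R has no counterpart to -A, but the line numbers it returns can be offset by hand. A minimal sketch, where lines_after() is a hypothetical helper (not part of base R) and the sample text is invented to match the layout described above:

```r
# Hypothetical helper: return the lines found at the given offsets
# after each line that matches the pattern.
lines_after <- function(pattern, x, offsets) {
  hits <- grep(pattern, x)
  # One row per match, one column per offset
  pos <- outer(hits, offsets, "+")
  # Go match by match, then drop positions that run past the end
  pos <- as.vector(t(pos))
  x[pos[pos <= length(x)]]
}

# Invented sample, shaped like what readLines() might return
txt <- c("Document 1 of 100", "", "", "", "The Gazette", "", "", "Mon Jul 11",
         "Document 2 of 100", "", "", "", "The Herald", "", "", "Tue Jul 12")

papers        <- lines_after("^Document [0-9]", txt, 4)
name_and_date <- lines_after("^Document [0-9]", txt, c(4, 7))
print(papers)
print(name_and_date)
```

Unlike grep -A4, this returns only the lines at the requested offsets, not the whole block after each match; passing offsets = 1:4 should give something closer to the unix behaviour.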
>
David Winsemius, MD
West Hartford, CT