[R] grep() exclude certain patterns?

Peng Yu pengyu.ut at gmail.com
Tue Dec 8 23:22:48 CET 2009

On Fri, Dec 4, 2009 at 11:17 PM, Greg Snow <Greg.Snow at imail.org> wrote:
> I am sure that you mentioned before that you are using 2.7.1, and possibly even why, but with the number of posts to this list each day and the number of different posters, I cannot keep track of what version everyone is using (well, I probably could, but I am unwilling to put in the time/effort required, and I don't expect anyone else to do it either).  Unless you remind us otherwise, we will generally assume that you are using a reasonably up to date version in answering questions.
> You will have a much easier time if you upgrade to a recent version of R, or at least get a copy of the docs for the recent version.  Even if you don't have rights on the computer you mainly use (common on university campuses), you can have a portable version.  I have a copy of R installed on a flash drive that I can run on someone else's computer without having to install anything on it, or look up a reference, etc.
> Think about how long it has been since you asked the original question, and you still don't have a usable (for you) answer.  What if instead you had done a little comparison between your version of R and the current version and phrased your question like:
> I would like to select some elements of a vector that do not match a given pattern, in R 2.10 I could use the grep function with the argument invert=TRUE, but I would like the script to be able to run on a computer that I do not control with only R version 2.7.1 which has the grep function, but not the invert argument.  Any suggestions?
> If that had been your original question (note that it shows what effort was made, what restrictions you are working under, and other details), I bet that you would have something productive by now.
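
For the record, selecting the non-matching elements without the invert argument can be sketched as below. This is only a minimal sketch under assumptions: the vector x and the pattern "^a" are made up for illustration, and it uses nothing beyond grep(), seq_along() and setdiff(), which I would expect an older R to have, though I have not checked that on 2.7.1.

```r
# Hypothetical example data; any character vector would do.
x <- c("apple", "banana", "cherry", "avocado")

# Indices of the elements that match the pattern.
hits <- grep("^a", x)

# Keep the elements whose index is NOT among the matches.
# setdiff() also behaves sensibly when nothing matches, whereas
# x[-hits] would return an empty vector in that case.
non_matching <- x[setdiff(seq_along(x), hits)]
print(non_matching)
```

Going through setdiff() rather than negative indexing is a deliberate choice here, because a pattern with no matches should leave the vector unchanged.
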
> See further comments interspersed below:
>> -----Original Message-----
>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Peng Yu
>> Sent: Friday, December 04, 2009 2:06 PM
>> To: r-help at stat.math.ethz.ch
>> Subject: Re: [R] grep() exclude certain patterns?
>> On Fri, Dec 4, 2009 at 2:35 PM, Greg Snow <Greg.Snow at imail.org> wrote:
>> > The invert argument seems a likely candidate, you could also do
>> perl=TRUE and use negations within the pattern (but that is probably
>> overkill for your original question).
> [snip]
>> > Could you explain to us the process that you use to search for
>> answers to your questions before posting?  You have been asking quite a
>> few questions that have answers out there if you can find them.  If you
>> tell us where you are looking (and why) then we may be able to suggest
>> some different search strategies that will help you find the answers
>> quicker.  Also knowing your thought process may help us in designing
>> future help/tutorials that cater more to people learning R for the
>> first time, things that seem obvious to those of us who have been using
>> the current documentation, apparently are not that obvious to some new
>> users (but also realize that the first place that you may think to look
>> may not even occur to some of us that learned computers in a different
>> time, see fortune(89) ).
>> For this particular problem in the original post, it is due to the
>> fact that I use an older R.
> I had hoped that you would give us a bit more about your learning process than just a list of a few help pages that you have read.  What tutorials or other documents have you read?, what classes have you taken?, etc.  Also what was your search process, where did you look? What search terms did you use? Etc.

I have read R-intro.pdf. But a few parts of R-intro.pdf did not make
sense to me when I was new to R, whether or not I read the document
in order.

Let's take section 6.3 'Data frames' as an example.

Suppose that someone has extensive experience with at least one major
programming language (such as C++ or Python) but is new to R. He wants
to pick up R quickly by working with it, and he needs to learn data
frames to finish a task under time pressure. He opens R-intro.pdf,
finds that Section 6.3 is on data frames, and starts reading it.

    'A data frame is a list with class "data.frame". There are
restrictions on lists that may be made into data frames, namely...'

Does this sentence make sense to him? No. He wants to understand what
a data frame is, but the document explains data frames in terms of
lists, and he does not yet know what a list is. Many people on the
mailing list would say that he should read about lists first. My
response is 'why does he have to?' If an example of a data frame were
shown at the beginning of Section 6.3, it would be a much better
explanation than the current one with its reference to 'list'.
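
To make the point concrete, the kind of opening example I have in mind might look like the sketch below. The data are made up, and the variable name people is only for illustration; it is not taken from R-intro.pdf.

```r
# Hypothetical data: three people, each with a name, an age and a group.
people <- data.frame(
  name  = c("Ann", "Bob", "Cy"),
  age   = c(31, 45, 27),
  group = c("a", "b", "a")
)

print(people)        # a table: one row per person, one column per variable
print(people$age)    # a single column, extracted by name
print(people[2, ])   # a single row
print(nrow(people))  # number of rows
print(ncol(people))  # number of columns
```

A reader coming from another language can see from this that a data frame is a table with named columns, without first having to know what a list is.
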

Then he goes on to Section 6.3.1, hoping to understand data frames.
But now 'read.table' jumps out, with a reference to a later chapter.
All he wants is an example of a data frame so that he can understand
what it is. Why does the document mention something new to him?

This problem of extensive cross-referencing is easy to find in quite a
number of online materials and even books on R. I'd suggest that all
documentation be written with as few cross-references as possible.

For your questions on classes: I've never taken any classes on R.

In terms of search, I used Google a lot when I learned other
languages. But Google pretty much doesn't work for R. For example, if
I search for 'grep r', most links on the first results page are not
relevant to R. Could somebody give me some tips on how to use Google
to find information relevant to R? In my opinion, the name 'R' is
cool but bad for searching, at least with Google.

> Now it is possible that you have only read the help pages (and that would explain a lot).  But the help pages are reference material, not tutorials.  Learning S/R through only the help pages is like learning statistics by only looking at the tables in the back of the textbook.

As I mentioned, many tutorials are not well written for newbies,
although they might be considered well written for R experts. But the
experts don't need to read them.

> Would you expect that every table for the standard normal distribution would have full instructions on how to use it, the theoretical derivation of the normal, and all the formulas that it might be used with in the header?  I would not, I expect the header to have enough information that I can tell whether the table contains the area to the left of x, the area to the right of x, the area between 0 and x, or the area between -x and x (at least those are the normal tables that I have seen), but not much more.  If the text is an introductory text, then somewhere else in the book I would expect some information on how to use the table, but that should be safely assumed for more advanced texts.  If it were a math/stat theory book then I would expect some detail in the book on the theoretical derivation, but not in the table header, and not in an intro stats book.
> If we were to add all the  detail that you are asking for to the current help pages, I expect in a few years you would be complaining about the amount of already known info that you have to wade through to get to the important nuggets in a help file.

If the information is organized hierarchically, I don't think this is
a problem. Many languages have both a user guide and a reference: the
user guide explains things in plain language, and the reference gives
the detailed usage of functions, etc. I think that it is better to put
relevant information in a central place rather than intersperse it in
multiple places, which makes it difficult to find.

> The extra detail in many cases is in other documents (intro, tutorials, etc.)

Again, introductions and tutorials should be improved for better readability.

> A while ago, in my quest to compensate for my wife being smarter than me through education and experience, I learned that when I see someone do something differently than I would have, it could mean that (1) I know something that they don't, (2) They know something that I don't, or (3) there is more than 1 legitimate way of doing things based on different approaches or ways of thinking.  In at least 2 of those 3 cases I have an opportunity to learn something.
> You may very well have some things to add to the documentation, but until you show that you have thought about it from the other side of view, you are unlikely to be taken seriously.  But it is easy to see from the many changes that are made to the documentation that the R core team does listen and make changes when properly justified.
>> But in general, the R help and examples in the help page should be
>> improved in terms of the structure. Just as when we write a paper, it is
>> better to have hierarchical descriptions (i.e., similar to
>> the flow of abstract -> introduction -> main text: each section that
>> appears later should give more detailed information, while earlier
>> sections should give readers the general ideas).
> The R/S documentation is some of the most structured and best structured that I have ever encountered.  Often I check a help page not for help on that function, but because I remember that there is a related function whose name I cannot remember; all I want is the See Also section.  I don't want that section any higher on the page; it is easy to scroll to the bottom, then go back up to find the See Also section.  Other times I am just trying to remember the order of some arguments (I can never remember for segments if it is x1, x2, y1, y2; or x1, y1, x2, y2), so I do a quick help and under the usage section is exactly what I need without having to wade through descriptions that I already know.

> Had you looked at the help for a current version of grep, in the usage section you would have seen the invert argument (there is your general information), then you could have scrolled down for a longer description.
>> The current way of organizing the help is less satisfactory.
>> Description->Usage->Arguments
>> This may be good if you already know what you should look for. But if
>> you are new to it, you will be easily lost.
> Yes, that is exactly correct.  The help is a reference, not a tutorial.  Can you imagine having to read through an entire textbook every time you wanted to be reminded of 1 formula, I will read the textbook when learning the subject, but when I want one formula I want a quick way to find it (table of contents, index, separate formula sheet).
> If you have not read "An Introduction to R" and some of the other tutorials, then do so now before reading any more of the reference files.
>> For example, many
>> functions are given in Usage without any explanation of the
>> differences between them until very late, or with no explicit
>> explanation at all. But having such descriptions of the differences
>> can help users choose the appropriate ones.
> The writers of the documentation need to balance the content based on the number of people viewing the documentation for the first time and the number that already know the information and just need a reminder of a few details.  As you progress with R you will find yourself doing much more of the latter and less of the former.

Providing both a user guide and a reference might be a better way to
strike that balance. For example, a package should provide a user
guide on the general aspects. But I have found a lot of packages that
only have a reference.

>> Some informative examples should be moved forward to help newbies
>> understand how to use each function, rather than put at the end of the
>> help page. Many examples in the help pages require previous knowledge
>> of other functions. In general, it is better to have the information
>> on each help page self-contained.
> No, vignettes, tutorials, books, and overviews should be self contained, reference pages should stick to the point (with some exception).  Additional information can be linked to through the See Also section or common sense.  The last example on the help page for the 'mean' function uses the USArrests data set, should the details on this dataset be included on the page for mean?  No, that would be redundant and give extra material to be slogged through that is not needed for most cases.  Most people can figure out enough about the dataset from the context, those that want more detail can look at the help for USArrests.  Is it really that hard to click on a link in See Also, or type another help command?

The problem is still that there are too many cross-references in the
current help. Let me give another example. The following lines are
from the example in help(data.frame). Why is 'sample' used in the
example? Couldn't it just use c() rather than sample()? A newbie will
have to understand what sample() does before he can understand the
following usage of data.frame.

    L3 <- LETTERS[1:3]
    (d <- data.frame(cbind(x=1, y=1:10), fac=sample(L3, 10, replace=TRUE)))
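
A version of the same example without sample() might look like the sketch below. It is my rewrite, not the text of the help page; the name fac and the particular letters in it are arbitrary choices.

```r
# The factor column is spelled out with c() instead of being drawn with
# sample(), so the result is identical on every run.
fac <- c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A")

# Same construction as in the help page: two numeric columns from cbind(),
# plus one column of letters.
(d <- data.frame(cbind(x = 1, y = 1:10), fac = fac))
```

The reader now only has to know c() to follow what goes into the data frame.
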

>> Another problem is not due to the help of R, but the design of R
>> itself --- there are many special cases in how a function behaves.
>> For example, x[1:2,] is a matrix but x[1,] is a vector.
>> > x=matrix(1:6,nrow=3)
>> > x[1:2,]
>>      [,1] [,2]
>> [1,]    1    4
>> [2,]    2    5
>> > x[1,]
>> [1] 1 4
> R is a mixture of a programming language and an interactive tool, this leads to a certain mix of science and magic.  It is not perfectly consistent, but it works.  One of the great things about open source is that if you want to you can change the source and have your copy do this if it is that important to you.  You can then share your changes and see if others prefer your version (just make sure you distinguish your version from the original).  Note that many (if not all) the most popular languages have some 'features' that violate the rules to make certain things easier (but end up meaning that you have to be very explicit with the computer in other cases).  I am not arguing that popular = correct, or popular = best, but I would argue that popular = popular and there is probably a reason for that.
>> I know somebody who has worked with R for over 10 years and doesn't
>> know why (it may be because he doesn't care). But I had to ask the
>> mailing list to learn that I have to use the option 'drop' in
>> order to get a matrix as the returned value.
>> > x[1,,drop=FALSE]
>>      [,1] [,2]
>> [1,]    1    4
> Or you could have read ?'['
> Though section 5.2 of An Introduction to R could use a reference to this, a well reasoned request to the R development list would probably be accepted.
>> If I were the original designer of R, I would make the interface more
>> orthogonal (this is the usual way to reduce complexity in software).
>> For example, [] would always return a matrix, if I want to reduce its
>> dimension, I will have another function to do so.
> Let us know when you have your program ready; if it is better, people will switch to it.  But in the mean time I am going to keep using R (and if I can't get yours to do what I want as easily as R, I will keep using R).
> I believe that John Chambers has admitted that if he had known then all that he knows now, he would have done some things different (I don't know if that is on his list or not), but it is a bit late now unless someone wants to start again on a new program.  I am still waiting for my future self to use the TimeTravel package to send me a copy of that package (and the esp package) from the future.
>> Having many special cases might be convenient in some situations.
>> But they may also cause confusion and some delicate bugs
>> that are hard to figure out, especially for newbies.
> Anything worth learning takes some effort to figure out.  I remember seeing a program once that only had one big button in the middle that did nothing.  Really easy program to use, but I prefer the effort to learn something more useful.  In fact, I hope that there is always something more for me to learn about R/S, after 22 years using variants of S, I am still not anywhere close to knowing everything.

I assume that the majority of R developers are statisticians rather
than software engineers. To a well-trained software engineer, it is
clear that having fewer special cases makes code more readable, more
understandable and hence more maintainable. You could add an entry on
'drop' to R-intro.pdf. But if the development philosophy is to have
many special cases (and it is like that, as I have observed in many R
packages), the problem of confusing new users will not be solved.
Although there are a lot of historical reasons why R is the way it
currently is, I would say that it could be better if its development
were better organized. If it could be better but isn't, why don't we
do something right now to make it better in the future?
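
Here is a small sketch of the kind of delicate bug I mean with 'drop'. The helper functions count_rows and count_rows_safe are hypothetical; I made them up only to show how the special case can surprise a caller.

```r
x <- matrix(1:6, nrow = 3)

# Hypothetical helper: report how many rows were selected.
count_rows <- function(m, rows) {
  nrow(m[rows, ])
}

print(count_rows(x, 1:2))  # two rows selected: the result is still a matrix
print(count_rows(x, 1))    # one row selected: x[1, ] became a plain vector,
                           # which has no dim attribute, so nrow() is NULL

# The same helper with drop = FALSE keeps the matrix shape for any selection.
count_rows_safe <- function(m, rows) {
  nrow(m[rows, , drop = FALSE])
}

print(count_rows_safe(x, 1))
```

The first helper works in every test with two or more rows and only fails when a single row happens to be selected, which is exactly the sort of case a new user will not think to check.
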

C++ has its standards committee, which discusses what features should
be added to the standard. A lot of developers can contribute code to
Boost (www.boost.org, similar to CRAN), and contributions are subject
to review. Good and stable packages in Boost can be added to the
standard later on, based on the decision of the committee. Maybe R
development could be improved by borrowing from this development
style.

>> The above are my current thoughts. Let me know if it makes sense to you
>> or not.
> Have you read "An Introduction to R"? it looks like no, but I am not sure.  Have you read any of the other tutorials, books, overviews, vignettes, etc.?  If not, do so, do so now!
>> >> -----Original Message-----
>> >> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Peng Yu
>> >> Sent: Friday, December 04, 2009 12:43 PM
>> >> To: r-help at stat.math.ethz.ch
>> >> Subject: Re: [R] grep() exclude certain patterns?
>> >>
>> >> On Fri, Dec 4, 2009 at 11:54 AM, Duncan Murdoch
>> <murdoch at stats.uwo.ca>
>> >> wrote:
>> >> > On 04/12/2009 12:52 PM, Peng Yu wrote:
>> >> >>
>> >> >> The external grep program has an option -v to select non-matching
>> >> >> lines. I'm wondering how to exclude certain patterns in grep()
>> >> >> in R?
>> >> >>
>> >> >
>> >> > ?grep
>> >>
>> >> I don't see which argument to use.
>> >>
>> >> ______________________________________________
>> >> R-help at r-project.org mailing list
>> >> https://stat.ethz.ch/mailman/listinfo/r-help
>> >> PLEASE do read the posting guide http://www.R-project.org/posting-
>> >> guide.html
>> >> and provide commented, minimal, self-contained, reproducible code.
>> >
